20 Feb Building Confidence in Healthcare Systems Through Chaos…
Chesser: I’m Carl Chesser. This talk goes over the story of what we did at Cerner: introducing chaos engineering and chaos engineering principles, with a strong focus on how we can build confidence in our systems. As Michelle mentioned earlier, a lot of this could be referred to as continuous resilience, but we’ll be talking about chaos engineering in this talk.
Before I begin, I’ll tell you a little bit about myself. I usually like to explain to people where I’m from, because a lot of people do not know what Cerner is. Cerner is a large healthcare IT company, one of the leading healthcare IT companies in the world. We have a little under 30,000 people, so we have a pretty large engineering workforce. We work on a lot of things: electronic medical records; device integrations such as smart pumps and smart rooms; and, at more aggregate levels, population health, assessing populations across all that data.
As part of that, we’re in Kansas City, Missouri. Usually when I bring up that I’m from Kansas City, people say, “I’ve heard of Kansas City. That’s in Kansas.” That’s true, but there’s also Kansas City, Missouri. It’s right on the border, in the center of the continental United States. Cerner is a very large company in Kansas City. We have campuses throughout the city. I’m on the Missouri side, but there are also campuses on the Kansas side. We’ll actually drive across the state border just going between campuses.
Another interesting fact about Kansas City is that it’s known as the City of Fountains. There are about 200 operating fountains in Kansas City. If you ever come to visit, you might say, “They have a lot of problems in this town.” No, these are intended. They’re called fountains. Another interesting point about Kansas City is, as most of you have probably heard, they have really good barbecue. Everyone from Kansas City normally has a very strong opinion of what good barbecue is. There’s a large event that happens every year in the fall called the American Royal. It’s the world competition for barbecue. Everyone who wants to be the best at barbecue comes to Kansas City for that event.
My opinion about barbecue, if you were to visit Kansas City, is to go to Joe’s Kansas City Barbecue. It used to be called Oklahoma Joe’s, but of course that made it confusing, being in Kansas City. Now it’s Joe’s Kansas City. If you go there, order the Z-man and go to their gas station location. I guarantee you’ll be happy with that choice. Now you have one new thing out of this talk that you didn’t think you were going to learn.
For this talk, there’s a bit of a lineage to what we’ll go through. First, I’m going to start off with a story. I’m going to talk through the things that we worked through at Cerner, and it starts with a lot of the problems we were facing. I’ll briefly talk through certain technologies we were facing and trying to change, but you have probably seen plenty of those examples: substitute technology X for technology Y. It’s really about how we were trying to evolve these very complex systems. As I talk through this story, I’m going to briefly cut away to a concept we’ll refer to as traffic management patterns. Then we’ll conclude the story with the things we went on to change, and I’m going to share the lessons learned from how we had to start introducing chaos engineering. One set was around the challenges of doing it in a large corporation. The other was around the issues of being in an industry that’s very sensitive because of these types of critical workloads. Hopefully, at the end of this, you’re going to have several lessons that we learned that you can apply back at your own companies.
First, about our story. I don’t think this story is very unique, because a lot of you can relate to the challenges of technology. Technology keeps changing over time. Cerner is a very large company, and we just hit our 40th anniversary. We’ve been around for a while, we have lots of different technologies, and we have a lot of systems that are adding value. Since they’re adding value, they keep living. Some of those technologies have been around for quite some time and have grown quite complex. As we’ve been working through this, my role has been in a group called platform engineering. We work on a lot of the cross-cutting technologies at Cerner, so you get exposed to a lot of these different types of things you have to work on.
The first case was a challenge. The challenge was that we had a lot of these types of service deployments in a very complex environment, and they were serving a highly critical workload. It’s as if you were to open up a door and see the inner workings of a clock, and someone says, “Go change that gear.” Everyone says, “I don’t know. That seems like it would cause a massive compounding issue.” That was because you knew all these systems were working quite well for what they supported, but there was a lot of knowledge bound up in how they came to be and what they were supporting.
In this example, we had a lot of different ways we deployed services over time, and we were unifying those to make sure they were being evolved into the future that we wanted. One example is, we had a lot of service deployments. We had these Java service workloads using IBM WebSphere Application Server. There’s a lot of complexity with that technology, and we had a lot of very experienced operators because we hosted all this.
One interesting fact with Cerner is that we host a lot of our clients. In Kansas City, we have a large fiber network, so all these hospitals remote into our data centers to access that software. We know how to operate that software really well, but when you go look at what’s running in production, there’s a lot of tribal knowledge that was built into how to optimally run that workload. We wanted to change that workload, but we realized there was a lot more than just the code that’s running there. There’s a lot that’s about that environment.
With that problem in mind, we knew that whatever we changed, we wanted to build incrementally while allowing for change. We wanted a simple way to understand the system, because we were going to keep discovering new things about this very complex existing environment, and we wanted to carry that environment forward. Instead of coming in and saying, “Yes, we want to do Kubernetes; forget all those old deployments, we’re not doing them anymore,” we said, “No, we have to think about how we transition that existing technology to carry it forward.” We had to figure out the path for it, because we didn’t want multiple ways of doing deployments.
One way we started going about this, which I found to be very valuable, is we created our own little declaration of what a service workload is. As with most things now, it has to be in YAML, so it’s a declarative thing that we keep in a central repository. It was just a collection of all the facts about a service. A service is declared in this way, so we know all the facts about it. We knew about the humans associated with that service; we called them service owners. We had this declaration of what a service was, and we had a way of regenerating or rebuilding a system based off that.
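To make the idea of a declared service concrete, here is a minimal sketch in Python. It is not Cerner's actual format: the field names (`name`, `owners`, `runtime`, `port`), the two runtime labels, and the generated output shapes are all assumptions for illustration. The input dictionary stands in for what a YAML parser would return from a file in the central repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceProfile:
    """Hypothetical shape of one service declaration (field names are assumptions)."""
    name: str
    owners: tuple  # the humans tied to the service ("service owners")
    runtime: str   # assumed labels: "websphere" or "container"
    port: int = 8080


def load_profile(raw: dict) -> ServiceProfile:
    """Validate a parsed declaration, e.g. the dict a YAML parser would return."""
    missing = {"name", "owners", "runtime"} - raw.keys()
    if missing:
        raise ValueError(f"profile is missing required facts: {sorted(missing)}")
    if not raw["owners"]:
        raise ValueError("a service must declare at least one owner")
    return ServiceProfile(
        name=raw["name"],
        owners=tuple(raw["owners"]),
        runtime=raw["runtime"],
        port=raw.get("port", 8080),
    )


def generate_deployment(profile: ServiceProfile) -> dict:
    """Regenerate a deployment description from the declared facts.

    The output shapes are invented; the point is that the same profile
    can drive different underlying deployment technologies.
    """
    if profile.runtime == "container":
        return {"kind": "container", "image": f"{profile.name}:latest", "port": profile.port}
    if profile.runtime == "websphere":
        return {"kind": "websphere", "application": f"{profile.name}.ear", "port": profile.port}
    raise ValueError(f"unknown runtime: {profile.runtime}")
```

Because the profile only records facts about the service, swapping the deployment technology means changing the generator rather than every declaration.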
If we changed some underlying technology, we were still using our own specification of what a service was and generating those things from it. These service profiles also served as a means of communication in many different ways. We used them to communicate out to teams through our scorecarding of a service, so we can assess what those services are and communicate that to the service owners tied to them.
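A scorecard built on top of such profiles could look roughly like the sketch below. The talk does not say which criteria were scored, so the three checks here (`owners`, `health_check`, `runbook`) are purely hypothetical, as are the function names and the report layout.

```python
def score_service(profile: dict) -> dict:
    """Assess one service profile against a few hypothetical checks."""
    checks = {
        "has_owner": bool(profile.get("owners")),
        "has_health_check": bool(profile.get("health_check")),
        "has_runbook": bool(profile.get("runbook")),
    }
    return {
        "service": profile.get("name", "<unnamed>"),
        "owners": list(profile.get("owners", [])),
        "checks": checks,
        "passed": sum(checks.values()),  # True counts as 1
        "total": len(checks),
    }


def format_scorecard(card: dict) -> str:
    """Render a scorecard as plain text addressed to the service owners."""
    lines = [f"Scorecard for {card['service']}: {card['passed']}/{card['total']} checks passed"]
    lines.append("Owners: " + (", ".join(card["owners"]) or "none declared"))
    for check, ok in card["checks"].items():
        lines.append(f"  [{'x' if ok else ' '}] {check}")
    return "\n".join(lines)
```

Since the owners are part of the same declaration that is scored, the report always knows whom to address.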
As we were building these systems, we realized we were seeking more and more availability features, and that was just growing more and more complexity in the system. As in that earlier example, we were already dealing with IBM WebSphere Application Server deployments, and we wanted to introduce container deployments for the newer services we were building. We were trying to merge those together. All that technology composed together was obviously more than one person could hold in their head, and it became difficult to understand how far we could take this, and how safely we could manage that larger amount of workload.
As we were going through it, we realized we were entering a new mode where we didn’t understand all those boundaries of safety, because we were hitting lots of complexity at once, but we wanted to make sure we could actually change that. How do we live in this world where we know we’re getting to more complex states, because we wanted to have higher…