Bring a little chaos

Chaos Engineering leads to higher resilience

Posted by Graham

Chaos Engineering sounds counterintuitive; it sounds like the complete opposite of getting the most reliable software deployment possible out there and into Production. But it works, and it works because it forces people to think differently.

What is Chaos Engineering?

The Principles of Chaos Engineering website defines it as “the discipline of experimenting on a system, in order to build confidence in the system’s capability to withstand turbulent conditions in production”.

At first glance this seems simple. We try this in test systems all the time: we look for issues and then we fix them, job done.

Not exactly.

The key word in that definition, and one that is somewhat underplayed, is “production”. In Chaos Engineering the experiments are run on the Production system, while it is live and while it is serving real customers.

What?! Why?

No other system behaves quite like a Production system. It does not matter how carefully a test environment is crafted; some of the variables in the system will be different. Deploying infrastructure as code certainly helps with this, and some companies (and systems) have gone as far as recording the traffic to a system and replaying it.

But it is still different. Round-trip times to different clients vary in the real world; in test systems they are (mostly) the same. Random pieces of infrastructure fail in the real world, and those pieces might be near the client’s end rather than yours, subtly changing the order in which blocks arrive.

And that is before we begin to explore the nightmare of bugs that appear in Production but nowhere else. This Monkey User strip illustrates it perfectly.

How does it work?

Events are automatically and randomly generated in the system to simulate real-world occurrences: network appliances might go offline, server components may fail, or traffic might spike.
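
To make that concrete, here is a minimal sketch of a chaos agent in Python. The fault actions, their names and the timings are illustrative assumptions rather than a real tool; in practice you would reach for something like Chaos Monkey or the Chaos Toolkit, with a defined blast radius and a way to halt the experiment quickly.

```python
import random
import time

# Hypothetical fault-injection actions. Real tools provide equivalents;
# these stubs exist purely to illustrate the shape of the idea.
def kill_random_instance():
    print("simulating: terminating a random application instance")

def add_network_latency(ms):
    print(f"simulating: adding {ms}ms of latency to a network link")

def spike_traffic(multiplier):
    print(f"simulating: driving {multiplier}x normal request volume")

FAULTS = [
    lambda: kill_random_instance(),
    lambda: add_network_latency(random.choice([100, 250, 500])),
    lambda: spike_traffic(random.choice([2, 5, 10])),
]

def chaos_loop(min_wait_s=600, max_wait_s=3600):
    """Run forever, triggering one random fault at an unpredictable interval."""
    while True:
        time.sleep(random.uniform(min_wait_s, max_wait_s))
        random.choice(FAULTS)()

if __name__ == "__main__":
    chaos_loop(min_wait_s=1, max_wait_s=5)  # short intervals for demonstration
```

The important property is the unpredictability: nobody on the team knows which fault will fire next, or when.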

The ideal response is that an end user will not see any impact. While that is unlikely to be the outcome on day one, it is surprising how quickly this ideal is reached.

Why does it change people's thinking?

A traditional way of thinking about system reliability is to concentrate on the MTBF (Mean Time Between Failure). This prioritises system uptime but ignores how long the system is unavailable when a problem does occur. This is great for a production line turning out goods that sit as inventory in a warehouse but doesn’t work so well with digital systems.

Example

I have worked with a customer whose main CRM system became “unreliable” due to an issue in production. Entire call centres became disconnected on a seemingly random basis, and it eventually got to the point where alternative systems had to be considered because the customer reps could not see who their customers were. The problem persisted for over a month.

But it was only one “failure”, so the MTBF still looked healthy.

Chaos Engineering makes us think less about MTBF and more carefully about MTTR (Mean Time To Recover). In the example above the MTTR would be severely affected, and that more closely represents the impact on customers, on staff, and on the business.
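
A rough illustration of how differently the two metrics read (the figures are invented to mirror the CRM story above, not taken from it):

```python
from datetime import timedelta

# Illustrative, made-up incident history for one year of operation:
# a single outage that dragged on, as in the CRM example above.
observation_period = timedelta(days=365)
outages = [timedelta(days=35)]  # one "failure", lasting over a month

total_downtime = sum(outages, timedelta())
uptime = observation_period - total_downtime

mtbf = uptime / len(outages)          # Mean Time Between Failures
mttr = total_downtime / len(outages)  # Mean Time To Recover

print(f"MTBF: {mtbf.days} days  <- looks healthy: only one failure all year")
print(f"MTTR: {mttr.days} days  <- the number customers actually felt")
```

One failure in a year gives a very comfortable MTBF; the MTTR is the number the call centres actually lived with.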

How is the changed thinking expressed?

The software development team quickly shifts its mindset from trying to build a “perfect” artifact to building artifacts that fix themselves, or that can simply be thrown away and replaced, in the spirit of the “Cattle vs Pets” philosophy.

They do this because the random nature of Chaos Engineering means that, unless they make their architecture resilient, they will repeatedly, but unpredictably, have to deal with an outage. This is very disruptive at both a personal and a professional level, so the team as a whole is highly motivated to eliminate the reasons it happens.

As this mindset changes, the way the team approaches new functions, or modifications to existing ones, changes too: there is an increased focus on testing, on monitoring, and on all the operational aspects that allow problems to be both detected and diagnosed. Architectures change from single points of failure to redundant clusters of services, with circuit breakers, fallbacks and rich user stories underpinning the changes.
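
As a small illustration of the kind of change involved, here is a minimal sketch of a circuit breaker with a fallback, in Python. The thresholds, the class name and the wrapped calls are assumptions made for the example; a real system would normally use an established resilience library rather than hand-rolling this.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency and use a
    fallback until a cool-off period has passed (illustrative thresholds)."""

    def __init__(self, max_failures=3, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, func, fallback, *args, **kwargs):
        # If the circuit is open, short-circuit to the fallback until cool-off.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback(*args, **kwargs)
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = func(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback(*args, **kwargs)


# Hypothetical usage: protect a flaky CRM lookup with a cached fallback.
breaker = CircuitBreaker()

def lookup_customer(customer_id):
    raise ConnectionError("CRM backend unavailable")  # simulated outage

def cached_customer(customer_id):
    return {"id": customer_id, "name": "(from local cache)"}

print(breaker.call(lookup_customer, cached_customer, customer_id=42))
```

The design choice it captures is simple: when a dependency keeps failing, stop hammering it and serve a degraded but useful answer instead.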

It is disruptive, and it is a scary approach to start with, but it is ultimately a gateway to a much higher standard of deployed software.

"'Pick-Up Sticks'" by Bulldog Pottery - Bruce Gholson and Samantha Henne is licensed under CC BY-ND 2.0
