Introduction to Chaos Engineering

Plus, lessons from 8 years of Kubernetes, how Slack's CPO makes product decisions, and more.

Hey Everyone!

Today we’ll be talking about

  • Introduction to Chaos Engineering

    • In 2021, Facebook lost over $80 million because of 7 hours of downtime

    • Large distributed systems can be incredibly tricky to reason about and debug.

    • Chaos Engineering applies principles from the scientific method to test your backend and find potential bugs/issues

    • We’ll talk about these principles and also delve into how big tech companies are implementing Chaos Engineering

  • Tech Snippets

    • Lessons from 8 Years of Kubernetes

    • How I Put My Whole Life into a Single Database

    • Design and Evaluation of IPFS

    • How Slack’s CPO Makes Product Decisions

    • Software Engineering Culture Metrics

    • Distributed Systems for Fun and Profit

An Introduction to Chaos Engineering

As “software eats the world”, the cost of downtime is getting higher and higher for companies. Gartner estimates the average cost of IT downtime at ~$336,000 per hour. This scales up massively for big tech companies.

Facebook was down for 7 hours in 2021 due to a DNS issue, and they earned little to no advertising revenue during that period. They make ~$13 million an hour from ads on Facebook, Instagram, etc., so the incident cost Facebook more than $80 million (7 hours × ~$13 million per hour ≈ $91 million). Not good.

The largest tech companies are also operating massive distributed systems with thousands of services that are communicating over the network. These are built on cloud services with hundreds of thousands of computers, hard drives, routers, etc. involved.

When you have a complex system like this, it becomes impossible to simply reason your way through it. Testing and reasoning about an entire microservices-oriented backend is generally much harder than doing the same for a monolith. There are far more things that can go wrong, and building a complete mental model of the hundreds or thousands of different services is close to impossible.

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”

Leslie Lamport

To understand and learn about possible failure modes in these massive networks, we can do what researchers do for other complex systems (the human body, the climate, the biosphere, etc.): apply the scientific method to test the system.

Chaos Engineering is using the scientific method to learn more about your system and how it reacts under various conditions.

Chaos Engineering has become quite popular amongst the largest companies in the world. A paper by researchers at CMU, published at the ACM Symposium on Cloud Computing, found that at least 12 of the 50 largest corporations in the US have publicly spoken about their use of Chaos Engineering for improving reliability.

Philosophy Behind Chaos Engineering

Chaos Engineering is the discipline of applying the scientific method to your distributed system to learn more about possible failures.

The goal is to run experiments that follow four steps (there’s a minimal code sketch of this loop after the list):

  1. Measure the Steady State - Define metrics that indicate how your system should perform when things are normal. Quantify this with things like queries per second, 99th percentile latency, error rate, time to first byte, etc.
    You want to identify what key performance indicators (KPIs) are relevant to your application and measure them. You should make sure that you can continue to measure these KPIs while running the experiment.

  2. Hypothesize how the system will perform under a disruption - Establish a control group, an experiment group, and the disruptions you want to test. Examples of disruptions are terminating virtual machines, severing network connections, filling up the disk space of an instance, adding latency to backend service calls, packet loss, and more (simulate any faults you might face). Create a hypothesis for how the KPIs of your experiment group will be affected by the disruption.

  3. Introduce Chaos - Run the disruptions on your experiment group. Ideally, this should be done in production on a small subset of traffic in order to be the most useful. However, you should obviously also minimize business impact. This might mean excluding requests that have high business value from being selected into the experiment group or retrying failed requests and putting them in the control group. Many companies (like Netflix for example) also allow users to opt out of being placed in experiment groups.

  4. Check Hypothesis - See how your hypothesis about the KPIs held up. Was there a big difference in KPIs between the control group and the experiment group? If so, then you should analyze whether engineering time should be dedicated to fixing that possible failure mode.
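
To make the four steps concrete, here’s a minimal sketch of what one experiment loop could look like in Python. Everything in it is a made-up stand-in for illustration: `measure_kpis`, `inject_fault`, `remove_fault`, the group names, and the latency threshold are assumptions, not the API of any real chaos tool.

```python
import random

# Hypothetical helpers: stand-ins for a real metrics pipeline and fault
# injection tooling, used only to illustrate the shape of the loop.
def measure_kpis(group):
    """Pretend to sample KPIs (p99 latency, error rate) for a group."""
    return {"p99_latency_ms": random.uniform(80, 120),
            "error_rate": random.uniform(0.0, 0.01)}

def inject_fault(group, fault):
    print(f"injecting {fault!r} into {group}")

def remove_fault(group, fault):
    print(f"rolling back {fault!r} on {group}")

def run_experiment(control, experiment, fault, max_p99_delta_ms=50):
    # 1. Measure the steady state before touching anything.
    baseline = (measure_kpis(control), measure_kpis(experiment))

    # 2. Hypothesis: p99 latency in the experiment group stays within
    #    max_p99_delta_ms of the control group while the fault is active.

    # 3. Introduce chaos in the experiment group only, and always roll it back.
    inject_fault(experiment, fault)
    try:
        during_control = measure_kpis(control)
        during_experiment = measure_kpis(experiment)
    finally:
        remove_fault(experiment, fault)

    # 4. Check the hypothesis by comparing the control and experiment groups.
    delta = during_experiment["p99_latency_ms"] - during_control["p99_latency_ms"]
    return {"baseline": baseline,
            "p99_delta_ms": delta,
            "hypothesis_held": delta <= max_p99_delta_ms}

result = run_experiment("control-pool", "canary-pool",
                        fault="add 100ms latency to checkout-service calls")
print(result)
```

In a real setup, the KPI measurements would come from your observability stack, the fault would be applied by a chaos tool, and you’d compare the groups with proper statistics rather than a single threshold.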

The key principles of Chaos Engineering are

  • Run Experiments in Production - When starting out with Chaos experiments, you should use a staging environment. After getting experience and gaining confidence, you should eventually start running your Chaos experiments in production. The reason is that it simply isn’t possible to replicate everything in a staging environment (if you’re operating at the massive scale where Chaos Engineering makes sense).

  • Minimize Blast Radius - Experimenting in production can cause customer pain, so it’s your responsibility to make sure the experiments are minimized and contained. Run the experiment on a small selection of traffic, database replicas, servers, etc. You should know your blast radius: the set of things the experiment can affect. Make the blast radius as small as possible so that you narrow down what you’re testing for. This makes it easier to do a root cause analysis afterwards and also reduces potential business impact (there’s a traffic-sampling sketch after this list).

  • Automate Experiments to Run Continuously - Create small, focused Chaos experiments and then automate them to run continuously. Eventually, as teams grow comfortable with the Chaos framework, these experiments can be run without warning. This helps ensure that teams implement best practices and adhere to them. However, this should be done with a small blast radius, and tests should be easy to roll back.
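
As a rough illustration of keeping the blast radius small, here’s a sketch of how requests could be deterministically sampled into a ~1% experiment group while excluding opted-out users and high business-value requests. The `Request` class, the 1% fraction, and the hashing scheme are all assumptions made up for this example.

```python
import hashlib

EXPERIMENT_FRACTION = 0.01  # keep the blast radius to roughly 1% of traffic

class Request:
    def __init__(self, request_id, user_opted_out=False, is_high_value=False):
        self.request_id = request_id
        self.user_opted_out = user_opted_out
        self.is_high_value = is_high_value

def bucket(request_id):
    """Hash the request id into a stable value in [0, 1)."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000

def assign_group(request):
    # Never experiment on opted-out users or high business-value requests.
    if request.user_opted_out or request.is_high_value:
        return "control"
    return "experiment" if bucket(request.request_id) < EXPERIMENT_FRACTION else "control"

print(assign_group(Request("req-12345")))                      # usually "control"
print(assign_group(Request("req-67890", is_high_value=True)))  # always "control"
```

Hashing the request id (rather than picking randomly on every call) keeps the assignment stable across retries, which makes the results easier to analyze afterwards.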

The point is to introduce faults into your system on purpose in order to detect bugs and failures with the ultimate goal of increasing confidence in your distributed system.

Fault Injection Testing vs Chaos Engineering

There’s a lot of overlap between Chaos Engineering and Fault Injection Testing. The main difference is that Chaos Engineering is meant to generate new information and find potential “unknown unknowns”, while fault injection is meant to test a specific, pre-defined failure condition.

Running continuous Chaos experiments in production, where you do things like force system clocks out of sync with each other or inject latency into a particular service, is meant to generate new information about how the system operates under stress and faults, and to uncover unknown unknowns that could have led to failures.
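
As one example, a targeted latency disruption could be as simple as a wrapper that occasionally delays calls to a downstream service. The sketch below is hypothetical: the feature flag, the 10% probability, the 200ms delay, and the `call_recommendations_service` stub are assumptions, not how any particular chaos tool works.

```python
import random
import time
from functools import wraps

LATENCY_INJECTION_ENABLED = True   # would be flipped on only for the experiment group
INJECTION_PROBABILITY = 0.10       # affect roughly 10% of calls
INJECTED_DELAY_SECONDS = 0.200     # add 200ms of latency

def with_injected_latency(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        if LATENCY_INJECTION_ENABLED and random.random() < INJECTION_PROBABILITY:
            time.sleep(INJECTED_DELAY_SECONDS)  # simulate a slow downstream dependency
        return func(*args, **kwargs)
    return wrapper

@with_injected_latency
def call_recommendations_service(user_id):
    # Stand-in for a real RPC/HTTP call to a backend service.
    return {"user_id": user_id, "recommendations": ["a", "b", "c"]}

print(call_recommendations_service("user-42"))
```

Used as a one-off check of a known failure scenario, this is fault injection testing; run continuously against a slice of live traffic with KPIs compared between groups, it becomes part of a Chaos experiment.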

This is the first part of our article on Chaos Engineering.

In the second part, we’ll go through

  • Implementation and tools like AWS Fault Injection Simulator, Chaos Monkeys, etc.

  • How Facebook and LinkedIn use Chaos Testing

  • Audible and Twitch’s usage of Chaos Engineering

and more.

You can read the full article by subscribing to Quastor Pro.

Tech Snippets