Introduction to Chaos Engineering

Plus, lessons from 8 years of Kubernetes, software engineering culture metrics and more.

Hey Everyone!

Today we’ll be talking about

  • Introduction to Chaos Engineering

    • In 2021, Facebook lost over $80 million because of roughly 7 hours of downtime

    • Large distributed systems can be incredibly tricky to reason about and debug.

    • Chaos Engineering applies principles from the scientific method to test your backend and find potential bugs/issues

    • We’ll talk about these principles and also delve into how big tech companies are implementing Chaos Engineering

  • Tech Snippets

    • Lessons from 8 Years of Kubernetes

    • How I Put My Whole Life into a Single Database

    • Software Engineering Culture Metrics

    • Distributed Systems for Fun and Profit

The Enterprise Ready Conference is a one-day event in SF, bringing together product and engineering leaders shaping the future of enterprise SaaS. 

The event features a curated list of speakers with direct experience building for the enterprise, including OpenAI, Vanta, Checkr, Dropbox, and Canva.

Topics include advanced identity management, compliance, encryption, and logging — essential yet complex features that most enterprise customers require.

If you are a founder, exec, PM, or engineer tasked with the enterprise roadmap, this conference is for you. You’ll get detailed insights from industry leaders that have years of experience navigating the same challenges you face today. And best of all, it’s completely free since it’s hosted by WorkOS.

Spots are filling up quickly. Make sure to request an invite on the website.

sponsored

An Introduction to Chaos Engineering

As “software eats the world”, the cost of downtime is getting higher and higher for companies. Gartner estimates the average cost of IT downtime at ~$336,000 per hour. This scales up massively for big tech companies.

Facebook was down for roughly 7 hours in 2021 after a faulty network configuration change ultimately took its DNS servers offline, and it earned little to no advertising revenue during that period. Facebook makes ~$13 million an hour from ads on Facebook, Instagram, etc., so that incident cost the company more than $80 million. Not good.

The largest tech companies are also operating massive distributed systems with thousands of services that are communicating over the network. These are built on cloud services with hundreds of thousands of computers, hard drives, routers, etc. involved.

When you have a complex system like this, you can no longer just reason your way through it. There are far too many things that can go wrong, and building a complete mental model of hundreds or thousands of interacting services isn't feasible.

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”

Leslie Lamport

To understand and learn about possible failure modes in these massive networks, we can do what researchers do for other complex systems (the human body, the climate, the biosphere, etc.): apply the scientific method and run experiments on the system.

Chaos Engineering is using the scientific method to learn more about your system and how it reacts under various conditions.

Chaos Engineering has become quite popular amongst the largest companies in the world. A paper published in the ACM Symposium on Cloud Computing by researchers at CMU found that of the 50 largest corporations in the US, at least 12 have publicly spoken about their use of Chaos Engineering for improving reliability (this paper was published in November 2021 so it’s probably gone up since then).

Philosophy Behind Chaos Engineering

Chaos Engineering is the discipline of applying the scientific method to your distributed system to learn more about possible failures.

The goal is to run experiments that follow four steps (a minimal sketch of the full loop follows the list):

  1. Measure the Steady State - Define metrics that indicate how your system should perform when things are normal. Quantify this with things like queries per second, 99th percentile latency, errors, time to first byte, etc.
    You want to identify what key performance indicators (KPIs) are relevant to your application and measure them. You should make sure that you can continue to measure these KPIs while running the experiment.

  2. Hypothesize how the system will perform under a disruption - Establish a control group, an experiment group and disruptions you want to test. Examples of disruptions are terminating virtual machines, severing network connections, filling up disk space of an instance, adding latency to backend service calls, packet loss, and more (simulate any faults you might face). Create a hypothesis for how the KPIs of your experiment group will be affected from the disruption.

  3. Introduce Chaos - Run the disruptions on your experiment group. Ideally, this should be done in production on a small subset of traffic in order to be the most useful. However, you should obviously also minimize business impact. This might mean excluding requests that have high business value from being selected into the experiment group or retrying failed requests and putting them in the control group. Many companies (like Netflix for example) also allow users to opt out of being placed in experiment groups.

  4. Check Hypothesis - See how your hypothesis about the KPIs held up. Was there a big difference in KPIs between the control group and experimental group? If so, then you should analyze whether engineering time should be dedicated to fixing that possible failure mode.
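
To make the loop concrete, here's a minimal sketch in Python of how those four steps might fit together. Everything here is hypothetical: the KPI names, the fault-injection helpers, and the 10% tolerance are placeholders for whatever your own metrics pipeline and chaos tooling actually expose.

```python
import random

# --- Hypothetical stand-ins for your own metrics and fault-injection tooling ---

FAULTS = {}  # traffic group -> currently injected fault

def get_kpis(group):
    """Pretend to sample KPIs for a traffic group (fake p99 checkout latency in ms)."""
    penalty = FAULTS.get(group, {}).get("millis", 0)
    return {"p99_checkout_ms": 250.0 + penalty + random.uniform(-10, 10)}

def inject_latency(group, service, millis):
    """Pretend to add artificial latency to calls from `group` to `service`."""
    FAULTS[group] = {"service": service, "millis": millis}

def remove_fault(group):
    FAULTS.pop(group, None)

# --- The four-step experiment loop ---

def run_chaos_experiment():
    # 1. Measure the steady state before doing anything.
    baseline = get_kpis("experiment")
    print(f"steady state p99: {baseline['p99_checkout_ms']:.0f}ms")

    # 2. Hypothesis: 100ms of extra latency on calls to a (hypothetical)
    #    recommendations service should not move p99 checkout latency
    #    by more than 10% relative to the control group.
    tolerance = 1.10

    # 3. Introduce chaos on the experiment group only (small blast radius),
    #    and always roll the fault back, even if measurement fails.
    inject_latency("experiment", service="recommendations", millis=100)
    try:
        during_control = get_kpis("control")
        during_experiment = get_kpis("experiment")
    finally:
        remove_fault("experiment")

    # 4. Check the hypothesis by comparing the experiment group to the control group.
    ratio = during_experiment["p99_checkout_ms"] / during_control["p99_checkout_ms"]
    if ratio > tolerance:
        print(f"hypothesis rejected: p99 degraded {ratio:.2f}x vs control -- worth investigating")
    else:
        print(f"hypothesis held: p99 ratio {ratio:.2f}x is within tolerance")

if __name__ == "__main__":
    run_chaos_experiment()
```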

The key principles of Chaos Engineering are:

  • Run Experiments in Production - When starting Chaos experiments, you should use a staging environment. After getting experience and gaining confidence, you should eventually start running your Chaos experiments in production. The reason is that it simply isn't possible to replicate everything in a staging environment (if you're operating at the massive scale where Chaos Engineering makes sense).

  • Minimize Blast Radius - Experimenting in production can cause customer pain, so it's your responsibility to keep the experiments contained. Run the experiment on a small selection of traffic, database replicas, servers, etc. You should know your blast radius: the set of things the experiment can affect. Make the blast radius as small as possible so that you narrow down what you're testing for. This makes it easier to do a root cause analysis afterwards and it also reduces potential business impact (a sketch of one way to bound the blast radius follows this list).

  • Automate Experiments to Run Continuously - Create small, focused Chaos experiments and then automate them to run continuously. Eventually, as teams grow comfortable with the Chaos framework, these experiments can be run without warning. This helps ensure that teams implement best practices and adhere to them. However, this should be done with a small blast radius, and the tests should be easy to roll back.
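
Here's a hypothetical sketch of what bounding the blast radius can look like when selecting traffic: only a tiny, deterministic slice of requests lands in the experiment group, and opted-out users and high-business-value requests are excluded entirely. The field names and the 0.5% figure are made up for illustration.

```python
import hashlib

EXPERIMENT_PERCENT = 0.5  # hypothetical: at most 0.5% of traffic sees the fault

def assign_group(request, opted_out_users):
    """Return 'experiment', 'control', or 'excluded' for a single request."""
    # Users who opted out of chaos experiments are never disrupted.
    if request["user_id"] in opted_out_users:
        return "excluded"

    # High-business-value requests (e.g. checkout) stay outside the blast radius.
    if request.get("is_high_value", False):
        return "excluded"

    # Hash the user id so the same user consistently lands in the same group.
    bucket = int(hashlib.sha256(request["user_id"].encode()).hexdigest(), 16) % 10_000
    if bucket < EXPERIMENT_PERCENT / 100 * 10_000:  # 0.5% of 10,000 buckets = 50
        return "experiment"
    return "control"

# Example usage
print(assign_group({"user_id": "user-42", "is_high_value": False}, opted_out_users=set()))
```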

The point is to introduce faults into your system on purpose in order to detect bugs and failures with the ultimate goal of increasing confidence in your distributed system.

Fault Injection Testing vs Chaos Engineering

There’s a lot of overlap between Chaos Engineering and Fault Injection Testing. The main difference is that Chaos Engineering is meant to generate new information and find potential “unknown unknowns”, while Fault Injection Testing is meant more narrowly to test a specific, known condition.

Running continuous Chaos experiments in production, where you do things like forcing system clocks out of sync with each other or injecting latency into a certain service, is meant to generate new information about how the system operates under stress and faults, and to uncover unknown unknowns that could otherwise lead to failures.

Implementation

There are tons of different SaaS apps, open source libraries, toolkits, etc. for implementing Chaos engineering.

AWS provides Fault Injection Simulator, a managed service that makes it easy to run Chaos experiments on EC2 instances, EKS clusters, RDS instances, etc. You can run tests like incorporating a delay in autoscaling or maxing out CPU on an EC2 instance.
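
As a rough, hedged sketch of what that looks like with the AWS SDK, the snippet below defines an FIS experiment template that stops one tagged EC2 instance and restarts it after five minutes, with a CloudWatch alarm as a stop condition. The role ARN, tags, and alarm ARN are placeholders, and you should double-check the action names and parameters against the FIS documentation before relying on anything like this.

```python
import boto3

fis = boto3.client("fis")

# Define an experiment template: stop one tagged EC2 instance, restart it after 5 minutes.
# All ARNs and tags below are placeholders.
template = fis.create_experiment_template(
    clientToken="chaos-demo-001",
    description="Stop one tagged EC2 instance and restart it after 5 minutes",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    targets={
        "chaos-targets": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},
            "selectionMode": "COUNT(1)",  # blast radius: a single instance
        }
    },
    actions={
        "stop-instance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT5M"},
            "targets": {"Instances": "chaos-targets"},
        }
    },
    stopConditions=[
        # Halt the experiment automatically if this CloudWatch alarm fires.
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate",
        }
    ],
)

# Kick off the experiment from the template.
experiment = fis.start_experiment(
    experimentTemplateId=template["experimentTemplate"]["id"]
)
print(experiment["experiment"]["id"])
```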

Microsoft offers Azure Chaos Studio, which is a managed Chaos service for their cloud platform where you can run similar experiments like adding CPU pressure, network latency, infrastructure outages, and more.

There are also startups building products targeted specifically at Chaos Engineering, like Gremlin, along with adjacent tooling like LaunchDarkly, whose feature flags are often used to gate experiments.

Depending on your tech stack, there’s also a huge number of open source Chaos tools for whatever you’re using.

Chaos Monkey is an open source tool from Netflix that randomly terminates VM instances. kube-monkey is an implementation of Chaos Monkey for Kubernetes clusters. Byte-monkey lets you test failure scenarios for JVM apps. Apache Kafka has Trogdor for injecting faults.

Examples

We’ll go through a couple examples of big tech companies and how they employ Chaos principles.

Facebook

Project Storm is an effort at Facebook that was initiated after 2012’s Hurricane Sandy, when two of Facebook’s data centers were threatened by the superstorm. To verify that their systems can orchestrate the resulting traffic shifts, Facebook takes down services, data centers, or entire regions during working days as part of controlled Chaos experiments. After their 7-hour outage in 2021, Facebook published a postmortem blog post crediting these storm exercises with helping them quickly bring things back online and manage the increased load.

Here’s an article on Project Storm.

LinkedIn

LinkedIn has more than 10,000 microservices in production, and one way they test them is through the Waterbear project. LinkedOut is their in-house failure injection framework, designed to simulate failures across LinkedIn’s application stack and observe how the user experience degrades. They run request disruptions in production by adding latency, returning an exception, or simply timing out, and then measure how the user experience is impacted.
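
As a very rough illustration of those three disruption modes (not LinkedIn's actual code), a wrapper around a downstream call might delay it, fail it, or let it hang past the caller's timeout. The service name, probabilities, and sleep durations here are invented.

```python
import random
import time

DISRUPTIONS = ("delay", "error", "timeout")

class DownstreamError(Exception):
    """Simulated failure from a downstream microservice."""

def call_downstream(service, disruption=None):
    """Call a (stubbed) downstream service, optionally applying a disruption."""
    if disruption == "delay":
        time.sleep(0.2)        # add artificial latency
    elif disruption == "error":
        raise DownstreamError(f"simulated exception from {service}")
    elif disruption == "timeout":
        time.sleep(2.0)        # long enough to blow a (hypothetical) 1s timeout budget
    return {"service": service, "data": "ok"}  # normal stubbed response

# Disrupt a small random slice of calls and watch how the page degrades.
for _ in range(5):
    mode = random.choice(DISRUPTIONS) if random.random() < 0.2 else None
    try:
        print(call_downstream("profile-service", mode))
    except DownstreamError as err:
        print("did the page degrade gracefully?", err)
```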

Here’s an article from LinkedIn’s tech blog on how they do resilience engineering.

Audible

Audible commonly runs Chaos experiments to see how backend services react to high network latency, packet loss, CPU spikes, data center unavailability, and more. The KPIs engineers look at are things like audiobook playback starts per second, membership sign-ups, audiobooks added to cart, and more. They have an automated framework that spins up an experimental version of the backend service, then routes traffic to both the experimental cluster and an unaltered control cluster. As they run the Chaos experiment, they compare KPIs between the experimental and control clusters.
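
The comparison step might boil down to something like the sketch below, where each KPI from the experimental cluster is checked against the control cluster with a tolerance band. The metric names, numbers, and 5% threshold are all made up for illustration.

```python
# Hypothetical KPI samples gathered during a chaos run (numbers are invented).
control_kpis = {"playback_starts_per_sec": 120.0, "signups_per_min": 14.0}
experiment_kpis = {"playback_starts_per_sec": 111.0, "signups_per_min": 13.8}

TOLERANCE = 0.05  # flag any KPI that drops more than 5% versus the control cluster

for metric, control_value in control_kpis.items():
    drop = 1 - experiment_kpis[metric] / control_value
    status = "DEGRADED" if drop > TOLERANCE else "ok"
    print(f"{metric}: control={control_value:.1f} "
          f"experiment={experiment_kpis[metric]:.1f} drop={drop:.1%} [{status}]")
```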

Here’s a talk from Audible on how they run Chaos experiments.

Twitch

Twitch has hundreds of microservices in production, abstracted away from their front-end clients behind a single GraphQL API. They run their Chaos experiments by simulating failures on the GraphQL resolver side: the resolver looks at the request, checks for any headers that indicate a Chaos experiment, and executes accordingly. A common failure at Twitch is a microservice erroring out and failing to serve its portion of the data, so they test for it by including a header that tells the GraphQL resolver to short-circuit calls to specific backend microservices. They can then see how well the frontend clients handle this failure.
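
As a sketch of that pattern (not Twitch's actual implementation), a resolver wrapper might inspect a chaos header and short-circuit the call to the named backend service. The header name, error type, and resolver shape below are invented for illustration.

```python
CHAOS_HEADER = "x-chaos-fail-service"  # hypothetical header name

class ChaosInjectedError(Exception):
    """Raised instead of calling the backend when a chaos header targets it."""

def with_chaos(service_name, resolver):
    """Wrap a resolver so it fails on purpose when the request carries the chaos header."""
    def wrapped(request, *args, **kwargs):
        if request.get("headers", {}).get(CHAOS_HEADER) == service_name:
            # Short-circuit: pretend the backend microservice errored out.
            raise ChaosInjectedError(f"simulated failure of {service_name}")
        return resolver(request, *args, **kwargs)
    return wrapped

# Example: a resolver for stream metadata backed by a (hypothetical) "streams" service.
def resolve_stream(request, stream_id):
    return {"id": stream_id, "title": "some stream"}  # stand-in for a real backend call

resolve_stream = with_chaos("streams", resolve_stream)

# A normal request succeeds; one tagged with the chaos header fails on purpose,
# which lets you observe how the client handles the missing data.
print(resolve_stream({"headers": {}}, "123"))
try:
    resolve_stream({"headers": {CHAOS_HEADER: "streams"}}, "123")
except ChaosInjectedError as err:
    print("client sees:", err)
```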

Here’s a blog post from Twitch engineering on how they do Chaos testing.

Target

Target has Game Days, which are events where they conduct Chaos experiments against components in their system. Before the game day, they outline the application undergoing experimentation, the type of attack, the duration and scope of the attack, the hypothesis, and more. They also ensure that proper observability is in place so they can see how the system responds. An example of a game day is injecting 100 milliseconds of latency into their proxy service for a span of 15 minutes and seeing how this affects the system.
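
At the proxy layer, a game day like that can come down to a very small piece of middleware. Here's a hedged sketch (not Target's implementation): add a fixed delay to each request, but only while the experiment window is open; the 100ms and 15-minute values mirror the example above.

```python
import time

LATENCY_SECONDS = 0.100          # 100 milliseconds of injected latency
EXPERIMENT_DURATION = 15 * 60    # 15-minute game day window
experiment_start = time.monotonic()

def maybe_inject_latency():
    """Sleep for the configured latency while the experiment window is still open."""
    if time.monotonic() - experiment_start < EXPERIMENT_DURATION:
        time.sleep(LATENCY_SECONDS)

def upstream_call(request):
    return {"status": 200, "path": request["path"]}  # stand-in for the real upstream

def handle_request(request):
    maybe_inject_latency()         # chaos: extra 100ms during the game day
    return upstream_call(request)  # then proxy the request as usual

print(handle_request({"path": "/v1/products"}))
```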

Here’s Part 1 and Part 2 of a series Target did on Chaos Engineering for their tech blog.

Tech Snippets