Introduction to Chaos Engineering

Plus, lessons from 8 years of Kubernetes, how Slack's CPO makes product decisions and more.

February 09, 2024

Hey Everyone!

Today we’ll be talking about

Introduction to Chaos Engineering
- In 2021, Facebook lost over $80 million dollars because of 7 hours of downtime
- Large distributed systems can be incredibly tricky to reason-about and debug.
- Chaos Engineering applies principles from the scientific method to test your backend and find potential bugs/issues
- We’ll talk about these principles and also delve into how big tech companies are implementing Chaos Engineering
Tech Snippets
- Lessons from 8 Years of Kubernetes
- How I Put My Whole Life into a Single Database
- Design and Evaluation of IPFS
- How Slack’s CPO Makes Product Decisions
- Software Engineering Culture Metrics
- Distributed Systems for Fun and Profit

An Introduction to Chaos Engineering

As “software eats the world”, the cost of downtime is getting higher and higher for companies. Gartner estimates the average cost of IT downtime at ~$336,000 per hour. This scales up massively for big tech companies.

Facebook was down for 7 hours in 2021 due to a DNS issue and they earned little to no advertising revenue during that period. They make ~$13 million an hour from ads on Facebook, Instagram, etc. so that incident cost Facebook more than $80 million. Not good.

The largest tech companies are also operating massive distributed systems with thousands of services that are communicating over the network. These are built on cloud services with hundreds of thousands of computers, hard drives, routers, etc. involved.

When you have a complex system like this, it becomes impossible to just reason about your system. Testing and reasoning about an entire microservices-oriented backend generally becomes much more difficult than doing so with a monolith. There are far more things that can go wrong and building a complete mental model around all the hundreds/thousands of different services is close to impossible.

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”

Leslie Lamport

To understand and learn about possible failure modes in these massive networks, we can do what researchers do for other complex systems (the human body, climate, the biosphere, etc.). You apply the scientific method to test your system.

Chaos Engineering is using the scientific method to learn more about your system and how it reacts under various conditions.

Chaos Engineering has become quite popular amongst the largest companies in the world. A paper published in the ACM Symposium on Cloud Computing by researchers at CMU found that of the 50 largest corporations in the US, at least 12 have publicly spoken about their use of Chaos Engineering for improving reliability.

Philosophy Behind Chaos Engineering

Chaos Engineering is the discipline of applying the scientific method to your distributed system to learn more about possible failures.

The goal is to run experiments where the experiments follow 4 steps:

Measure the Steady State - Define metrics that indicate how your system should perform when things are normal. Quantify this with things like queries per second, 99th percentile latency, errors, time to first byte, etc.
You want to identify what key performance indicators (KPIs) are relevant to your application and measure them. You should make sure that you can continue to measure these KPIs while running the experiment.
Hypothesize how the system will perform under a disruption - Establish a control group, an experiment group and disruptions you want to test. Examples of disruptions are terminating virtual machines, severing network connections, filling up disk space of an instance, adding latency to backend service calls, packet loss, and more (simulate any faults you might face). Create a hypothesis for how the KPIs of your experiment group will be affected from the disruption.
Introduce Chaos - Run the disruptions on your experiment group. Ideally, this should be done in production on a small subset of traffic in order to be the most useful. However, you should obviously also minimize business impact. This might mean excluding requests that have high business value from being selected into the experiment group or retrying failed requests and putting them in the control group. Many companies (like Netflix for example) also allow users to opt out of being placed in experiment groups.
Check Hypothesis - See how your hypothesis about the KPIs held up. Was there a big difference in KPIs between the control group and experimental group? If so, then you should analyze whether engineering time should be dedicated to fixing that possible failure mode.

The key principles with Chaos Engineering are

Run Experiments in Production - When starting Chaos experiments, you should be using a staging environment. After getting experience and gaining confidence, you should eventually start running your Chaos experiments in Production. The reason why is that it simply isn’t possible to replicate everything in a staging environment (if you’re operating at the massive scale where doing Chaos engineering makes sense).
Minimize Blast Radius - Experimenting in production can cause customer pain, so it’s your responsibility to make sure the experiments are minimized and contained. Run the experiment on a small selection of traffic, database replicas, servers, etc. You should know your blast radius, which is the number of things the experiment can affect. Make the blast radius as small as possible so that you narrow down the things you’re testing for. This makes it easier to do a root cause analysis afterwards and it also reduces potential business impact.
Automate Experiments to Run Continuously - Create small, focused Chaos experiments and then automate them to run continuously. Eventually, as teams mature to the Chaos framework, these experiments can be run without warning. This can help ensure that teams implement best practices and adhere to them. However, this should be done with a small blast radius and tests should be easy to roll back.

The point is to introduce faults into your system on purpose in order to detect bugs and failures with the ultimate goal of increasing confidence in your distributed system.

Fault Injection Testing vs Chaos Engineering

There’s a lot of overlap with Chaos Engineering and Fault Injection Testing. The main difference is that Chaos Engineering is meant to generate new information and find potential “unknown unknowns”. Fault injection is moreso meant to test a specific condition.

Running continuous Chaos experiments in production where you do things like force system clocks to be out of sync with each other or inject latency in a certain service is meant to generate new information about how the system operates under stress/faults and uncover unknown unknowns that could’ve led to failures.

This is the first part of our article on Chaos Engineering.

In the second part, we’ll go through

Implementation and tools like AWS Fault Injection Simulator, Chaos Monkeys, etc.
How Facebook and LinkedIn use Chaos Testing
Audible and Twitch’s usage of Chaos Engineering

and more.

You can read the full article by subscribing to Quastor Pro.

Tech Snippets

Learnings From 8 Years of Kubernetes

Anders Jönsson is the head of engineering and product at Urb-it and he wrote a terrific blog post about his experience using Kubernetes. They initially self-managed on AWS but ended up moving to Azure Kubernetes Service.

Anders shared his lessons learned around vendor selection, resource management, disaster recovery and more.

medium.com/@.anders/learnings-from-our-8-years-of-kubernetes-in-production-two-major-cluster-crashes-ditching-self-0257c09d36cd

How I Put My Whole Life into a Single Database

The “Quantitative Life” movement is where you track metrics around your life and health so you can figure out how to optimize them.

You collect data on things like
- how many hours you slept
- your weight
- hours worked

Analyzing this data can give a lot of insights on how to improve your happiness/fulfillment.

Felix wrote a fantastic blog post about how he does this.

krausefx.com//blog/how-i-put-my-whole-life-into-a-single-database

Design and Evaluation of IPFS: A Storage Layer for the Decentralized Web

IPFS (InterPlanetary File System) is a peer-to-peer protocol for storing files across a network of computers. It lets you create permanent, immutable links and reduce reliance on centralized servers.

This is a great review of the paper behind IPFS and it goes through the papers contributions, how the system works, evaluation and more.

www.micahlerner.com/2022/10/31/design-and-evaluation-of-ipfs-a-storage-layer-for-the-decentralized-web.html

How Slack’s Chief Product Officer makes Product Decisions

This is a terrific blog post based on an interview of Noah Desai Weiss, Slack’s Chief Product Officer.

Noah goes through
1. his approach to product leadership. He emphasizes the importance of taking calculated risks for outsized improvement rather than just incremental improvements.

2. distinguishing between permanent and reversible decisions (and encouraging faster action on the former)

3. his three-step process for making decisions in product development

and more.

review.firstround.com/how-to-take-bigger%2C-bolder-product-bets-lessons-from-slacks-chief-product-officer

Software Engineering Culture Metrics

This is a great doc that defines how you can think about the engineering culture of your organization.

Areas you can measure are things like
- level of adoption of internal platform features
- average time to merge code
- deployment health
- engagement during all-hands meetings and stand ups

and more

davidxiang.com/2021/02/10/software-engineering-culture-metrics/?utm_source=blog.quastor.org&utm_medium=referral&utm_campaign=how-quora-scaled-mysql-to-100k-queries-per-second

Distributed Systems for Fun and Profit

This is a terrific series of in-depth blog posts that give an introduction to working with distributed systems.

Topics covered include a discussion around time (vector clocks, global vs. local vs. no-clock assumptions), replication (synchronous, asynchronous, partition tolerance), CRDTs and more!

book.mixu.net/distsys