Chaos Engineering at Twitch

Hey Everyone!

Today we’ll be talking about

  • Automated Chaos Testing on the front-end at Twitch

    • A brief intro to Chaos Engineering and its 4 step process

    • How Twitch integrated Chaos Testing into GraphQL

    • Twitch’s Resilience Dashboard and how they monitor client resilience

  • Why LinkedIn changed their data analytics tech stack

    • LinkedIn previously used third party proprietary platforms for their data analytics tech stack.

    • This approach led to scaling problems and made it hard to evolve the systems.

    • LinkedIn switched to using open source software and the Hadoop ecosystem.

  • Tech Snippets

    • Kubernetes Best Practices 101

    • 50 Design Tips To Improve User Interfaces

    • Exploring how CPython works

    • Best Practices for Logging

How Twitch does Chaos Engineering

Chaos Engineering is a methodology pioneered at Netflix for testing the resiliency of your system.

You simulate different failures across your system using tools like Chaos Monkey, Gremlin, AWS Fault Injection Simulator, etc. and then measure what the impact is. These tools allow you to set up simulated faults (like blocking outgoing DNS traffic, shutting down virtual machines, packet loss, etc.) and then schedule them to run randomly during a specific time window.

Chaos Engineering is meant to be done as a scientific process, where you follow 4 steps.

  1. Define how your system should behave under normal circumstances using quantitative measurements like latency percentiles, error rates, throughput, etc.

  2. Create a control group and an experiment group. In an ideal implementation, you are running experiments directly on a small portion of real user traffic. Otherwise, use the staging environment. 

  3. Simulate failures that reflect real world events like server crashes, severed network connections, etc.

  4. Compare the difference in your quantitative measurements between the control group and the experiment group.

Typically, Chaos Engineering is used for measuring the resiliency of the backend (usually service-oriented architectures).

However, engineers at Twitch decided to use Chaos Engineering techniques to test their front-end. The question they wanted to answer was “If some part of their overall system fails, how does the front-end behave and what do end users see?”

Joaquim Verges is a senior developer at Twitch and he wrote a great blog post on Twitch’s process for chaos testing.

Here’s a summary

Twitch is a live streaming website where content creators can stream live video to an audience. They have millions of broadcasters and tens of millions of daily active users. At any given time, there are more than a million users on the site.

Twitch uses a service-oriented architecture for their backend and they have hundreds of microservices.

The front end clients use a single GraphQL API to communicate with the backend. GraphQL allows frontend devs to use a query language to request the exact data they’re looking for rather than calling a bunch of different REST endpoints.

The GraphQL server has a resolver layer that is responsible for calling the specific backend services to get all the data requested.
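To make this concrete, here's a minimal sketch of what a client-side GraphQL call could look like. The endpoint URL and field names below are made up for illustration; they are not Twitch's real schema or API.

```python
import requests

# Hypothetical query -- field names and endpoint are illustrative only.
QUERY = """
query ChannelPage($login: String!) {
  user(login: $login) {
    displayName
    stream {
      viewersCount
      game { name }
    }
  }
}
"""

response = requests.post(
    "https://gql.example.com/gql",   # placeholder GraphQL endpoint
    json={"query": QUERY, "variables": {"login": "some_streamer"}},
)
# Only the fields requested in the query come back in the response.
data = response.json().get("data", {})
```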

The most common fault that happens for their system is one of their microservices failing. In that scenario, GraphQL will forward partial data to the client and it’s the client’s job to handle the partial data gracefully and provide the best degraded experience possible.

Engineers at Twitch decided to use Chaos Engineering to test these microservice failure scenarios.

They created Chaos Mode, where they could pass an extra header to GraphQL calls in their staging environment. Within the header, they pass the name of the backend service that they want to simulate a failure for.

The GraphQL resolver layer will read this header and stop any call to that service.
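Here's a rough Python-flavored sketch of how a resolver layer might honor such a header. The header name, service names, and class structure are assumptions for illustration, not Twitch's actual implementation.

```python
# Hypothetical resolver-layer wrapper: if the incoming request carries a chaos
# header naming a backend service, calls to that service are short-circuited so
# the client receives partial data for the affected fields.

class ChaosAwareResolver:
    def __init__(self, service_clients):
        # service_clients maps a backend service name to a callable that fetches its data
        self.service_clients = service_clients

    def resolve(self, service_name, request_headers, **kwargs):
        failed_service = request_headers.get("x-chaos-mode-service")  # assumed header name
        if failed_service == service_name:
            # Simulate the dependency being down: skip the call and return no data.
            return None
        return self.service_clients[service_name](**kwargs)


# Usage: simulate a (hypothetical) "recommendations" service failing for one request.
resolver = ChaosAwareResolver({"recommendations": lambda user_id: ["chess", "music"]})
print(resolver.resolve("recommendations",
                       {"x-chaos-mode-service": "recommendations"},
                       user_id=42))  # -> None, i.e. degraded data for this field
```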

The main issue with this approach was that Twitch needed to know the name of each backend service to send in the GraphQL header, which meant maintaining a list of all the various backend services to test.

Features and services are constantly changing, so manually mapping specific services to test was not scalable. They needed a way for the test suite to “discover” the services that they should be simulating failures for.

To solve this, Twitch added a debug header in their GraphQL calls which enabled tracing at the GraphQL resolver layer. The resolvers record any method call done to internal service dependencies, and then send the information back to the client in the same GraphQL call.

From there, the client can extract the service names that were involved and use that as input for the Chaos Testing suite.
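Here's a hedged sketch of that discovery step: one traced call returns the services the resolvers touched, and those names become the chaos targets. The header name, response shape, and callables are assumptions, not Twitch's real code.

```python
# Hypothetical discovery: run a user flow's GraphQL call once with tracing on,
# collect the backend services the resolvers touched, then re-run the flow
# while simulating each of those services failing.

def discover_services(run_query):
    # run_query(headers=...) is a stand-in for executing the flow's GraphQL call
    response = run_query(headers={"x-debug-trace": "true"})
    # Assume the resolver layer echoes the services it called in the response "extensions"
    return set(response.get("extensions", {}).get("services_called", []))

def run_chaos_suite(run_query, run_e2e_test):
    results = {}
    for service in discover_services(run_query):
        # Re-run the same user flow while simulating this one service failing
        results[service] = run_e2e_test(headers={"x-chaos-mode-service": service})
    return results
```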

Visualization

Twitch has many end-to-end tests for all their clients that test the various user flows (navigating to a screen, logging in, sending a chat message, etc.)

They run each of these tests against every Chaos Mode microservice failure and check whether the test still passes. Then, they aggregate all the Chaos Mode results for each user flow and use that to calculate a resilience score for that particular user action.

Resilience scores are displayed on a Dashboard where it’s easy to see any anomalies in performance. They run Chaos Mode tests every night for their Android, iOS and Web clients.
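The blog post doesn't give the exact scoring formula, but a simple way to think about it is as a pass rate across the chaos runs for a flow. The snippet below is only an illustrative assumption.

```python
# Hypothetical scoring: treat the resilience score of a user flow as the
# fraction of chaos runs (one per simulated service failure) in which the
# end-to-end test still passed.

def resilience_score(chaos_results):
    """chaos_results maps service name -> bool (did the test pass under that failure?)."""
    if not chaos_results:
        return 1.0
    return sum(chaos_results.values()) / len(chaos_results)

print(resilience_score({"chat": True, "recommendations": True, "ads": False}))  # ~0.67
```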

Next Steps

Twitch has been able to use this testing tool to boost resilience across all their clients.

Next, they want to add the ability to test secondary microservices (services that are called by another service, rather than only those called directly from the GraphQL resolver layer).

They also want to add the ability to simulate failures for multiple services at once.

For more details, you can read the full blog post here.

Tech Snippets

Kubernetes Best Practices 101 - This is a great list of best practices that you can follow when using Kubernetes. There are tips for cost optimization, security, scalability and more.

50 Design Tips to Improve User Interfaces - If you’re like me, your UIs look like they were designed by a 6 year old. This is a great PDF with 50 design tips on how to develop better looking UIs.

Behind the scenes of CPython - A series of blog posts that digs into how CPython (the default and most widely used implementation of Python) works under the hood.

Best Practices for Logging - You probably have some type of a logging system in place. Here are some useful tips to make your logs more meaningful.

  • Log after, not before - You should be logging things that happened, not things that you're going to do.

  • Separate parameters and messages - A typical log message contains two parts: a handwritten message that explains what was going on, and a list of the parameters involved. Separate both!

  • Distinguish between WARNING and ERROR - Log levels exist for a reason, so make sure you use them appropriately. If an operation worked but there were some issues, that's a WARNING. If the operation didn't work at all, that's an ERROR.
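Here's a minimal Python sketch that puts these tips together. The `charge` function and `gateway` callable are made up so the example stays self-contained; they are not from the linked article.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")

def charge(order_id, amount, gateway):
    # gateway is a hypothetical stand-in for a real payment client; it returns (ok, retried)
    ok, retried = gateway(order_id, amount)
    if not ok:
        # ERROR: the operation did not work.
        logger.error("charge failed", extra={"order_id": order_id, "amount": amount})
        return False
    if retried:
        # WARNING: the operation worked, but there were issues along the way.
        logger.warning("charge needed a retry", extra={"order_id": order_id})
    # Log after the fact ("charge completed"), not before ("charging...");
    # the message stays handwritten while the parameters live in `extra`
    # (a structured formatter or handler would render those fields).
    logger.info("charge completed", extra={"order_id": order_id, "amount": amount})
    return True

charge("A-1001", 25.00, gateway=lambda oid, amt: (True, False))
```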

Why LinkedIn changed their data analytics tech stack

Steven Chuang, Qinyu Yue, Aaravind Rao and Srihari Duddukuru are engineers at LinkedIn. They published an interesting blog post on transitioning LinkedIn’s analytics stack from proprietary platforms to open source big data technologies.

Here’s a summary

During LinkedIn’s early stages (early 2010s), the company was growing extremely quickly. To keep up with this growth, they leveraged several third party proprietary platforms in their analytics stack.

Using these proprietary platforms was far quicker than piecing together off-the-shelf products.

LinkedIn relied on Informatica and Appworx for ETL to a Data Warehouse built with Teradata.

ETL stands for Extract, Transform, Load. It’s the process of copying data from various sources (the different data producers) into a single destination system (usually a data warehouse) where it can more easily be consumed.
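As a toy illustration of those three steps (not LinkedIn's actual pipeline), here's a tiny in-memory ETL job in Python. The table and column names are made up.

```python
import sqlite3
from collections import Counter

# Extract raw signup events from a source database, transform them into
# per-country totals, and load the result into a warehouse-style summary table.

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE signups (user_id INTEGER, country TEXT)")
source.executemany("INSERT INTO signups VALUES (?, ?)",
                   [(1, "US"), (2, "IN"), (3, "US")])

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE signups_by_country (country TEXT, total INTEGER)")

rows = source.execute("SELECT country FROM signups").fetchall()        # Extract
totals = Counter(country for (country,) in rows)                       # Transform
warehouse.executemany("INSERT INTO signups_by_country VALUES (?, ?)",  # Load
                      totals.items())

print(warehouse.execute("SELECT * FROM signups_by_country").fetchall())
```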

LinkedIn’s legacy analytics tech stack

This stack served LinkedIn well for 6 years, but it had some disadvantages:

  • Lack of freedom to evolve - Because of the closed nature of this system, they were limited in options for innovation. Also, integration with internal and open source systems was a challenge.

  • Difficulty in scaling - Data pipeline development was limited to a small central team due to the limits of Informatica/Appworx licenses. This increasingly became a bottleneck for LinkedIn’s rapid growth.

These disadvantages motivated LinkedIn engineers to develop a new data lake (data lakes let you store raw data without having to structure it first) on Hadoop in parallel.

You can read about how LinkedIn scaled Hadoop Distributed File System to 1 exabyte of data here.

However, they did not have a clear transition process, and that led to them maintaining both the new system and the legacy system simultaneously.

Data was copied between the tech stacks, which resulted in double the maintenance cost and complexity.


Maintaining redundant systems led to unnecessary complexity

Data Migration

To solve this issue, engineers decided to migrate all datasets to the new analytics stack with Hadoop.

In order to do this, the first step was to derive LinkedIn’s data lineage.

Data lineage is the process of tracking data as it flows from data sources to consumption, including all the transformations the data underwent along the way.

Knowing this would enable engineers to plan the order of dataset migration, identify zero usage datasets (and delete them for workload reduction) and track the usage of the new vs. old system.
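Here's a rough sketch (not LinkedIn's tooling) of how a lineage graph can surface zero-usage datasets: anything that no downstream dataset or dashboard reads becomes a candidate for deletion instead of migration. The dataset names and graph are made up.

```python
# Map each dataset to the set of downstream consumers that read it.
consumers = {
    "raw_page_views":   {"daily_engagement"},
    "daily_engagement": {"exec_dashboard"},
    "old_ab_snapshot":  set(),   # nothing reads this
    "legacy_backfill":  set(),   # nothing reads this either
}

# Datasets with no consumers can be deleted rather than migrated.
zero_usage = [name for name, readers in consumers.items() if not readers]
print(zero_usage)  # ['old_ab_snapshot', 'legacy_backfill']
```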

You can read exactly how LinkedIn handled the data lineage process in the full article.

After data lineage, engineers used this information to plan major data model revisions.

They planned to consolidate 1424 datasets down to 450, effectively cutting ~70% of the datasets from their migration workload.

They also transformed datasets that were generated from OLTP workloads into a different model that was better suited for business analytics workloads.

The migration was done using various data pipelines, and it exposed bottlenecks in LinkedIn’s systems.

One bottleneck was poor read performance of the Avro file format. Engineers migrated to ORC and consequently saw a read speed increase of ~10-1000x, along with a 25-50% improvement in compression ratio.
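A format rewrite like this is often done with a batch job. Below is a hedged PySpark sketch; it assumes the spark-avro package is available, the paths are placeholders, and LinkedIn's actual migration pipelines were more involved than this.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("avro-to-orc")
         .getOrCreate())

# Read the legacy Avro dataset and rewrite it as ORC (columnar, compressed),
# which is what gives the read-speed and compression-ratio improvements.
df = spark.read.format("avro").load("/data/legacy/profile_views_avro")
(df.write
   .format("orc")
   .option("compression", "zlib")
   .mode("overwrite")
   .save("/data/migrated/profile_views_orc"))
```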

After the data transfer, deprecating the 1400+ datasets on the legacy system would be tedious and error prone if done manually, so engineers also built an automated system to handle this process.

They built a service to coordinate the deprecation, where the service would identify dataset candidates for deletion (datasets with no dependencies and low usage) and then send emails to users of those datasets about the upcoming deprecation.

The service would also notify SREs to lock, archive and delete the dataset from the legacy system after a grace period.
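Here's a hypothetical sketch of that coordination logic. The thresholds, grace-period length, and field names are assumptions, not details from LinkedIn's post.

```python
from datetime import date, timedelta

GRACE_PERIOD = timedelta(days=30)   # assumed length

def plan_deprecations(datasets, today=None):
    today = today or date.today()
    actions = []
    for ds in datasets:
        # Only datasets with no dependencies and low usage are candidates.
        if ds["dependencies"] or ds["reads_last_90d"] > 10:   # made-up threshold
            continue
        actions.append({
            "dataset": ds["name"],
            "notify_users": ds["users"],               # email about the upcoming deprecation
            "sre_action_on": today + GRACE_PERIOD,     # lock/archive/delete after the grace period
        })
    return actions

print(plan_deprecations([
    {"name": "old_ab_snapshot", "dependencies": [], "reads_last_90d": 2, "users": ["a@x.com"]},
    {"name": "daily_engagement", "dependencies": ["raw_page_views"], "reads_last_90d": 900, "users": []},
]))
```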

The New System

The design of the new ecosystem was heavily influenced by the old ecosystem, and addressed the major pain points from the legacy tech stack.

  • Democratization of data - The Hadoop ecosystem enabled data development and adoption by other teams at LinkedIn. Previously, only a central team could build data pipelines on the old system due to license limits with the proprietary platforms.

  • Democratization of tech development with open source projects - All aspects of the new tech stack can be freely enhanced with open source or custom-built projects.

  • Unification of tech stack - Simultaneously running 2 tech stacks showed the complexity and cost of maintaining redundant systems. Unifying the technology allowed for a big boost in efficiency.

LinkedIn’s new business analytics tech stack

The new tech stack has the following components

  • Unified Metrics Pipeline - A unified platform where developers provide ETL scripts to create data pipelines.

  • Azkaban - A distributed workflow scheduler that manages jobs on Hadoop.

  • Dataset Readers - Datasets are stored on the Hadoop Distributed File System and can be read in a variety of ways. They can be read by DALI, an API developed to allow LinkedIn engineers to read data without worrying about its storage medium, path or format. They can also be read by various dashboards and ad-hoc queries for business analytics.

For more details on LinkedIn’s learnings and their process for the data (and user) migration, read the full article.

Interview Question

You are given an integer array called nums. You are also given an integer called target.

Find 3 integers in nums such that the sum is closest to target.

Return the sum of the 3 integers.

Each input will have exactly one solution.

Previous Question

As a refresher, here’s the previous question

You are given a sorted array nums.

Write a function that removes duplicates from nums in-place so that each unique element appears at most twice.

You must do this using O(1) space; you cannot allocate extra space for another array.

Solution

We can solve this question using the Two Pointer Method.

We have a slow and fast pointer, where our slow pointer will always be less than the fast pointer.

Since each value is allowed to appear at most twice (we only remove occurrences beyond the second), we start both slow and fast at index 2.

We iterate through the sorted array nums and check if nums[slow - 2] == nums[fast].

If this is the case, then nums[fast] is the third (or later) occurrence of that value, so we skip it: we keep the slow pointer where it is (so a later value can overwrite nums[slow], removing the extra occurrence from nums) and only increment the fast pointer.

If it is not the case, then we can rewrite the value at the slow pointer with the value at the fast pointer. Then, we can increment both the slow and fast pointers.
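Here's a short Python version of the two pointer approach described above (the function and variable names are our own):

```python
def remove_duplicates(nums):
    """Keep each value at most twice, in place; return the new logical length."""
    if len(nums) <= 2:
        return len(nums)
    slow = 2
    for fast in range(2, len(nums)):
        # nums[slow - 2] == nums[fast] means nums[fast] would be a third copy, so skip it.
        if nums[slow - 2] != nums[fast]:
            nums[slow] = nums[fast]
            slow += 1
    return slow

nums = [1, 1, 1, 2, 2, 3]
length = remove_duplicates(nums)
print(nums[:length])   # [1, 1, 2, 2, 3]
```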