How Robinhood Reduced Load Related Incidents by 75%

October 11, 2022

Hey Everyone!

Today we’ll be talking about

How Robinhood Load Tests their Backend
- The Architecture of a Load Testing Service Robinhood built that reduced their load-related backend incidents by 75%.
- Building a Request Capture System that captures production traffic, cleans it and stores it in S3
- Building a Request Replay system that replays the captured production traffic to load test backend services
Tech Snippets
- Free Software and Web Apps that you can host yourself
- Reducing our AWS S3 costs by 90% with Apache Iceberg
- Expectations of Professional Software Engineers
- How Grab cut their static JavaScript assets by two-thirds.

Plus, we have a solution to our last Google interview question and a new question from Microsoft.

Learn Best Practices for Security

Want to put your security skills to the test?

Compete in Fetch the Flag CTF and solve 16 challenges, including pwn, crypto, and web.

Play individually or bring your team to come out on top and win prizes.

If you're new to CTF, you can join the CTF 101 Workshop to learn best practices to get you ready to compete.

How Robinhood Load Tests their Backend

Robinhood is a mobile app that allows you to purchase stocks, ETFs, cryptocurrencies, options and other financial products. As of March 2022, the app has more than 16 million monthly active users and over 23 million accounts.

The company has scaled up very quickly and has had multiple days where its app was the #1 most downloaded app in the iOS/Android App Store (it’s extremely rare for a financial brokerage to achieve that kind of growth).

This hyper growth has led to outages and scaling issues, which have caused some controversy. As you might imagine, it can be very frustrating for users if they can't place trades during times of market turmoil due to a Robinhood outage.

To improve their confidence in their system’s ability to scale, the Load and Fault team at Robinhood created a system that regularly runs load tests on services in Robinhood’s backend.

After implementing load testing, Robinhood saw a 75% drop in load-related incidents. They wrote a great blog post discussing the architecture of the service and how they’re using it.

Here’s a Summary

Robinhood wanted to build confidence that their services could handle an increase in the number of queries per second without sacrificing latency.

To accomplish this, engineers built a load testing framework with the following principles:

- High Signal - The load test system needed to provide high signal data on scalability to service owners, so they decided to run in production whenever possible. They would also replay real production traffic instead of sending simulated traffic.
- Safety - Robinhood is a financial brokerage, so running load tests in production must be done very carefully. Testing should have minimal production/customer impact.
- Automate - Running load tests should be automated to run regularly (with the goal of running a load test per deployment) and should not require any involvement from the Load and Fault team.

Load Testing Architecture

The load testing framework was made up of two major systems

- Request Capture System - Capture real customer traffic hitting their backend services
- Request Replay System - Replay the captured production traffic on backend services and measure how they handle the load

We’ll go through the architecture of both systems.

Request Capture System

The Robinhood team wanted to build an easy way to capture production traffic so that it could be replayed to test load on various backend services.

Robinhood uses nginx as a load balancer, so the Load and Fault team used that to record customer traffic by adding nginx logging rules.

If you're unfamiliar with Load Balancers, we previously did a tech dive on them that you can check out here. (Here's a 25% off discount for Quastor Pro if you're not a member).

Here’s Robinhood's workflow

- Nginx samples a percentage of traffic and logs the user UUID, URI and timestamp
- Filebeat monitors nginx logs and pushes new log lines to Apache Kafka
- Logstash monitors Kafka and pushes the logs to AWS S3. Filebeat and Logstash are part of the Elastic stack.
- A data pipeline takes the raw data and filters for GET requests. It also appends an authentication token to each request that specifies a read-only scope. Filtering for GET requests and adding the auth token helps prevent any future mishaps where customer data is accidentally modified during a load test.
- After getting cleaned and modified, the data is stored back in AWS S3
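
To make the cleaning step concrete, here's a minimal Python sketch of what such a pipeline stage might look like. This is not Robinhood's code. The record fields, the `mint_read_only_token` helper and the `Authorization` header format are all hypothetical, and the sketch assumes the HTTP method is logged alongside the UUID, URI and timestamp so that GET requests can be filtered.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)
class CapturedRequest:
    """One sampled log line. The `method` field is an assumption."""
    method: str
    uri: str
    user_uuid: str
    timestamp: float


@dataclass(frozen=True)
class ReplayRequest:
    """A cleaned request that is safe to replay."""
    uri: str
    headers: Dict[str, str]
    timestamp: float


def mint_read_only_token(user_uuid: str) -> str:
    # Hypothetical placeholder: a real system would call an auth service
    # that issues a token restricted to a read-only scope.
    return f"read-only-token-for-{user_uuid}"


def clean_requests(records: Iterable[CapturedRequest]) -> Iterator[ReplayRequest]:
    """Keep only GET requests and attach a read-only auth token to each."""
    for record in records:
        if record.method.upper() != "GET":
            continue  # drop anything that could mutate customer data
        token = mint_read_only_token(record.user_uuid)
        yield ReplayRequest(
            uri=record.uri,
            headers={"Authorization": f"Bearer {token}"},
            timestamp=record.timestamp,
        )
```

The two safeguards are deliberately redundant: even if a non-GET request slipped through the filter, a token with a read-only scope should not be able to modify anything.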

Request Replay

The Request Replay system is responsible for load testing backend services using the GET requests that were stored in S3.

The system consists of two components: a pool of load generating pods (a Kubernetes deployment) and an event loop that manages these pods (a Kubernetes job).

The event loop starts and controls the load test by managing the load generating pool. While the test is running, the event loop is also monitoring the target service’s health. If the loop detects that the target service is reporting as unhealthy, then it immediately stops the load test and removes the load generating pool.

The pods in the pool run k6, which is an open source tool for running load tests. They stream requests from the S3 bucket that was populated by the Request Capture system.
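
Here's a rough Python sketch of the control flow described above: start the load generating pool, poll the target's health, and tear everything down as soon as the target looks unhealthy. It's only an illustration, not Robinhood's implementation. `start_pool`, `stop_pool` and `is_healthy` are hypothetical callables standing in for whatever talks to Kubernetes and the monitoring system.

```python
import time
from typing import Callable


def run_load_test(
    start_pool: Callable[[], None],
    stop_pool: Callable[[], None],
    is_healthy: Callable[[], bool],
    duration_s: float,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Run one load test; return True if it completed, False if aborted."""
    start_pool()
    deadline = clock() + duration_s
    try:
        while clock() < deadline:
            if not is_healthy():
                return False  # abort as soon as the target reports unhealthy
            sleep(poll_interval_s)
        return True
    finally:
        # The pool is removed whether the test finished, aborted or crashed.
        stop_pool()
```

Putting the teardown in a `finally` block means the load generators are removed on every exit path, which matches the safety-first goal of the design.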

Safety Mechanisms

These load tests are being run in production, so safety is incredibly important. It’s vital that customer experience does not take any hits from this testing.

To ensure this, Robinhood takes several measures

- Read Only Traffic - The request capture system only stores GET requests, so the replay system will only run read-only traffic and never send traffic that affects customer data.
- Clear Safety Levers - In case a load test needs to be stopped, there are multiple clear safety levers that can be pulled. There are UI test controls and slack notifications with deep links to stop a certain load test. There’s also a button to stop the entire load testing system if there’s any emergency.

Load Test Wins

With this system, Robinhood has been able to detect performance regressions and identify bottlenecks in their backend. Whenever a new service is rolled out, they’re able to test it to ensure that it can withstand Robinhood scale and future growth.

For next steps, the team plans to add a way to test mutating requests (POST) and to expand their tests to include gRPC communication.

For more details, you can read the full blog post here.


Learn How Teams Use Feature Flags to Ship Products More Confidently

LaunchDarkly surveyed over 1,000 software teams to get the key statistics shaping how CTOs view the value of feature management.

They published the results in an extensive report on the state of Feature Management in 2022.

The report covers

how IT departments use feature flags in their codebase
how much teams spend on feature management tools
the Mean Time To Resolution for issues for those with/without feature management

and much more!

sponsored

Tech Snippets

Expectations of Professional Software Engineers - Mike Acton is a veteran of the games industry and he gave a great talk in 2019 on things he expects of developers he works with. Some examples are "I can articulate precisely what problem I'm trying to solve", "I can articulate how much my problem is worth solving", "I can articulate the most concrete use case of what I'm developing" and more. This is a good checklist to think about when you're working on a project.

How We Cut GrabFood.com's Page JavaScript Asset Sizes by 3x - Grab Food serves over 175 million requests every week with over 1 terabyte of network egress. Any reduction in their JavaScript bundle size helps with faster site loads, cost savings for Grab, faster build times and more. Grab Engineering was able to do this with several optimizations and cut their static JavaScript assets from 750 KB to 250 KB. This blog post delves into how they cut their static asset size by two-thirds.

Free software and web apps that you can host yourself - This is a fantastic GitHub repo with a ton of free web apps and software that you can host on your own servers. There are things like blogging platforms, document management tools, feed readers and more. The repo also links to the source code for these tools, so it's a great resource if you want to see how various web apps are built.

Reducing our Amazon S3 costs by 90% with Apache Iceberg - This is a great blog post with an intro to table formats and their design. They talk about how Iceberg is designed to work well with modern cloud infrastructure like S3. Here's another blog post on the actual process of migrating to Iceberg on S3. (sponsored)

Interview Question

Given the head of a linked list, rotate the list to the right by k places.

Here’s the question in LeetCode.

Previous Question

As a reminder, here’s our last question

You are given two sorted arrays that can have different sizes.

Return the median of the two sorted arrays.

Here’s the question in LeetCode.

Solution

The brute force way of solving this is to combine both arrays using the merge algorithm from Merge Sort.

After combining both sorted arrays, we can find the median of the combined array.

This algorithm takes linear time, O(m + n), where m and n are the lengths of the two arrays.

Here’s the Python 3 code.
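
A minimal sketch of the merge-based approach follows. The function name `find_median_by_merging` is just an illustrative choice, and the sketch assumes both inputs are already sorted and that at least one of them is non-empty.

```python
from typing import List


def find_median_by_merging(a: List[int], b: List[int]) -> float:
    """Merge two sorted lists, then read the median off the merged list."""
    merged: List[int] = []
    i = j = 0
    # Standard merge step from Merge Sort: repeatedly take the smaller head.
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # At most one of these still has elements left.
    merged.extend(a[i:])
    merged.extend(b[j:])

    n = len(merged)
    if n == 0:
        raise ValueError("at least one input must be non-empty")
    mid = n // 2
    if n % 2 == 1:
        return float(merged[mid])
    return (merged[mid - 1] + merged[mid]) / 2
```

For example, `find_median_by_merging([1, 2], [3, 4])` merges to `[1, 2, 3, 4]` and averages the two middle values.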

However, can you find the median of the two sorted arrays without having to combine them?

It turns out you can, using binary search.

The solution is quite complicated and it’s best understood visually, so I’d highly recommend you watch this video for the explanation (it’s the best explanation I found). It also walks you through the Python 3 code.

A TLDW is that you can run binary search on the smaller array to select which element you’re going to create your partition from.

Since the median element has to have the same number of elements on the right and left, you can derive the longer array’s partition from the smaller array’s partition.

Then, you check if those two partitions give you the median of the combined array.

If they don’t, then you can adjust the partition in the shorter array up or down and reevaluate.
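
If you'd like to see the partition idea in code, here's one way it could be written in Python. Treat it as a sketch rather than the code from the video. The function name and variable names are my own, and it assumes both inputs are sorted and that at least one is non-empty.

```python
from math import inf
from typing import List


def find_median_sorted_arrays(a: List[int], b: List[int]) -> float:
    """Median of two sorted lists via binary search on the shorter list."""
    if len(a) > len(b):
        a, b = b, a  # always binary search the shorter list
    m, n = len(a), len(b)
    if n == 0:
        raise ValueError("at least one input must be non-empty")

    half = (m + n + 1) // 2  # number of elements in the combined left half
    lo, hi = 0, m
    while lo <= hi:
        i = (lo + hi) // 2  # elements taken from a for the left half
        j = half - i        # elements taken from b, derived from i

        # Use +/- infinity when a partition sits at the edge of a list.
        a_left = a[i - 1] if i > 0 else -inf
        a_right = a[i] if i < m else inf
        b_left = b[j - 1] if j > 0 else -inf
        b_right = b[j] if j < n else inf

        if a_left <= b_right and b_left <= a_right:
            # Valid partition: everything on the left <= everything on the right.
            if (m + n) % 2 == 1:
                return float(max(a_left, b_left))
            return (max(a_left, b_left) + min(a_right, b_right)) / 2
        if a_left > b_right:
            hi = i - 1  # took too many from a, move the partition down
        else:
            lo = i + 1  # took too few from a, move the partition up

    raise ValueError("inputs must be sorted")
```

Because the search only runs over the shorter list, this takes O(log(min(m, n))) time instead of the linear time needed to merge.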