How PayPal solved their Thundering Herd Problem

Using Exponential Backoff and Jitter when retrying failed jobs to prevent downstream services from getting trampled.

Hey Everyone!

Today we’ll be talking about

  • How PayPal solved their Thundering Herd Problem

    • Using Ruby on Rails and ActiveJob for managing jobs

    • How retrying failed jobs was creating a "Thundering Herd"

    • Using Exponential Backoff with Jitter to solve the problem

  • Tech Snippets

    • Practical Deep Learning for Coders 2022 (Jeremy Howard recently published the 2022 lectures of his amazing FastAI course)

    • Anatomy of a Terminal Emulator

    • What I learned from Software Engineering at Google

    • Dissecting Startup Failure Rates by Stage


How PayPal solved their Thundering Herd Problem

Braintree is a fintech company that makes it easy for companies to process payments from their customers. They provide a payment gateway so companies can process credit and debit card transactions by calling the Braintree API. In 2018, Braintree processed over 6 billion transactions and their customers include Airbnb, GitHub, Dropbox, OpenTable and more.

PayPal acquired Braintree in 2013, so the company comes under the PayPal umbrella.

One of the APIs Braintree provides is the Disputes API, which merchants can use to manage credit card chargebacks (when a customer tries to reverse a credit card transaction due to fraud, poor experience, etc).

The traffic to this API is highly irregular and difficult to predict, so Braintree uses autoscaling and asynchronous processing where feasible.

One of the issues Braintree engineers dealt with was the thundering herd problem where a huge number of Disputes jobs were getting queued in parallel and bringing down the downstream service.

Anthony Ross is a senior engineering manager at Braintree, and he wrote a great blog post on the cause of the issue and how his team solved it with exponential backoff and by introducing randomness/jitter.

Here’s a summary

Braintree uses Ruby on Rails for their backend and they make heavy use of a component of Rails called ActiveJob. ActiveJob is a framework to create jobs and run them on a variety of queueing backends (you can use popular Ruby job frameworks like Sidekiq, Shoryuken and more as your backend).

This makes picking between queueing backends more of an operational concern, and allows you to switch between backends without having to rewrite your jobs.

Here’s the architecture of the Disputes API service.

Merchants interact via SDKs with the Disputes API. Once submitted, Braintree enqueues a job to AWS Simple Queue Service to be processed.

ActiveJob then manages the jobs in SQS and handles their execution by talking to various Processor services in Braintree’s backend.

The Problem

Braintree setup the Disputes API, ActiveJob and the Processor services to autoscale whenever there was an increase in traffic.

Despite this, engineers were seeing a spike in failures in ActiveJob whenever traffic went up. They have robust retry logic setup so that jobs that fail will be retried a certain number of times before they’re pushed into the dead letter queue (to store messages that failed so engineers can debug them later).

The retry logic had ActiveJob attempt the retries again after a set time interval, but the retries were failing again.

The issue was a classic example of the thundering herd problem. As traffic increased (and ActiveJob hadn’t autoscaled up yet), a large number of jobs would get queued in parallel. They would then hit ActiveJob and trample down the Processor services (resulting in the failures).

Then, these failed jobs would retry on a static interval, where they’d also be combined with new jobs from the increasing traffic, and they would trample the service down again. The original jobs would have to be retried as well as new jobs that failed.

This created a cycle that kept repeating until the retries were exhausted and eventually DLQ’d (placed in the dead letter queue). 

To solve  this, Braintree used a combination of two tactics: Exponential Backoff and Jitter.

Exponential Backoff

Exponential Backoff is an algorithm where you reduce the rate of requests exponentially by increasing the amount of time delay between the requests.

The equation you use to calculate the time delay looks something like this…

time delay between requests = (base)^(number of requests)

where base is a parameter you choose.

With this, the amount of time between requests increases exponentially as the number of requests increases.

However, exponential backoff alone wasn’t solving Braintree’s problems.

By just using exponential backoff, the retries + new jobs still weren't spread out enough and there were clusters of jobs that all got the same sleep time interval. Once that time interval passed, these failed jobs all flooded back in and trampled over the service again.

To fix this, Braintree added jitter (randomness).

Jitter

Jitter is where you add randomness to the time interval the requests that you’re applying exponential backoff to.

To prevent the requests from flooding back in at the same time, you’ll spread them out based on the randomness factor in addition to the exponential function. By adding jitter, you can space out the spike of jobs to an approximately constant rate between now and the exponential backoff time.

Here’s an example of calls that are spaced out by just using exponential backoff.

The time interval between calls is increasing exponentially, but there are still clusters of calls between 0 ms and 250 ms, near 500 ms, and then again near 900 ms.

In order to smooth these clusters out, you can introduce randomness/jitter to the time interval.

With Jitter, our time delay function looks like

time delay between requests = random_between(0, (base)^(number of requests))

This results in a time delay graph that looks something like below.

Now, the calls are much more evenly spaced out and there’s an approximately constant rate of calls.

For more details, you can read the full article by Braintree here.

Here’s a good article on Exponential Backoff and Jitter from the AWS Builders Library, if you’d like to learn more about that.

How did you like this summary?

Your feedback really helps me improve curation for future emails.

Login or Subscribe to participate in polls.

Tech Snippets

  • FastAI's Practical Deep Learning for Coders 2022 - The Fast AI course is an amazing (and free) practical introduction to machine learning (with a focus on deep learning). Jeremy Howard just released the 2022 lectures for the FastAI course, where he gives lectures on Natural Language Processing (with an emphasis on Transformers), Random Forests, Image Processing (with CNNs), Collaborative Filtering and more. He also talks about some of the recent innovations in the field like DALL-E 2, GPT-3, Google Pathways and more. You don't need prior ML/Math knowledge for the course.

  • The Anatomy of a Terminal Emulator - This is a great blog post that goes through all the different components of a terminal and how they interact. It goes through interacting with the shell, drawing the UI and more. Code examples are written in Rust, but they're comprehensible to non-Rust devs and also have explanations.

  • What I learned from Software Engineering at Google - Software Engineering at Google is an awesome book on the engineering culture, processes and tools at Google. Swizec wrote a great blog post where he goes through his takeaways from the book and how he's applying the lessons for software engineering that isn't at Google-scale.

  • Dissecting Startup Failure Rates by Stage - This isn't about software, but it seems like a lot of you are interested in working at startups (where a significant amount of your salary will be in stock options). This post is an awesome look at what percentage of startups fail to exit/raise another round. It's good to be aware of these probabilities when considering a startup job offer.

Interview Question

You are given a non-negative number n.

Count the number of primes less than n and return the count.

Previous Question

Here's a correction to our last solution. Apologies again for the error!

You are given a math equation that consists of positive integers and +, -, * and / operators (addition, subtraction, multiplication and division).

Write a function that computes the result.

Be sure to follow order of operations (you will not be given parentheses).

Input - “2 * 3 + 5 / 6 * 3 + 15”

Output - 23.5

Solution

First of all, it’s important to remember order of operations.

  1. Solve any multiplication/division operations

  2. Solve any addition/subtraction operations

The underlying data structure necessary to solve this question is a stack.

As we iterate through the input string, we'll use a variable called currentNumber to keep track of the current number that we're iterating through.

As we get new digits in the current number, we'll multiply the current number by 10 (to move everything up one place)  and then add the new digit in the ones place.

When we come across a sign (+, -, * or /) or reach the end of the input string, then we'll perform an operation on the current number based on the previous sign we saw. If there was no previous sign (this is the first one in the equation), then we'll assume the previous sign was a +.

If the previous sign was for subtraction, then we'll append the current number multiplied by negative 1 to the stack.

If the previous sign was for addition, then we'll append the current number to the stack.

If the previous sign was for multiplication or division, then we'll pop off the last number on the stack and perform the operation with our current number. Then, we'll add the result back onto the stack.

After, we'll change the previous sign to whatever the current sign is and we'll reset the current number to 0.

When we're done iterating through the input string, we can sum all the numbers in the stack and that'll be our final answer.