How eBay uses Fault Injection Testing

Many companies use Fault Injection Testing at the infrastructure level, but eBay tests at the application level. Plus, how PayPal solved their Thundering Herd problem.

Hey Everyone!

Today we’ll be talking about

  • Fault Injection Testing at eBay

    • When you're working with a massive distributed system, small failures can have big, unintuitive, cascading effects.

    • Fault Injection Testing is a method where you test how your system reacts to various failures.

    • Most practitioners inject faults at the infrastructure level, but eBay took a new approach where they inject them at the application level.

  • How PayPal solved their Thundering Herd Problem

    • PayPal has a disputes API that gets irregular traffic. They were having an issue where a large burst of requests was causing downstream services to crash.

    • The failed jobs were being retried but this was creating a "Thundering Herd" effect that would bring down the downstream service again.

    • They solved this by using Exponential Backoff with Jitter.

  • Tech Snippets

    • Exploring the Internals of GitHub Copilot

    • Ways of Integrating Data from Different Backend Services in your Frontend

    • Challenging Algorithms and Data Structures to Try

    • Avoid Exception Throwing for Control Flow and Performant Code

Happy holidays to you and your family!

How to Choose the Right Database for Your Workloads

There’s been a huge amount of research in new database paradigms over the last two decades.

In addition to relational databases, you also have Key-Value, Document, Graph, Columnar, Time Series, Multi-modal and more.

These databases make tradeoffs around the on-disk data format, indexing structures, compression, memory type, etc., and this results in vastly different read/write characteristics.

Which database should you choose for your use case?

InfluxDB has a great on-demand webinar that delves into the factors you should be considering and the pros/cons of each paradigm.

sponsored

How PayPal solved their Thundering Herd Problem

Braintree is a fintech company that makes it easy for companies to process payments from their customers. They provide a payment gateway so companies can process credit and debit card transactions by calling the Braintree API. In 2018, Braintree processed over 6 billion transactions and their customers include Airbnb, GitHub, Dropbox, OpenTable and more.

PayPal acquired Braintree in 2013, so the company comes under the PayPal umbrella.

One of the APIs Braintree provides is the Disputes API, which merchants can use to manage credit card chargebacks (when a customer tries to reverse a credit card transaction due to fraud, poor experience, etc).

The traffic to this API is irregular and difficult to predict, so Braintree uses autoscaling and asynchronous processing where feasible.

One of the issues Braintree engineers dealt with was the thundering herd problem where a huge number of Disputes jobs were getting queued in parallel and bringing down the downstream service.

Anthony Ross is a senior engineering manager at Braintree, and he wrote a great blog post on the cause of the issue and how his team solved it with exponential backoff and by introducing randomness/jitter.

Here’s a summary

Braintree uses Ruby on Rails for their backend and they make heavy use of a component of Rails called ActiveJob. ActiveJob is a framework to create jobs and run them on a variety of queueing backends (you can use popular Ruby job frameworks like Sidekiq, Shoryuken and more as your backend).

This makes picking between queueing backends more of an operational concern, and allows you to switch between backends without having to rewrite your jobs.

Here’s the architecture of the Disputes API service.

Merchants interact via SDKs with the Disputes API. Once submitted, Braintree enqueues a job to AWS Simple Queue Service to be processed.

ActiveJob then manages the jobs in SQS and handles their execution by talking to various Processor services in Braintree’s backend.
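
Braintree’s implementation is Ruby on Rails with ActiveJob, but as a rough sketch of the enqueue step, here’s what pushing a Disputes job message onto SQS could look like in Python with boto3. The queue URL and payload are hypothetical, purely for illustration:

```python
import json
import boto3

# Hypothetical queue URL and payload, just to illustrate the enqueue step;
# Braintree's actual implementation is Ruby/ActiveJob.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/disputes-jobs"

sqs = boto3.client("sqs")

def enqueue_dispute_job(dispute_id):
    # Push a small job message onto the queue; a worker (ActiveJob in
    # Braintree's case) picks it up and calls the Processor services.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"job": "ProcessDispute", "dispute_id": dispute_id}),
    )
```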

The Problem

Braintree set up the Disputes API, ActiveJob and the Processor services to autoscale whenever there was an increase in traffic.

Despite this, engineers were seeing a spike in failures in ActiveJob whenever traffic went up. They had robust retry logic set up so that failed jobs would be retried a certain number of times before being pushed into the dead letter queue (which stores failed messages so engineers can debug them later).

The retry logic had ActiveJob attempt the retries again after a set time interval, but the retries were failing again.

The issue was a classic example of the thundering herd problem. As traffic increased (and ActiveJob hadn’t autoscaled up yet), a large number of jobs would get queued in parallel. They would then hit ActiveJob and trample down the Processor services (resulting in the failures).

Then, these failed jobs would retry after a static interval, where they’d combine with new jobs from the still-increasing traffic and trample the service down again. Now both the original jobs and the newly failed jobs needed to be retried.

This created a cycle that kept repeating until the retries were exhausted and eventually DLQ’d (placed in the dead letter queue). 

To solve this, Braintree used a combination of two tactics: Exponential Backoff and Jitter.

Exponential Backoff

Exponential Backoff is an algorithm where you reduce the rate of requests exponentially by increasing the amount of time delay between the requests.

The equation you use to calculate the time delay looks something like this…

time delay between requests = (base)^(number of requests)

where base is a parameter you choose.

With this, the amount of time between requests increases exponentially as the number of requests increases.
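
Here’s a minimal sketch of that delay calculation in Python. The base value and the cap are illustrative parameters, not Braintree’s actual configuration (the cap isn’t part of the formula above, but it’s a common addition to keep the longest wait bounded):

```python
def backoff_delay(attempt, base=2, cap=300):
    """Seconds to wait before retry number `attempt`."""
    # delay = base ^ (number of requests), capped so it doesn't grow forever
    return min(cap, base ** attempt)

# attempt: 1  2  3  4   5
# delay:   2  4  8  16  32  (seconds)
for attempt in range(1, 6):
    print(attempt, backoff_delay(attempt))
```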

However, exponential backoff alone wasn’t solving Braintree’s problems.

By just using exponential backoff, the retries + new jobs still weren't spread out enough and there were clusters of jobs that all got the same sleep time interval. Once that time interval passed, these failed jobs all flooded back in and trampled over the service again.

To fix this, Braintree added jitter (randomness).

Jitter

Jitter is where you add randomness to the backoff time interval of the requests that you’re applying exponential backoff to.

To prevent the requests from flooding back in at the same time, you’ll spread them out based on the randomness factor in addition to the exponential function. By adding jitter, you can space out the spike of jobs to an approximately constant rate between now and the exponential backoff time.

Here’s an example of calls that are spaced out by just using exponential backoff.

The time interval between calls is increasing exponentially, but there are still clusters of calls between 0 ms and 250 ms, near 500 ms, and then again near 900 ms.

In order to smooth these clusters out, you can introduce randomness/jitter to the time interval.

With Jitter, our time delay function looks like

time delay between requests = random_between(0, (base)^(number of requests))

This results in a time delay graph that looks something like below.

Now, the calls are much more evenly spaced out and there’s an approximately constant rate of calls.
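
Here’s the same calculation sketched with full jitter added, matching the random_between formula above (base and cap are again illustrative values, not Braintree’s actual configuration):

```python
import random

def backoff_with_jitter(attempt, base=2, cap=300):
    # Pick a random delay between 0 and the exponential backoff ceiling,
    # so retries from the same burst spread out instead of all waking up
    # at the same instant.
    return random.uniform(0, min(cap, base ** attempt))
```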

For more details, you can read the full article by Braintree here.

Here’s a good article on Exponential Backoff and Jitter from the AWS Builders Library, if you’d like to learn more about that.

How did you like this summary?

Your feedback really helps me improve curation for future emails.



Tech Snippets

  • Copilot Internals - It's very likely that tools like GitHub Copilot will become an integral part of the developer workflow over the next few years. It's always useful to have a high-level idea of how your tools work, so this is a great article that delves into Copilot. It goes into detail on Prompt Engineering, how Copilot communicates with the large language model, and the telemetry data that Copilot collects. By the way, if you're interested in learning more about Prompt Engineering, here's a cool free course that's currently being developed that dives into it.

  • Challenging Algorithms Every Programmer Should Try - This is a short list of some interesting algorithms and data structures to look into. An interesting data structure on the list is the Bloom Filter, which is similar to a hash table. A Bloom filter uses significantly less space than a hash table and it can tell you with certainty if an item is not in the filter. They're used extensively in database storage engines. Log Structured Storage Engines like RocksDB use Bloom Filters to quickly tell you if a certain key/value pair doesn't exist in the engine.

  • Avoid Using Exceptions for Control Flow - You should avoid using exceptions to branch on the type of a value (using a try/catch to replace an if statement, for example). It's ugly, harder to maintain and also far slower. The author goes into more detail in this blog post.

  • Frontend Integration Options - If you have a complex backend, then you'll have data stored in different locations that gets integrated together and then served to the user. One way of integrating this data is doing it on the frontend by making API calls to the various backends. This is a great article that goes through different ways of doing this and compares their pros/cons. It talks about different types of Single Page Applications and other options.

System Design Articles

If you'd like weekly articles on System Design concepts (in addition to the Quastor summaries), consider upgrading to Quastor Pro!

Past articles cover topics like

  • Load Balancers - L4 vs. L7 load balancers and load balancing strategies like Round Robin, Least Connections, Hashing, etc.

  • Database Storage Engines - Different types of Storage Engines and the tradeoffs made. Plus a brief intro to MyISAM, InnoDB, LevelDB and RocksDB.

  • Log Structured Merge Trees - Log Structured Storage Engines and how the LSM Tree data structure works and the tradeoffs it makes.

  • Chaos Engineering - The principles behind Chaos Engineering and how to implement them. Plus case studies from Facebook, LinkedIn, Amazon, and more.

  • Backend Caching Strategies - Ways of caching data and their tradeoffs plus methods of doing cache eviction.

  • Row vs. Column Oriented Databases - How OLTP and OLAP databases store data on disk and different file formats for storing and transferring data.

  • API Paradigms - Request-Response APIs with REST and GraphQL vs. the Event Driven paradigm with Webhooks and Websockets.

It's $12 a month or $10 per month if you pay yearly. Thanks a ton for your support.

I'd highly recommend using your Learning & Development budget for Quastor Pro!

Fault Injection Testing at eBay

eBay is one of the world’s largest e-commerce companies with over 130 million users worldwide. The company facilitates both consumer-to-consumer transactions and business-to-consumer transactions, so you can buy items from another person or from a business.

The eBay Notification Platform team is responsible for building the service that pushes notifications to third party applications. These notifications contain information on item price changes, changes in inventory, payment status and more.

To build this notification service, the team relies on many external dependencies, including a distributed data store, message queues, other eBay API endpoints and more.

Faults in any of these dependencies will directly impact the stability of the Notification Platform, so the eBay team runs continuous experiments where they simulate failures in these dependencies to understand the behaviors of the system and mitigate any weaknesses.

Wei Chen is a software engineer at eBay, and he wrote a great blog post delving into how eBay does this. Many practitioners of fault injection testing do so at the infrastructure level, but the eBay team has taken a novel approach where they inject faults at the application layer.

We’ll first give a brief overview of fault injection testing, and then talk about how eBay runs these experiments.

Introduction to Fault Injection Testing

When you’re running a massive distributed system with thousands of components, it becomes impossible to just reason about your system. There are far too many things that can go wrong, and building a complete mental model of hundreds or thousands of different services is close to impossible.

Jacob Gabrielson was a Principal Engineer at Amazon and he wrote about an interesting failure at Amazon that demonstrates this: a site-wide failure of www.amazon.com caused by a single server having its hard disk fill up.

It was a catalog server and after its hard drive filled up, it started returning zero-length responses without doing any processing. These responses were sent much faster than the other catalog servers (because this server wasn’t doing any processing), so the load balancer started directing huge amounts of traffic to that failed catalog server. This had a cascading effect which brought the entire website down.

Failures like this are incredibly difficult to predict, so many companies implement Fault Injection Testing, where they deliberately introduce faults (with dependencies, infrastructure, etc.) into the system and observe how it reacts.

If you’re a Quastor Pro subscriber, you can read a full write up we did on Chaos Engineering here. You can upgrade here (I’d highly recommend using your job’s learning & development budget).

Implementing Fault Injection Testing

There’s a variety of ways to inject faults into your services. There are tools like AWS Fault Injection Simulator, Gremlin, Chaos Monkey and many more. Here’s a list of tools that you can use to implement fault injection testing.

These tools work on the infrastructure level, where they’ll do things like

  • Terminate VM instances

  • Fill up the hard drive space by creating a bunch of files

  • Terminate network connections

And more.

However, the eBay team didn’t want to go this route. They were wary that creating infrastructure level faults could cause other issues. Terminating the network connection for a server will cause issues if the server is being used for other services. Additionally, running these infrastructure layer tests can be expensive.

Instead, the eBay team decided to go with application level fault injection testing, where they would implement the fault testing directly in their codebase.

If they wanted to simulate a network fault, they would add latency in the HTTP client library to simulate timeouts. If they wanted to simulate a dependency returning an error, they’d inject a fault directly into the method so that it changed the value of the response to a 500 status code.

The Notification Platform is Java based, so eBay implemented the fault injection using a Java agent. Java agents are classes you create that can modify the bytecode of other Java applications as they’re loaded into the JVM, and they’re frequently used for tasks like instrumentation, profiling and monitoring.

The team used three patterns in the platform’s code to implement fault injection testing. 

Blocking Method Logic

In order to simulate a failure, they would add code to throw an exception or sleep for a certain period of time in the method.

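eBay does this with a Java agent that rewrites bytecode, but as a rough, language-agnostic sketch of the idea, here’s a Python decorator that blocks a method by sleeping or raising based on a fault configuration. The config values and the push_notification method are hypothetical, purely for illustration:

```python
import functools
import time

# Hypothetical fault configuration, the kind of thing that gets toggled
# from a configuration console.
FAULT_CONFIG = {"enabled": True, "mode": "delay", "delay_seconds": 5}

def inject_blocking_fault(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if FAULT_CONFIG["enabled"]:
            if FAULT_CONFIG["mode"] == "delay":
                # Simulate a slow dependency / network timeout.
                time.sleep(FAULT_CONFIG["delay_seconds"])
            elif FAULT_CONFIG["mode"] == "exception":
                # Simulate the dependency blowing up.
                raise RuntimeError("injected fault")
        return func(*args, **kwargs)
    return wrapper

@inject_blocking_fault
def push_notification(payload):
    # Hypothetical method that calls a downstream dependency.
    ...
```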

Changing the State of Method Parameters

Methods would take in HTTP status codes as parameters and check to see if the status code was 200 (success). If not, then failure logic would be triggered.

The eBay team simulated failures by changing the Response object passed in as the parameter so that it signaled an error.


Replacing Method Parameters

They also dynamically changed the value of input parameters to the method to trigger error conditions.

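Continuing the Python illustration (eBay’s real implementation rewrites Java bytecode), the second and third patterns, changing the state of a response parameter and replacing an input parameter, might look something like this. The class, flags and URL are hypothetical:

```python
# Hypothetical toggles, set from a configuration console.
INJECT_ERROR_RESPONSE = True
INJECT_BAD_ENDPOINT = False

class Response:
    def __init__(self, status_code, body=None):
        self.status_code = status_code
        self.body = body

def inject_error_response(response):
    # Change the state of the response parameter so the
    # "status code != 200" failure logic gets exercised.
    if INJECT_ERROR_RESPONSE:
        response.status_code = 500
        response.body = "injected server error"
    return response

def inject_bad_endpoint(endpoint_url):
    # Replace an input parameter (here, the endpoint URL) with one
    # that will fail, to trigger the error-handling branch.
    if INJECT_BAD_ENDPOINT:
        return "https://unreachable.invalid"
    return endpoint_url
```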

Configuration Management

To dynamically change the configuration for the fault injections, they implemented a configuration management console in the Java agent.

This gave developers a configuration page that they could use to change the attributes of the fault injection and turn it on/off.

They’re currently working on expanding the scope of application-level fault injection to more client libraries and on incorporating more fault scenarios.

For more details on exactly how the eBay team implemented this, you can read the full article here.

Previous Question

As a reminder, here’s our last question

Given an integer array nums that may contain duplicates, return all possible subsets.

The solution set must not contain duplicate subsets. Return the solution in any order.

Solution

This question is asking us to return the power set given a set of integers (nums).

The power set is defined as the set of all subsets (including the empty set and the set itself).

In the worst case (where nums is not empty and consists of all unique integers), there will be 2^N subsets in our power set (where N is the number of integers in nums).

Therefore, our worst case time complexity will be at least O(2^N).

Finding the power set will require a backtracking approach.

With backtracking, you incrementally build up your solution until you reach some block (no more solutions are possible). Then, you recurse back up the stack and incrementally build solutions in a different direction.

Let’s apply backtracking to the given question. Let’s say nums is [1,2,3].

We’ll use powerset, an array of arrays, to keep track of all the subsets. After we finish backtracking, our function can just return powerset.

We’ll use curSol to hold our current solution, starting with curSol equal to an empty array.

The empty set is one valid subset of nums, so we’ll add curSol to powerset.

Now, we’ll add 1 to curSol, making it [1].

This is a valid subset, so we add it to powerset.

We’ll continue down this path and add 2 to curSol, making it [1,2].

Another valid subset, so we add it to powerset.

We continue down and add 3 to curSol, making it [1,2,3].

Again, a valid subset, so we add it to powerset.

Now, we’ve reached a block. There are no more integers in nums to add to curSol.

Therefore, we’ll back out of the current solution by popping 3 off curSol, making curSol equal to [1,2].

However, since we’ve already gone down the 3 path, there still aren’t any numbers in nums that we can add to curSol.

Therefore, we’ll back out of the current solution by popping 2 off, making curSol equal to [1].

Now, we have the 3 path to explore. Therefore, we’ll add 3 to curSol, making it [1,3].

This is a new valid subset, so we’ll add it to our powerset.

At this point, we’ve reached another block. There are no new integers in nums that we can add to curSol.

Therefore, we’ll pop off 3 and go back to where curSol is [1].

We’re still at a block, since there are no new integers in nums.

So, we’ll pop off 1, and go back to the empty set.

Now, we’ll add 2 to curSol, making curSol equal to [2]. Again, a valid subset, so we add it to powerset. Then, we continue down this branch in the same way, adding 3 to get [2, 3], and so on.

Hopefully, you’re getting the gist of how the backtracking approach works now.

Here’s the Python 3 code.
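
Here’s one way to write it, consistent with the walkthrough above. The function and variable names are one possible choice (cur_sol corresponds to curSol in the walkthrough), and since nums may contain duplicates, the sketch sorts nums first and skips repeated values at the same depth so no duplicate subsets are produced:

```python
def subsets_with_dup(nums):
    # Sort so duplicates sit next to each other; this lets us skip
    # repeated values at the same depth and avoid duplicate subsets.
    nums.sort()
    powerset = []
    cur_sol = []

    def backtrack(start):
        # Every state of cur_sol is a valid subset, so record a copy.
        powerset.append(cur_sol[:])
        for i in range(start, len(nums)):
            # Skip a value we've already branched on at this depth.
            if i > start and nums[i] == nums[i - 1]:
                continue
            cur_sol.append(nums[i])   # add nums[i] to the current solution
            backtrack(i + 1)          # explore this branch
            cur_sol.pop()             # back out (backtrack)

    backtrack(0)
    return powerset


print(subsets_with_dup([1, 2, 3]))
# [[], [1], [1, 2], [1, 2, 3], [1, 3], [2], [2, 3], [3]]
```

The subsets come out in the same order as the walkthrough: [], [1], [1, 2], [1, 2, 3], [1, 3], [2], [2, 3], [3].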