Scaling Microservices at DoorDash
An in-depth guide on common microservice failures. Plus, how to build deep relationships in Tech, an Intro to Fuzz Testing and more!
Hey Everyone!
Today we'll be talking about
Building Robust Microservices at DoorDash
Common Architectural Patterns
Pros/Cons of Microservices
DoorDash switched from a Python 2 monolith to a microservices architecture.
This brings many new potential failures. Some common ones DoorDash experienced were cascading failures, retry storms, death spirals and metastable failures.
Three techniques to reduce these failures are predictive auto-scaling, load shedding and circuit breaking.
Tech Snippets
How Facebook Created the Largest Memcached System in the World
Introduction to Fuzz Testing
A Byte of Vim
An Animated Guide to Load Balancing Algorithms
Guide to Building Reliable Microservices
When designing your architecture, there’s several common patterns you’ll see discussed.
Monolith - All the components and functionalities are packed into a single codebase and operate under a single application. This can lead to organizational challenges as your engineering team grows to thousands of developers.
Modular Monolith - You still have a single codebase with all functionality packed into a single application. However, the codebase is organized into modular components where each is responsible for a distinct functionality.
Services-Oriented Architecture - Decompose the backend into large services where each service is it’s own, separate application. These services communicate over a network. You might have an Authentication service, a Payment Management service, etc.
Microservices Architecture - break down the application into small, independent services where each is responsible for a specific business functionality and operates as its own application. Microservices is a type of services-oriented architecture.
Distributed Monolith - This is an anti-pattern that can arise if you don’t do microservices/SOA properly. This is where you have a services-oriented architecture with services that appear autonomous but are actually closely intertwined. Don’t do this.
With the rise of cloud computing platforms, the Microservices pattern has exploded in popularity.
Here’s some of the Pros/Cons of using Microservices
Pros
Some of the pros are
Organizational Scaling - The main benefit from Microservices is that it’s easier to structure your organization. Companies like Netflix, Amazon, Google, etc. have thousands of engineers (or tens of thousands). Having them all work on a single monolith is very difficult to coordinate.
Polyglot - If you’re at a large tech company, you might want certain services to be built in Java, others in Python, some in C++, etc. Having a microservices architecture (where different services talk through a common interface) makes this easier to implement.
Independent Scaling - You can scale a certain microservice independently (add/remove machines) depending on how much load is coming on that service.
Faster Deployments - Services can be deployed independently. You don’t have to worry about an unrelated team having an issue that’s preventing you from deploying.
Downsides
Some of the downsides are
Complexity - Using Microservices introduces a ton of new failure modes and makes debugging significantly harder. We’ll be talking about some of these failures in the next section (as well as how DoorDash handles them).
Inefficiency - Adding network calls between all your services will introduce latency, dropped packets, timeouts, etc.
Overhead - You’ll need to add more components to your architecture to facilitate service-to-service communication. Technologies like a service mesh (we discussed this in our Netflix article), load balancers, distributed tracing (to make debugging less of a pain) and more.
If you’d like to read about a transition from microservices back to a monolith then Amazon Prime wrote a great blog post about their experience.
Scaling Microservices at DoorDash
Now, we’ll talk about how DoorDash handled some of the complexity/failures that a Microservices architecture brings.
DoorDash is the largest food delivery marketplace in the US with close to 40 million users. You can use their mobile app or website to order items from restaurants, convenience stores, supermarkets and more.
In 2020, they migrated from a Python 2 monolith to a microservices architecture. DoorDash engineers wrote a great blog post going through the most common microservice failures they’ve experienced and how they dealt with them.
The failures they wrote about were
Cascading Failures
Retry Storms
Death Spirals
Metastable Failures
We’ll describe each of these failures, talk about how they were handled at a local level and then describe how DoorDash is attempting to mitigate them at a global level.
Cascading Failures
Cascading failures describes a general issue where the failure of a single service can lead to a chain reaction of failures in other services.
DoorDash talked about an example of this in May of 2022, where some routine database maintenance temporarily increased read/write latency for the service. This caused higher latency in upstream services which created errors from timeouts.
The increase in error rate then triggered a misconfigured circuit breaker (a circuit breaker reduces the number of requests that’s sent to a degraded service) which resulted in an outage in the app that lasted for 3 hours.
When you have a distributed system of interconnected services, failures can easily spread across your system and you’ll have to put checks in place to manage them (discussed below).
Retry Storms
One of the ways a failure can spread across your system is through retry storms.
Making calls from one backend service to another is unreliable and can often fail due to completely random reasons. A garbage collection pause can cause increased latencies, network issues can result in timeouts and more.
Therefore, retrying a request can be an effective strategy for temporary failures.
However, retries can also worsen the problem while the downstream service is unavailable/slow. The retries result in work amplification (a failed request will be retried multiple times) and can cause an already degraded service to degrade further.
Death Spiral
With cascading failures, we were mainly talking about issues spreading vertically. If there is a problem with service A, then that impacts the health of service B (if B depends on A). Failures can also spread horizontally, where issues in some nodes of service A will impact (and degrade) the other nodes within service A.
An example of this is a death spiral.
You might have service A that’s running on 3 machines. One of the machines goes down due to a network issue so the incoming requests get routed to the other 2 machines. This causes significantly higher CPU/memory utilization, so one of the remaining two machines crashes due to a resource saturation failure. All the requests are then routed to the last standing machine, resulting in significantly higher latencies.
Metastable Failure
Many of the failures experienced at DoorDash are metastable failures. This is where there is some positive feedback loop within the system that is causing higher than expected load in the system (causing failures) even after the initial trigger is gone.
For example, the initial trigger might be a surge in users. This causes one of the backend services to load shed and respond to certain calls with a 429 (rate limit).
Those callers will retry their calls after a set interval, but the retries (plus requests from new users) overwhelm the backend service again and cause even more load shedding. This creates a positive feedback loop where calls are retried (along with new calls), get rate limited, retry again, and so on.
This is called the Thundering Herd problem and is one example of a Metastable failure. The initial spike in users can cause issues in the backend system far after the surge has ended.
Countermeasures
DoorDash has a couple techniques they use to deal with these issues. These are
Load Shedding - a degraded service will drop requests that are “unimportant” (engineers configure which requests are considered important/unimportant)
Circuit Breaking - if service A is sending service B requests and service A notices a spike in B’s latencies, then circuit breakers will kick in and reduce the number of calls service A makes to service B
Auto Scaling - adding more machines to the server pool for a service when it’s degraded. However, DoorDash avoids doing this reactively (discussed further below).
All these techniques are implemented locally; they do not have a global view of the system. A service will just look at its dependencies when deciding to circuit break, or will solely look at its own CPU utilization when deciding to load shed.
Local Countermeasures
Load Shedding
With many backend services, you can rank incoming requests by how important they are. A request related to logging might be less important than a request related to a user action.
With Load Shedding, you temporarily reject some of the less important traffic to maximize the goodput (traffic value + throughput) during periods of stress (when CPU/memory utilization is high).
At DoorDash, they instrumented each server with an adaptive concurrency limit from the Netflix library concurrency-limit. This integrates with gRPC and automatically adjusts the maximum number of concurrent requests according to changes in the response latency. When a machine takes longer to respond, the library will reduce the concurrency limit to give each request more compute resources. It can be configured to recognize the priorities of requests from their headers.
Cons of Load Shedding
An issue with load shedding is that it’s very difficult to configure and properly test. Having a misconfigured load shedder will cause unnecessary latency in your system and can be a source of outages.
Services will require different configuration parameters depending on their workload, CPU/memory resources, time of day, etc. Auto-scaling services might mean you need to change the latency/utilization level at which you start to load shed.
Circuit Breaker
While load shedding rejects incoming traffic, circuit breakers will reject outgoing traffic from a service.
They’re implemented as a proxy inside the service and monitor the error rate from downstream services. If the error rate surpasses a configured threshold, then the circuit breaker will start rejecting all outbound requests to the troubled downstream service.
DoorDash built their circuit breakers into their internal gRPC clients.
Cons of Circuit Breaking
The cons are similar to Load Shedding. It’s extremely difficult to determine the error rate threshold at which the circuit breaker should switch on. Many online sources use a 50% error rate as a rule of thumb, but this depends entirely on the downstream service, availability requirements, etc.
Auto-Scaling
When a service is experiencing high resource utilization, an obvious solution is to add more machines to that service’s server pool.
However, DoorDash recommends that teams do not use reactive-auto-scaling. Doing so can temporarily reduce cluster capacity, making the problem worse.
Newly added machines will need time to warm up (fill cache, compile code, etc.) and they’ll run costly startup tasks like opening database connections or triggering membership protocols.
These behaviors can reduce resources for the warmed up nodes that are serving requests. Additionally, these behaviors are infrequent, so having a sudden increase can produce unexpected results.
Instead, DoorDash recommends predictive auto-scaling, where you expand the cluster’s size based on expected traffic levels throughout the day.