Scaling Microservices at DoorDash

An in-depth guide to common microservice failures. Plus, how to build deep relationships in tech, an intro to fuzz testing, and more!

Hey Everyone!

Today we'll be talking about

  • Building Robust Microservices at DoorDash

    • Common Architectural Patterns

    • Pros/Cons of Microservices

    • DoorDash switched from a Python 2 monolith to a microservices architecture.

    • This introduced many new potential failure modes. Some common ones DoorDash experienced were cascading failures, retry storms, death spirals and metastable failures.

    • Three techniques to reduce these failures are predictive auto-scaling, load shedding and circuit breaking.

    • DoorDash also talked about their use of Aperture, an open source reliability management system.

  • How to Build Deep Relationships in Tech

    • How to Reach Out to People You Don’t Know

    • Understand their Incentives and Offer Something

    • Don’t be Afraid of Reaching Out to Reconnect

    • Getting Burned is Part of the Process

  • Tech Snippets

    • How Facebook Created the Largest Memcached System in the World

    • Introduction to Fuzz Testing

    • A Byte of Vim

    • Building a CDN from Scratch

    • An Animated Guide to Load Balancing Algorithms

Guide to Building Reliable Microservices

When designing your architecture, there are several common patterns you’ll see discussed.

  • Monolith - All the components and functionalities are packed into a single codebase and operate under a single application. This can lead to organizational challenges as your engineering team grows to thousands (tens of thousands?) of developers.

  • Modular Monolith - You still have a single codebase with all functionality packed into a single application. However, the codebase is organized into modular components where each is responsible for a distinct functionality.

  • Service-Oriented Architecture - Decompose the backend into large services where each service is its own, separate application. These services communicate over a network. You might have an Authentication service, a Payment Management service, etc.

  • Microservices Architecture - Break down the application into small, independent services where each is responsible for a specific business functionality and operates as its own application. Microservices is a type of service-oriented architecture.

  • Distributed Monolith - This is an anti-pattern that can arise if you don’t do microservices/SOA properly: you end up with a service-oriented architecture whose services appear autonomous but are actually tightly intertwined. Don’t do this.
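To make the distinction concrete, here’s a minimal sketch (with hypothetical names and endpoints) contrasting how a modular monolith and a microservices architecture would handle the same payment-status lookup. The monolith version is a plain function call; the microservice version crosses the network, which is where the new failure modes come from.

```python
import json
import urllib.request

# --- Modular monolith: payments is just another module in the same process ---
def payments_get_status(order_id: str) -> str:
    # Stand-in for the real business logic living in the payments module.
    return "PAID"

def get_payment_status_monolith(order_id: str) -> str:
    # A plain in-process function call: no network, no timeouts.
    return payments_get_status(order_id)

# --- Microservices: payments is a separate application reached over the network ---
def get_payment_status_microservice(order_id: str) -> str:
    # Hypothetical internal endpoint; every call can now time out, drop, or error.
    url = f"http://payments.internal/v1/payments/{order_id}"
    with urllib.request.urlopen(url, timeout=2.0) as resp:
        return json.load(resp)["status"]
```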

With the rise of cloud computing platforms, the Microservices pattern has exploded in popularity.

From Google Trends

Here are some of the pros and cons of using microservices.

Pros

Some of the pros are

  • Organizational Scaling - The main benefit of microservices is that it’s easier to structure your organization. Companies like Netflix, Amazon, Google, etc. have thousands (or tens of thousands) of engineers. Having them all work on a single monolith is very difficult to coordinate.

  • Polyglot - If you’re at a large tech company, you might want certain services to be built in Java, others in Python, some in C++, etc. Having a microservices architecture (where different services talk through a common interface) makes this easier to implement.

  • Independent Scaling - You can scale a certain microservice independently (add/remove machines) depending on how much load that service is handling.

  • Faster Deployments - Services can be deployed independently. You don’t have to worry about an unrelated team having an issue that’s preventing you from deploying.

Downsides

Some of the downsides are

  • Complexity - Using Microservices introduces a ton of new failure modes and makes debugging significantly harder. We’ll be talking about some of these failures in the next section (as well as how DoorDash handles them).

  • Inefficiency - Adding network calls between all your services will introduce latency, dropped packets, timeouts, etc.

  • Overhead - You’ll need to add more components to your architecture to facilitate service-to-service communication: a service mesh (we discussed this in our Netflix article), load balancers, distributed tracing (to make debugging less of a pain) and more.

If you’d like to read about a transition from microservices back to a monolith, the Amazon Prime Video team wrote a great blog post about their experience.

Scaling Microservices at DoorDash

Now, we’ll talk about how DoorDash handled some of the complexity/failures that a Microservices architecture brings.

DoorDash is the largest food delivery marketplace in the US with over 30 million users in 2022. You can use their mobile app or website to order items from restaurants, convenience stores, supermarkets and more.

In 2020, they migrated from a Python 2 monolith to a microservices architecture. DoorDash engineers wrote a great blog post going through the most common microservice failures they’ve experienced and how they dealt with them.

The failures they wrote about were

  • Cascading Failures

  • Retry Storms

  • Death Spirals

  • Metastable Failures

We’ll describe each of these failures, talk about how they were handled at a local level and then describe how DoorDash is attempting to mitigate them at a global level.

Cascading Failures

Cascading failure describes a general class of issues where the failure of a single service leads to a chain reaction of failures in other services.

DoorDash talked about an example of this from May of 2022, when some routine database maintenance temporarily increased read/write latency for the service. This caused higher latency in upstream services, which created errors from timeouts.

The increase in error rate then triggered a misconfigured circuit breaker (a circuit breaker reduces the number of requests that are sent to a degraded service), which resulted in an outage in the app that lasted for 3 hours.

When you have a distributed system of interconnected services, failures can easily spread across your system and you’ll have to put checks in place to manage them (discussed below).

Retry Storms

One of the ways a failure can spread across your system is through retry storms.

Making calls from one backend service to another is unreliable; calls can fail for completely random, transient reasons. A garbage collection pause can cause increased latency, network issues can result in timeouts, and more.

Therefore, retrying a request can be an effective strategy for temporary failures.

However, retries can also worsen the problem while the downstream service is unavailable/slow. The retries result in work amplification (a failed request will be retried multiple times) and can cause an already degraded service to degrade further.
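A common defense is to bound retries with a budget and spread them out with exponential backoff plus jitter. Here’s a minimal generic sketch of that idea (not DoorDash’s actual client code; the constants are arbitrary):

```python
import random
import time

MAX_ATTEMPTS = 3    # retry budget: give up instead of retrying forever
BASE_DELAY_S = 0.1
MAX_DELAY_S = 2.0   # cap so backoff can't grow unbounded

def call_with_retries(send_request):
    """Retry a flaky zero-argument callable with capped backoff and full jitter."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return send_request()
        except Exception:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # budget exhausted: surface the failure
            # Full jitter de-synchronizes clients so their retries don't
            # slam the degraded service in waves.
            delay = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

Even with backoff, the retry budget matters: it caps work amplification at a known factor (here, at most 3x the original request volume).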

Death Spiral

With cascading failures, we were mainly talking about issues spreading vertically. If there is a problem with service A, then that impacts the health of service B (if B depends on A). Failures can also spread horizontally, where issues in some nodes of service A will impact (and degrade) the other nodes within service A.

An example of this is a death spiral.

You might have service A that’s running on 3 machines. One of the machines goes down due to a network issue so the incoming requests get routed to the other 2 machines. This causes significantly higher CPU/memory utilization, so one of the remaining two machines crashes due to a resource saturation failure. All the requests are then routed to the last standing machine, resulting in significantly higher latencies.
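Some made-up numbers show why the survivors keep crashing: with fixed incoming traffic, each node failure raises the per-node load on the remaining machines.

```python
TOTAL_RPS = 300          # steady incoming traffic (hypothetical)
CAPACITY_PER_NODE = 140  # what one machine can sustain (hypothetical)

for healthy_nodes in (3, 2, 1):
    load = TOTAL_RPS / healthy_nodes
    status = "OK" if load <= CAPACITY_PER_NODE else "SATURATED -> likely to crash"
    print(f"{healthy_nodes} node(s): {load:.0f} rps each ({status})")

# 3 node(s): 100 rps each (OK)
# 2 node(s): 150 rps each (SATURATED -> likely to crash)
# 1 node(s): 300 rps each (SATURATED -> likely to crash)
```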

Metastable Failure

Many of the failures experienced at DoorDash are metastable failures. This is where a positive feedback loop within the system keeps load higher than expected (and keeps causing failures) even after the initial trigger is gone.

For example, the initial trigger might be a surge in users. This causes one of the backend services to load shed and respond to certain calls with a 429 (rate limit).

Those callers will retry their calls after a set interval, but the retries (plus requests from new users) overwhelm the backend service again and cause even more load shedding. This creates a positive feedback loop where calls are retried (along with new calls), get rate limited, retry again, and so on.

This is called the Thundering Herd problem and is one example of a metastable failure: the initial spike in users can cause issues in the backend system long after the surge has ended.
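A toy simulation (all numbers made up) makes the feedback loop visible: if every shed request comes back as a retry on the next tick, offered load stays above capacity even after new traffic returns to normal.

```python
CAPACITY = 100   # requests the service can serve per tick (hypothetical)
retries = 0      # failed requests that will be retried next tick

for tick in range(8):
    new_traffic = 150 if tick < 2 else 100  # brief surge, then back to normal
    offered = new_traffic + retries          # new calls plus retried failures
    served = min(offered, CAPACITY)
    retries = offered - served               # every shed request retries next tick
    print(f"tick {tick}: offered={offered} served={served} retry_backlog={retries}")

# After the surge ends (tick >= 2), offered load stays pinned at 200 -- twice
# capacity -- because the retry backlog regenerates itself every tick. The
# overload outlives the trigger, which is exactly what makes it metastable.
```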

Countermeasures

DoorDash uses a few techniques to deal with these issues. These are

  • Load Shedding - a degraded service will drop requests that are “unimportant” (engineers configure which requests are considered important/unimportant)

  • Circuit Breaking - if service A is sending service B requests and service A notices a spike in B’s latencies, then circuit breakers will kick in and reduce the number of calls service A makes to service B

  • Auto Scaling - adding more machines to the server pool for a service when it’s degraded. However, DoorDash avoids doing this reactively (discussed further below).

All these techniques are implemented locally; they do not have a global view of the system. A service will just look at its dependencies when deciding to circuit break, or will solely look at its own CPU utilization when deciding to load shed.

To solve this, DoorDash has been testing out an open source reliability management system called Aperture to act as a centralized load management system that coordinates across all the services in the backend to respond to ongoing outages.

We’ll talk about the techniques DoorDash uses and also about how they use Aperture.

Local Countermeasures

Load Shedding

With many backend services, you can rank incoming requests by how important they are. A request related to logging might be less important than a request related to a user action.

With load shedding, you temporarily reject some of the less important traffic to maximize goodput (the throughput of valuable requests that are actually served) during periods of stress (when CPU/memory utilization is high).

At DoorDash, they instrumented each server with an adaptive concurrency limit from Netflix’s concurrency-limits library. This integrates with gRPC and automatically adjusts the maximum number of concurrent requests according to changes in the response latency. When a machine takes longer to respond, the library reduces the concurrency limit to give each request more compute resources. It can be configured to recognize request priorities from their headers.
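Here’s a toy sketch of the general idea behind a latency-based adaptive concurrency limit. This is an AIMD-style simplification, not the actual algorithm in Netflix’s library, and the numbers are arbitrary:

```python
class AdaptiveConcurrencyLimiter:
    """Toy AIMD limiter: grow the concurrency limit additively while latency
    is healthy, shrink it multiplicatively when latency degrades."""

    def __init__(self, limit=100, min_limit=1, max_limit=1000, target_latency_ms=50.0):
        self.limit = limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.target_latency_ms = target_latency_ms
        self.in_flight = 0

    def try_acquire(self) -> bool:
        # Shed the request once in-flight work hits the current limit.
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def release(self, observed_latency_ms: float) -> None:
        self.in_flight -= 1
        if observed_latency_ms > self.target_latency_ms:
            # Latency degraded: back off multiplicatively to relieve pressure.
            self.limit = max(self.min_limit, int(self.limit * 0.9))
        else:
            # Healthy response: probe for more capacity one slot at a time.
            self.limit = min(self.max_limit, self.limit + 1)
```

In a gRPC server, an interceptor would call try_acquire() before handling each request, reject with a rate-limit status when it returns False (checking a priority header first, so important traffic is shed last), and call release() with the observed latency when the request finishes.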

Cons of Load Shedding

An issue with load shedding is that it’s very difficult to configure and test properly. A misconfigured load shedder can add unnecessary latency to your system and be a source of outages.

Services will require different configuration parameters depending on their workload, CPU/memory resources, time of day, etc. Auto-scaling services might mean you need to change the latency/utilization level at which you start to load shed.

Circuit Breaker

While load shedding rejects incoming traffic, circuit breakers will reject outgoing traffic from a service.

They’re implemented as a proxy inside the service and monitor the error rate from downstream services. If the error rate surpasses a configured threshold, then the circuit breaker will start rejecting all outbound requests to the troubled downstream service.

DoorDash built their circuit breakers into their internal gRPC clients.
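Here’s a minimal sketch of the classic circuit breaker state machine (closed, open, half-open). It’s a generic illustration rather than DoorDash’s gRPC implementation, and the thresholds are arbitrary:

```python
import time

class CircuitBreaker:
    """Trips open when the recent error rate crosses a threshold, then lets a
    trial request through after a cooldown (the half-open state)."""

    def __init__(self, error_threshold=0.5, window=20, cooldown_s=30.0):
        self.error_threshold = error_threshold
        self.window = window      # number of recent calls to track
        self.cooldown_s = cooldown_s
        self.results = []         # rolling record: True = success, False = failure
        self.opened_at = None     # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: let traffic through
        # Open: reject until the cooldown elapses, then allow a trial request.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record(self, success: bool) -> None:
        self.results = (self.results + [success])[-self.window:]
        if self.opened_at is not None and success:
            # Trial request succeeded: close the circuit again.
            self.opened_at = None
            self.results.clear()
            return
        failures = self.results.count(False)
        if len(self.results) == self.window and failures / self.window >= self.error_threshold:
            self.opened_at = time.monotonic()  # trip open: fail fast from now on
```

The calling client wraps each outbound request: if allow_request() returns False, it fails fast (or serves a fallback) instead of piling more work onto the degraded downstream service.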

Cons of Circuit Breaking

The cons are similar to Load Shedding. It’s extremely difficult to determine the error rate threshold at which the circuit breaker should switch on. Many online sources use a 50% error rate as a rule of thumb, but this depends entirely on the downstream service, availability requirements, etc.

Auto-Scaling

When a service is experiencing high resource utilization, an obvious solution is to add more machines to that service’s server pool.

However, DoorDash recommends that teams do not use reactive auto-scaling. Doing so can temporarily reduce the cluster’s capacity, making the problem worse.

Newly added machines will need time to warm up (fill cache, compile code, etc.) and they’ll run costly startup tasks like opening database connections or triggering membership protocols.

These behaviors can take resources away from the warmed-up nodes that are serving requests. Additionally, they’re infrequent, so a sudden increase in them can produce unexpected results.

Instead, DoorDash recommends predictive auto-scaling, where you expand the cluster’s size based on expected traffic levels throughout the day.
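As a toy example of the predictive approach (all numbers and names hypothetical), you can derive replica counts from a historical traffic forecast for each hour of the day and apply them ahead of the expected peak, rather than reacting to live CPU:

```python
import math

# Hypothetical hourly traffic forecast (requests/sec) built from historical data.
EXPECTED_RPS_BY_HOUR = {11: 900, 12: 1500, 13: 1400, 18: 2000, 19: 2200}
DEFAULT_RPS = 500
RPS_PER_REPLICA = 100   # measured capacity of one warmed-up instance
HEADROOM = 1.3          # over-provision 30% to absorb forecast error

def replicas_for_hour(hour: int) -> int:
    expected = EXPECTED_RPS_BY_HOUR.get(hour, DEFAULT_RPS)
    return max(2, math.ceil(expected * HEADROOM / RPS_PER_REPLICA))

for hour in (10, 12, 19):
    print(f"{hour:02d}:00 -> {replicas_for_hour(hour)} replicas")
# 10:00 -> 7 replicas, 12:00 -> 20 replicas, 19:00 -> 29 replicas
```

Because the scale-up happens before the traffic arrives (e.g. a scheduled job applies the counts an hour early), new nodes get time to warm up instead of being thrown into an ongoing overload.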

Aperture for Reliability Management

One issue with load shedding, circuit breaking and auto-scaling is that these tools only have a localized view of the system. Factors they can consider include their own resource utilization, direct dependencies and number of incoming requests. However, they can’t take a global view of the system and make decisions based on that.

Aperture is an open source reliability management system that can add these capabilities. It offers a centralized load management system that collects reliability-related metrics from different systems and uses them to build a global view.

It has three components:

  • Observe - Aperture collects reliability-related metrics (latency, resource utilization, etc.) from each node using a sidecar and aggregates them in Prometheus. You can also feed in metrics from other sources like InfluxDB, Docker Stats, Kafka, etc.

  • Analyze - A controller will monitor the metrics in Prometheus and track any deviations from the service-level objectives you set. You set these in a YAML file and Aperture stores them in etcd, a popular distributed key-value store.

  • Actuate - If any of the policies are triggered, then Aperture will activate configured actions like load shedding or distributed rate limiting across the system.
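Putting the three components together, the control loop conceptually looks like the sketch below. This is an illustration only, not Aperture’s actual API or policy language, and all names and numbers are made up:

```python
import random

LATENCY_SLO_MS = 200  # the kind of objective you'd declare in a policy file

def observe() -> float:
    """Stand-in for querying aggregated p99 latency from Prometheus."""
    return random.uniform(50, 400)

def actuate(overload_ratio: float) -> None:
    """Stand-in for pushing a decision (e.g. a shed rate) out to the agents."""
    shed_pct = min(90, int(100 * (overload_ratio - 1)))
    print(f"shedding {shed_pct}% of low-priority traffic")

# Analyze: the controller compares observed metrics against the objective
# and actuates load shedding across services when the policy is violated.
for _ in range(5):
    p99 = observe()
    if p99 > LATENCY_SLO_MS:
        actuate(p99 / LATENCY_SLO_MS)
    else:
        print(f"p99={p99:.0f}ms is within the SLO; no action")
```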

DoorDash set up Aperture in one of their primary services and sent some artificial requests to load test it. They found that it functioned as a powerful, easy-to-use global rate limiter and load shedder.

For more details on how DoorDash used Aperture, you can read the full blog post here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.


Tech Snippets

  • How Facebook Created the Largest Memcached System in the World

  • Introduction to Fuzz Testing

  • A Byte of Vim

  • Building a CDN from Scratch

  • An Animated Guide to Load Balancing Algorithms

How to Build Deep Relationships in Tech

Rahul Pandey and Alex Chiou are building an awesome community called Taro for software engineers who want to level up in their careers. Both of them previously worked in leadership roles at FAANG companies and have extensive experience leading teams, shipping code and climbing the career ladder.

They published a fantastic 90-minute masterclass on strategies and tactics for building relationships in tech.

You need to be a Taro premium subscriber to watch the full masterclass, but Rahul kindly let me share a couple of the key points and insights that they went over.

At the core, building relationships in tech consists of two steps:

  1. Identify Important People to Know - your time is limited so it’s impossible to network with everyone. Instead, you should identify specific people who can be helpful to your growth (and also whether you can help them).

  2. Build Relationships with the People You Identified - You’ll need to reach out to the people from step 1, build a relationship, and maintain it over time.

Here’s a list of things to keep in mind when doing this.

How to Frame the Interaction

So, let’s say you identify someone you want to meet. This could be someone at your company or maybe it’s a total stranger at another company you want to work at.

A good way to start building a relationship is to ask them if you can schedule a 1-on-1. This is relatively straightforward if you’re already working at the same company, but it can be difficult if you’re reaching out to someone cold.

Here are Rahul’s tips on doing cold outreach.

  • Mention your background + credibility - you should show them that you’re worthy of their time and attention. What have you accomplished (or what are you in the process of accomplishing)? Make it clear if there’s anything you can offer the person.

  • Mention Commonalities - Look at your current network and see if you know anyone this person knows. Perhaps you went to the same school or you’re both from the same hometown.

  • Put them on a Pedestal - Explain why you’re reaching out to this person in particular. This is a great place to praise the person’s accomplishments (Dale Carnegie, author of How to Win Friends and Influence People, has written extensively about how important it is to genuinely praise others).

  • Have a specific ask - Clarify what you are actually looking for. A bad ask is “can you be my mentor”. A good ask is “I’m having some trouble with XYZ. Can I get 20 minutes of your time to pick your brain on XYZ?”

Understand Their Incentives

If you just go to people and ask them for value without offering anything, then you’re being a bit selfish. Instead, put some thought into what you have to offer the other person.

Figure out what the person is looking for.

Maybe they’d be interested in a job at company Z, and you have a close friend who works there. Or, perhaps you both work at the same company and you can tell their manager about how helpful they were.

If you think hard enough, there’s almost always some way in which you can help the other person. It’s almost impossible to have lived for over 20 years and not have a single thing you can offer to other people.

You might have interesting life experiences to share, learnings from past failures, common hobbies, etc.

Kindness is 10x more valuable when it’s unexpected

Leading with kindness helps you break the “tit-for-tat” mentality, where you only offer something if someone can help you in return. It’s also much more appreciated when it’s clear that you’re not expecting anything in return.

The vast majority of people will strive to return the favor to you. If they don’t, then maybe they’re not a good fit for your network!

Therefore, it’s really useful to be creative in finding ways to lift the people around you and help others as much as you can.

Put yourself out there

Don’t just wait for other people to come to you. Building relationships is an important part of your career, and it’s especially useful in times like now, when the job market is poor and few companies are hiring.

Having a strong recommendation is a guaranteed first-round interview at many companies.

You should also reach out to people to reconnect. Relationships need to be maintained, so it’s important to be proactive about reconnecting with past colleagues, old friends, etc.

Rahul found that one trait of the best networkers is that they’re extremely good at finding excuses to reach out and reconnect with people. It could be a birthday, a job change or some industry news that’s relevant.

You’re Going to Get Burned And That’s Okay

You will get burned doing this. There will be times when you invest time and effort trying to build a relationship with someone and they just don’t reciprocate.

It’s important that you don’t take this as a reflection on yourself.

Sometimes people are just busy and don’t have any time they can spare. Or perhaps the relationship isn’t a good fit.

Regardless, building relationships and proactively networking has limited downside and unlimited upside.

The only loss you’ll face is the time and effort you put into building the connection. The upside is unlimited. You could get new job opportunities, promotions, life-long friends and much more.

These are some of the key points from the 90 minute talk Alex and Rahul gave on building deep relationships in tech.

You can view the full talk here if you’re a Taro premium subscriber. If you’re not a premium subscriber, then you can use our link to get a 20% discount on the platform.

This is a referral link, so Quastor gets an affiliate fee if you purchase. However, I felt the contents of the masterclass were genuinely very useful, so that’s the reason I included this summary in Quastor.