How DoorDash Deploys at Scale

We'll talk about how DoorDash uses Blue-Green and Canary deployments, plus the specific metrics they monitor to catch regressions. Also, how to debug and lessons from 8 years of Kubernetes.

Hey Everyone!

Today we’ll be talking about

  • How DoorDash Deploys at Scale

    • Deployment strategies with Blue-Green, Canary and Rolling deployments.

    • Advice from Google’s Site Reliability Engineering handbook on picking the right testing metrics when checking for regressions

    • Specific metrics DoorDash uses for their app like Page Load Duration, Action Load Duration and more

  • Tech Snippets

    • How to Build Extremely Quickly

    • Lessons from 8 Years of Kubernetes

    • How to Debug

AuthKit is a fully customizable login box that you can easily plug into your application for user authentication and authorization. It’s built with Radix components, so it’s super sleek while also being very powerful.

AuthKit comes with built-in features like password strength validation, leaked password checking, bot detection and more.

Recently, AuthKit added Role-based Access Control (RBAC) as another core capability. With RBAC, you create roles (admin, developer, etc.) and set permissions per role. Users are then assigned specific roles based on their level of access. This approach can be very scalable as you don’t have to set a bunch of individual permissions when you’re adding new users.
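
To make that concrete, here’s a minimal sketch of the role-to-permission mapping that RBAC boils down to. The roles, permissions, and function names below are made up for illustration and are not AuthKit’s actual API.

```typescript
// Minimal RBAC sketch: permissions hang off roles, and users only carry roles.
// Role and permission names here are illustrative assumptions.
type Permission = "read:orders" | "write:orders" | "manage:users";
type Role = "admin" | "developer" | "support";

const rolePermissions: Record<Role, Permission[]> = {
  admin: ["read:orders", "write:orders", "manage:users"],
  developer: ["read:orders", "write:orders"],
  support: ["read:orders"],
};

interface User {
  id: string;
  roles: Role[];
}

function can(user: User, permission: Permission): boolean {
  return user.roles.some((role) => rolePermissions[role].includes(permission));
}

// Adding a new user is just a role assignment; no per-user permission list.
const newHire: User = { id: "u_123", roles: ["developer"] };
console.log(can(newHire, "write:orders")); // true
console.log(can(newHire, "manage:users")); // false
```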

AuthKit seamlessly handles RBAC and scales up to 1 million users for free with the WorkOS platform. It can also be used with Directory Sync so you can manage your users’ roles and permissions from a single source of truth instead of manually managing hundreds of different SaaS tools.

partnership

How DoorDash Deploys at Scale

When you’re building an application at scale, one under-appreciated aspect you’ll have to think about is your strategy around deploying updates.

You’ll have to send the new release to all the servers in your fleet and make sure that there’s no disruption to end-users during this process. After you finish deploying the new update, you’ll have to continue monitoring the app to make sure there’s no significant drop in performance.

If something does go wrong during this release process, then that means you’ll have to send out quite a few Uber Eats gift cards.

DoorDash recently published a fantastic blog post delving into how they monitor the app for performance degradations while rolling out updates. We’ll summarize the DoorDash article and also add context on deployment strategies and how to pick the metrics you monitor.

Deployment Strategies

Your deployment strategy defines how your team releases new versions of your app to your production environment. A good deployment strategy helps you minimize the risk of introducing bugs and maximize uptime.

Some commonly used deployment strategies are

  • Blue-Green Deployments - In a Blue-Green setup, you first have your production environment (Blue environment) that is running the current application and serving real users. When you want to deploy a new version of the app, you spin up a new environment that’s identical to production (the Green environment).

    You install your new app version on all the machines in the Green environment and run your tests.

    Once you’ve validated the Green environment, you switch the Blue and Green environments and start sending user traffic to the environment with the new app. If something goes wrong, you can immediately reverse the switchover and go back to sending all user traffic to the old production environment.

    After ensuring that everything is working as expected, you can terminate the instances in the old production environment.

  • Canary Deployments - In a canary setup, you start by deploying the new app to a small group of users (these users are the canaries). This gives you a live environment where you can monitor and test the new app while minimizing the impact on the rest of the user base.

    After you validate the new app, you can gradually roll out the application to the entire user base.

  • Rolling Deployments - With rolling deployments, you gradually roll out the new app version to all the machines in your server pool. This continues until all the servers are running the new version. As you’re rolling the app out, you can run tests to make sure that users aren’t seeing performance regressions or any other issues.

     
    It’s common to combine Canary and Rolling deployments: you first deploy to a small canary group and then run a rolling deployment once you’ve validated that everything works with the new release.

At DoorDash, engineers use both Blue-Green and Canary deployments. Their backend is built on Kubernetes so they use Argo Rollouts to implement blue-green deployments and have a custom controller they built for Canary deployments. You can read more about this here.
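
To make the flow concrete, here’s a minimal sketch of the canary-then-rolling pattern in TypeScript. The functions `deployToFraction`, `metricsHealthy`, and `rollback` are hypothetical placeholders for whatever your deployment platform actually exposes (Argo Rollouts, a custom controller, etc.), and the traffic steps and wait time are arbitrary.

```typescript
// Hypothetical platform hooks; replace with your deployment tooling's real API.
declare function deployToFraction(version: string, fraction: number): Promise<void>;
declare function metricsHealthy(version: string): Promise<boolean>;
declare function rollback(version: string): Promise<void>;

async function progressiveRollout(version: string): Promise<void> {
  // Start with a small canary slice, then widen in rolling steps.
  const steps = [0.01, 0.1, 0.25, 0.5, 1.0];

  for (const fraction of steps) {
    await deployToFraction(version, fraction);

    // Give the new version time to serve real traffic before judging it.
    await new Promise((resolve) => setTimeout(resolve, 10 * 60 * 1000));

    if (!(await metricsHealthy(version))) {
      // Any regression sends all traffic back to the previous version.
      await rollback(version);
      throw new Error(`Rollout of ${version} aborted at ${fraction * 100}% of traffic`);
    }
  }
}
```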

Picking the Right Testing Metrics

When you’re running the deployment process, you should have a set of metrics that you’re monitoring to quickly identify any problems with the new release.

However, picking these metrics can be a challenge. Having too many metrics will result in lots of false alarms and can unnecessarily slow down your release process.

The Google Site Reliability Engineering handbook has a great section delving into what makes a metric good for testing new releases.

Some suggestions they have are

  1. Metrics Should Indicate Problems - You should stack-rank the metrics you want to evaluate based on how well they indicate actual, user-perceivable problems. If the new release causes a 10% increase in server CPU usage, that’s probably not worth canceling the deployment over. However, if it causes a 10% increase in 500 errors, then you probably don’t want to ship it.

  2. Metrics Should be Attributable - Metrics should not be influenced by external factors that are unrelated to the new release. If you’re monitoring the latency of a web service during a canary deployment, the latency metric should represent the impact of the changes that are being tested. If latency increases, it should be because of the new version of the app, not because of unrelated issues like network congestion or increased user traffic.
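
One common way to make a latency or error-rate metric attributable is to compare the canary group against a control group measured over the same time window, so anything fleet-wide (a traffic spike, network congestion) affects both sides equally. Here’s a rough sketch of that idea; the thresholds and field names are illustrative assumptions, not from the SRE book or DoorDash’s setup.

```typescript
interface GroupStats {
  errorRate: number;    // fraction of requests that returned 5xx
  p95LatencyMs: number; // 95th-percentile latency
}

function canaryLooksHealthy(canary: GroupStats, control: GroupStats): boolean {
  // Compare relative to the control group, not to an absolute threshold,
  // so fleet-wide effects cancel out and only canary-specific regressions fire.
  const errorRegression = canary.errorRate > control.errorRate * 1.1 + 0.001;
  const latencyRegression = canary.p95LatencyMs > control.p95LatencyMs * 1.1;
  return !(errorRegression || latencyRegression);
}

// Both groups saw the same traffic spike, so only the canary-specific
// regression fails the check.
console.log(canaryLooksHealthy(
  { errorRate: 0.004, p95LatencyMs: 480 },
  { errorRate: 0.002, p95LatencyMs: 450 },
)); // false -> the canary's error rate doubled relative to the control group
```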

Metrics Testing at DoorDash

During a rollout, DoorDash looks at the following metrics (a rough instrumentation sketch follows this list)

  • Page Load Duration - how long it takes to go from an empty screen to having the page fully loaded.

  • Action Load Duration - the time it takes from when the user interface receives a user input event to the action being successfully completed.

  • Unexpected Page Load Errors - the percentage of unexpected errors that occurred during page load. 

  • Unexpected Action Load Errors - the percentage of unexpected errors that occurred while executing a user’s action.

  • Crashes - the percentage of crashes that happen out of all page load events.
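
Since these are client-side metrics, the measurements start in the app itself. Here’s a rough sketch of how a web client might record Page Load Duration and Action Load Duration; the `reportMetric` hook and metric names are assumptions for illustration, not DoorDash’s actual instrumentation.

```typescript
// Hypothetical hook that ships events to whatever observability backend you use.
type MetricEvent = { name: string; durationMs: number; success: boolean };
declare function reportMetric(event: MetricEvent): void;

// Page Load Duration: empty screen -> page fully loaded.
window.addEventListener("load", () => {
  // performance.now() measures from navigation start (the "empty screen").
  reportMetric({ name: "page_load_duration", durationMs: performance.now(), success: true });
});

// Action Load Duration: user input event -> action successfully completed.
// Failed actions feed the Unexpected Action Load Error percentage.
async function timedAction<T>(name: string, action: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    const result = await action();
    reportMetric({ name, durationMs: performance.now() - start, success: true });
    return result;
  } catch (err) {
    reportMetric({ name, durationMs: performance.now() - start, success: false });
    throw err;
  }
}

// Usage: wrap the handler for an "add to cart" tap, for example.
// await timedAction("add_to_cart", () => api.addToCart(itemId));
```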

As your application scales, you’ll have to spend a lot more time thinking about how you implement authorization. Users will have different access permissions and you’ll need to manage this in a flexible, performant manner.

Two philosophies for building authorization are

  • Role-based authorization - permissions are attached to roles (admin, developer, etc.) and users are granted access by being assigned a role.

  • Resource-based authorization - permissions are attached to individual resources (a specific document, project, etc.), so access is decided by the user’s relationship to that resource.

Many companies start with role-based authorization and then transition to resource-based as they scale up.
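
As a rough illustration of the difference, here’s a sketch of what each check looks like; the types and rules are assumptions for illustration, not from the WorkOS guide. A role-based check only asks what role the user has, while a resource-based check asks about this user’s relationship to this specific resource.

```typescript
// Role-based: the check only looks at the user's role.
type Role = "admin" | "member" | "viewer";
interface RbacUser { id: string; role: Role }

function canDeleteProjectRbac(user: RbacUser): boolean {
  return user.role === "admin";
}

// Resource-based: the check looks at the relationship between this user and
// this specific resource, e.g. an ownership or sharing record.
interface Project { id: string; ownerId: string; sharedWith: string[] }

function canDeleteProjectResource(userId: string, project: Project): boolean {
  return project.ownerId === userId;
}

function canViewProjectResource(userId: string, project: Project): boolean {
  return project.ownerId === userId || project.sharedWith.includes(userId);
}
```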

WorkOS published a fantastic guide on role-based vs. resource-based authorization and the factors you’ll have to consider when implementing either.

They talk about real-world authorization systems like Google Zanzibar, Figma’s permissions system, Open Policy Agent and more.

partnership

Tech Snippets