How DoorDash Deploys at Scale
We'll talk about how DoorDash uses Blue-Green and Canary deployments, plus the specific metrics they monitor for regressions. Also, how to debug and lessons from 8 years of Kubernetes
Hey Everyone!
Today we’ll be talking about
How DoorDash Deploys at Scale
Deployment strategies with Blue-Green, Canary and Rolling deployments.
Advice from Google’s Site Reliability Handbook on picking the right testing metrics when checking for regressions
Specific metrics DoorDash uses for their app like Page Load Duration, Action Load Duration and more
Tech Snippets
How to Build Extremely Quickly
Lessons from 8 Years of Kubernetes
How to Debug
How DoorDash Deploys at Scale
When you’re building an application at scale, one under-appreciated aspect you’ll have to think about is your strategy around deploying updates.
You’ll have to send the new release to all the servers in your fleet and make sure that there’s no disruption to end-users during this process. After you finish deploying the new update, you’ll have to continue monitoring the app to make sure there’s no significant drop in performance.
If something does go wrong during this release process, then that means you’ll have to send out quite a few UberEats gift cards.
DoorDash recently published a fantastic blog post delving into how they monitor the app for performance degradations while rolling out updates. We’ll summarize the DoorDash article and also add context on deployment strategies and how you should pick the metrics you monitor.
Deployment Strategies
Your deployment strategy defines how your team releases new versions of your app into your production environment. A good deployment strategy helps you minimize the risk of introducing bugs and maximize uptime.
Some commonly used deployment strategies are
Blue-Green Deployments - In a Blue-Green setup, you first have your production environment (Blue environment) that is running the current application and serving real users. When you want to deploy a new version of the app, you spin up a new environment that’s identical to production (the Green environment).
You install your new app version on all the machines in the Green environment and run your tests.
Once you’ve validated the Green environment, you switch the Blue and Green environments and start sending user traffic to the environment with the new app. If something goes wrong, you can immediately reverse the switchover and go back to sending all user traffic to the old production environment. After ensuring that everything is working as expected, you can terminate the instances in the old production environment.
Canary Deployments - In a canary setup, you first start by deploying the new app to a small group of users (these users are the canaries). This gives you a live environment where you can monitor and test the new app while minimizing the impact to the user base.
After you validate the new app, you can gradually roll out the application to the entire user base.
Rolling Deployments - With rolling deployments, you gradually roll out the new app version to all the machines in your server pool. This continues until all the servers are running the new version. As you’re rolling the app out, you can run tests to make sure that users aren’t seeing performance regressions or any other issues (a sketch of this pattern follows this list).
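To make the rolling strategy concrete, here’s a minimal sketch of a batched rollout loop. The deploy_fn and health_fn callbacks are hypothetical placeholders for whatever your infrastructure uses to push a build and probe a server; this illustrates the pattern, not DoorDash’s tooling.

```python
import time

def rolling_deploy(servers, new_version, deploy_fn, health_fn,
                   batch_size=5, settle_seconds=60):
    """Roll out new_version across the pool in small batches.

    deploy_fn(server, version) and health_fn(server) are supplied by the
    caller (hypothetical hooks into your deploy tooling and health checks).
    """
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]

        # Upgrade one batch while the rest of the fleet keeps serving traffic.
        for server in batch:
            deploy_fn(server, new_version)

        # Give the new version time to warm up before judging it.
        time.sleep(settle_seconds)

        # Halt the rollout if any upgraded server is unhealthy, leaving
        # the remaining servers on the old, known-good version.
        if not all(health_fn(server) for server in batch):
            raise RuntimeError(f"Rollout halted at batch starting with {batch[0]}")
```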
It’s common to combine Canary and Rolling deployments: you first deploy to a small canary group, then run a rolling deployment once you’ve validated that everything is working with the new release.
At DoorDash, engineers use both Blue-Green and Canary deployments. Their backend runs on Kubernetes, so they use Argo Rollouts to implement Blue-Green deployments and a custom controller they built for Canary deployments. You can read more about this here.
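Argo Rollouts handles this declaratively, but the core Blue-Green mechanic is just repointing traffic from one pod set to another. Here’s a rough sketch of that switchover using the official Kubernetes Python client, assuming a Service that selects pods by a color label; it’s purely illustrative and not how DoorDash’s Argo setup works.

```python
# Illustrative only: assumes a Service named "web" whose selector routes by a
# "color" label, with separate blue/green Deployments already running
# (hypothetical names, not DoorDash's actual configuration).
from kubernetes import client, config

def switch_traffic(namespace: str, service_name: str, target_color: str):
    """Point the Service at the blue or green pod set by patching its selector."""
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    core = client.CoreV1Api()

    patch = {"spec": {"selector": {"app": "web", "color": target_color}}}
    core.patch_namespaced_service(service_name, namespace, patch)

# Cut traffic over to the new (green) environment; run the same call with
# "blue" to roll back instantly if the release misbehaves.
switch_traffic("production", "web", "green")
```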
Picking the Right Testing Metrics
When you’re running the deployment process, you should have a set of metrics that you’re monitoring to quickly identify any problems with the new release.
However, picking these metrics can be a challenge. Monitoring too many metrics will result in lots of false positives and can unnecessarily slow down your release process.
The Google Site Reliability Engineering handbook has a great section delving into what makes a metric good for testing new releases.
Some suggestions they have are
Metrics Should Indicate Problems - You should stack-rank the metrics you want to evaluate based on how well they indicate actual user-perceivable problems. If the new release is causing a 10% increase in server CPU usage, that’s probably not worth canceling the deployment over. However, if the new release is causing a 10% increase in 500 errors, then you probably don’t want to roll it out.
Metrics Should be Attributable - Metrics should not be influenced by external factors that are unrelated to the new release. If you’re monitoring the latency of a web service during a canary deployment, the latency metric should represent the impact of the changes being tested. If latency increases, it should be because of the new version of the app, not because of unrelated issues like network congestion or increased user traffic. One way to achieve this is to compare the canary against a control group over the same time window, as sketched below.
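Here’s a minimal sketch of such a gate: the canary’s error rate is judged relative to a control group measured over the same window, so external factors hit both sides equally. get_error_rate is a hypothetical query against your metrics store, not a real API.

```python
def canary_is_healthy(get_error_rate, window_minutes=30,
                      max_relative_increase=0.10):
    """Pass the canary only if its error rate isn't meaningfully above control."""
    canary_rate = get_error_rate(group="canary", window_minutes=window_minutes)
    control_rate = get_error_rate(group="control", window_minutes=window_minutes)

    # Compare against the control group, not an absolute threshold, so a
    # fleet-wide issue doesn't get blamed on (or hidden by) the new release.
    allowed = control_rate * (1 + max_relative_increase)
    return canary_rate <= allowed
```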
Metrics Testing at DoorDash
At DoorDash, they look at the following metrics (a rough sketch of how such metrics can be computed follows the list):
Page Load Duration - how long it takes to go from an empty screen to having the page fully loaded.
Action Load Duration - the time it takes from when the user interface receives a user input event to the action being successfully completed.
Unexpected Page Load Errors - the percentage of unexpected errors that occurred during page load.
Unexpected Action Load Errors - the percentage of unexpected errors that occurred while executing a user’s action.
Crashes - the percentage of crashes that happen out of all page load events.
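For a rough idea of how metrics like these can be derived from raw client telemetry, here’s a small sketch that computes a p95 page load duration and the unexpected page load error percentage. The event shape is a made-up example, not DoorDash’s actual schema.

```python
from statistics import quantiles

def summarize_page_loads(events):
    """Return p95 page load duration and the unexpected-error percentage."""
    page_loads = [e for e in events if e["type"] == "page_load"]
    durations = [e["duration_ms"] for e in page_loads if e.get("duration_ms")]

    # 95th percentile load time; n=20 splits the data into 5% buckets.
    p95_ms = quantiles(durations, n=20)[-1] if len(durations) >= 2 else None

    errors = sum(1 for e in page_loads if e.get("unexpected_error"))
    error_pct = 100 * errors / len(page_loads) if page_loads else 0.0

    return {"page_load_p95_ms": p95_ms,
            "unexpected_page_load_error_pct": error_pct}
```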