How DoorDash Deploys at Scale

We'll talk about how DoorDash uses Blue-Green and Canary deployments, plus the specific metrics they monitor to catch regressions. Also, how to debug and lessons from 8 years of Kubernetes.

Hey Everyone!

Today we’ll be talking about

  • How DoorDash Deploys at Scale

    • Deployment strategies with Blue-Green, Canary and Rolling deployments.

    • Advice from Google’s Site Reliability Engineering handbook on picking the right testing metrics when checking for regressions

    • Specific metrics DoorDash uses for their app like Page Load Duration, Action Load Duration and more

  • Tech Snippets

    • How to Build Extremely Quickly

    • Lessons from 8 Years of Kubernetes

    • How to Debug

AuthKit is a fully customizable login box that you can easily plug into your application for user authentication and authorization. It’s built with Radix components, so it’s super sleek while also being very powerful.

AuthKit comes with built-in features like password strength validation, leaked password checking, bot detection and more.

Recently, AuthKit added Role-based Access Control (RBAC) as another core capability. With RBAC, you create roles (admin, developer, etc.) and set permissions per role. Users are then assigned specific roles based on their level of access. This approach can be very scalable as you don’t have to set a bunch of individual permissions when you’re adding new users.
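
To make that concrete, here’s a minimal sketch of the role-to-permission mapping that RBAC boils down to. The roles, permissions, and function names below are made up for illustration and are not AuthKit’s actual API.

```typescript
// Minimal RBAC sketch: permissions hang off roles, and users only carry roles.
// Role and permission names here are illustrative assumptions.
type Permission = "read:orders" | "write:orders" | "manage:users";
type Role = "admin" | "developer" | "support";

const rolePermissions: Record<Role, Permission[]> = {
  admin: ["read:orders", "write:orders", "manage:users"],
  developer: ["read:orders", "write:orders"],
  support: ["read:orders"],
};

interface User {
  id: string;
  roles: Role[];
}

function can(user: User, permission: Permission): boolean {
  return user.roles.some((role) => rolePermissions[role].includes(permission));
}

// Adding a new user is just a role assignment; no per-user permission list.
const newHire: User = { id: "u_123", roles: ["developer"] };
console.log(can(newHire, "write:orders")); // true
console.log(can(newHire, "manage:users")); // false
```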

AuthKit seamlessly handles RBAC and scales up to 1 million users for free with the WorkOS platform. It can also be used with Directory Sync so you can manage your users’ roles and permissions from a single source of truth instead of manually managing hundreds of different SaaS tools.

partnership

How DoorDash Deploys at Scale

When you’re building an application at scale, one under-appreciated aspect you’ll have to think about is your strategy around deploying updates.

You’ll have to send the new release to all the servers in your fleet and make sure that there’s no disruption to end-users during this process. After you finish deploying the new update, you’ll have to continue monitoring the app to make sure there’s no significant drop in performance.

If something does go wrong during this release process, then that means you’ll have to send out quite a few Uber Eats gift cards.

DoorDash recently published a fantastic blog post delving into how they monitor the app for performance degradations while rolling out updates. We’ll summarize the DoorDash article and also add context on deployment strategies and how to pick the metrics you monitor.

Deployment Strategies

Your deployment strategy defines how your team releases new versions of your app to your production environment. A good deployment strategy helps you minimize the risk of introducing bugs and maximize uptime.

Some commonly used deployment strategies are

  • Blue-Green Deployments - In a Blue-Green setup, you first have your production environment (Blue environment) that is running the current application and serving real users. When you want to deploy a new version of the app, you spin up a new environment that’s identical to production (the Green environment).

    You install your new app version on all the machines in the Green environment and run your tests.

    Once you’ve validated the Green environment, you switch the Blue and Green environments and start sending user traffic to the environment with the new app. If something goes wrong, you can immediately reverse the switchover and go back to sending all user traffic to the old production environment.

    After ensuring that everything is working as expected, you can terminate the instances in the old production environment.

  • Canary Deployments - In a canary setup, you start by deploying the new app to a small group of users (these users are the canaries). This gives you a live environment where you can monitor and test the new app while minimizing the impact on the rest of the user base.

    After you validate the new app, you can gradually roll out the application to the entire user base.

  • Rolling Deployments - With rolling deployments, you gradually roll out the new app version to all the machines in your server pool. This continues until all the servers are running the new version. As you’re rolling the app out, you can run tests to make sure that users aren’t seeing performance regressions or any other issues.

     
    It’s common to combine Canary and Rolling deployments: you first deploy to a small canary group and then run a rolling deployment once you’ve validated that everything works with the new release.

At DoorDash, engineers use both Blue-Green and Canary deployments. Their backend is built on Kubernetes so they use Argo Rollouts to implement blue-green deployments and have a custom controller they built for Canary deployments. You can read more about this here.
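
To make the flow concrete, here’s a minimal sketch of the canary-then-rolling pattern in TypeScript. The functions `deployToFraction`, `metricsHealthy`, and `rollback` are hypothetical placeholders for whatever your deployment platform actually exposes (Argo Rollouts, a custom controller, etc.), and the traffic steps and wait time are arbitrary.

```typescript
// Hypothetical platform hooks; replace with your deployment tooling's real API.
declare function deployToFraction(version: string, fraction: number): Promise<void>;
declare function metricsHealthy(version: string): Promise<boolean>;
declare function rollback(version: string): Promise<void>;

async function progressiveRollout(version: string): Promise<void> {
  // Start with a small canary slice, then widen in rolling steps.
  const steps = [0.01, 0.1, 0.25, 0.5, 1.0];

  for (const fraction of steps) {
    await deployToFraction(version, fraction);

    // Give the new version time to serve real traffic before judging it.
    await new Promise((resolve) => setTimeout(resolve, 10 * 60 * 1000));

    if (!(await metricsHealthy(version))) {
      // Any regression sends all traffic back to the previous version.
      await rollback(version);
      throw new Error(`Rollout of ${version} aborted at ${fraction * 100}% of traffic`);
    }
  }
}
```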

Picking the Right Testing Metrics

When you’re running the deployment process, you should have a set of metrics that you’re monitoring to quickly identify any problems with the new release.

However, picking these metrics can be a challenge. Having too many metrics will result in lots of false alarms and can unnecessarily slow down your release process.

The Google Site Reliability Engineering handbook has a great section delving into what makes a metric good for testing new releases.

Some suggestions they have are

  1. Metrics Should Indicate Problems - You should stack-rank the metrics you want to evaluate based on how well they indicate actual, user-perceivable problems. If the new release causes a 10% increase in server CPU usage, that’s probably not worth canceling the deployment over. However, if it causes a 10% increase in 500 errors, then you probably don’t want to ship it.

  2. Metrics Should be Attributable - Metrics should not be influenced by external factors that are unrelated to the new release. If you’re monitoring the latency of a web service during a canary deployment, the latency metric should represent the impact of the changes that are being tested. If latency increases, it should be because of the new version of the app, not because of unrelated issues like network congestion or increased user traffic.
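
One common way to make a latency or error-rate metric attributable is to compare the canary group against a control group measured over the same time window, so anything fleet-wide (a traffic spike, network congestion) affects both sides equally. Here’s a rough sketch of that idea; the thresholds and field names are illustrative assumptions, not from the SRE book or DoorDash’s setup.

```typescript
interface GroupStats {
  errorRate: number;    // fraction of requests that returned 5xx
  p95LatencyMs: number; // 95th-percentile latency
}

function canaryLooksHealthy(canary: GroupStats, control: GroupStats): boolean {
  // Compare relative to the control group, not to an absolute threshold,
  // so fleet-wide effects cancel out and only canary-specific regressions fire.
  const errorRegression = canary.errorRate > control.errorRate * 1.1 + 0.001;
  const latencyRegression = canary.p95LatencyMs > control.p95LatencyMs * 1.1;
  return !(errorRegression || latencyRegression);
}

// Both groups saw the same traffic spike, so only the canary-specific
// regression fails the check.
console.log(canaryLooksHealthy(
  { errorRate: 0.004, p95LatencyMs: 480 },
  { errorRate: 0.002, p95LatencyMs: 450 },
)); // false -> the canary's error rate doubled relative to the control group
```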

Metrics Testing at DoorDash

During a rollout, DoorDash looks at the following metrics (a rough instrumentation sketch follows this list)

  • Page Load Duration - how long it takes to go from an empty screen to having the page fully loaded.

  • Action Load Duration - the time it takes from when the user interface receives a user input event to the action being successfully completed.

  • Unexpected Page Load Errors - the percentage of unexpected errors that occurred during page load. 

  • Unexpected Action Load Errors - the percentage of unexpected errors that occurred while executing a user’s action.

  • Crashes - the percentage of crashes that happen out of all page load events.
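
Since these are client-side metrics, the measurements start in the app itself. Here’s a rough sketch of how a web client might record Page Load Duration and Action Load Duration; the `reportMetric` hook and metric names are assumptions for illustration, not DoorDash’s actual instrumentation.

```typescript
// Hypothetical hook that ships events to whatever observability backend you use.
type MetricEvent = { name: string; durationMs: number; success: boolean };
declare function reportMetric(event: MetricEvent): void;

// Page Load Duration: empty screen -> page fully loaded.
window.addEventListener("load", () => {
  // performance.now() measures from navigation start (the "empty screen").
  reportMetric({ name: "page_load_duration", durationMs: performance.now(), success: true });
});

// Action Load Duration: user input event -> action successfully completed.
// Failed actions feed the Unexpected Action Load Error percentage.
async function timedAction<T>(name: string, action: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    const result = await action();
    reportMetric({ name, durationMs: performance.now() - start, success: true });
    return result;
  } catch (err) {
    reportMetric({ name, durationMs: performance.now() - start, success: false });
    throw err;
  }
}

// Usage: wrap the handler for an "add to cart" tap, for example.
// await timedAction("add_to_cart", () => api.addToCart(itemId));
```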

As your application scales, you’ll have to spend a lot more time thinking about how you implement authorization. Users will have different access permissions and you’ll need to manage this in a flexible, performant manner.

Two philosophies for building authorization are

  • Role-based authorization - permissions are attached to roles (admin, developer, etc.) and users are granted access by being assigned a role.

  • Resource-based authorization - permissions are attached to individual resources (a specific document, project, etc.), so access is decided by the user’s relationship to that resource.

Many companies start with role-based authorization and then transition to resource-based as they scale up.
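
As a rough illustration of the difference, here’s a sketch of what each check looks like; the types and rules are assumptions for illustration, not from the WorkOS guide. A role-based check only asks what role the user has, while a resource-based check asks about this user’s relationship to this specific resource.

```typescript
// Role-based: the check only looks at the user's role.
type Role = "admin" | "member" | "viewer";
interface RbacUser { id: string; role: Role }

function canDeleteProjectRbac(user: RbacUser): boolean {
  return user.role === "admin";
}

// Resource-based: the check looks at the relationship between this user and
// this specific resource, e.g. an ownership or sharing record.
interface Project { id: string; ownerId: string; sharedWith: string[] }

function canDeleteProjectResource(userId: string, project: Project): boolean {
  return project.ownerId === userId;
}

function canViewProjectResource(userId: string, project: Project): boolean {
  return project.ownerId === userId || project.sharedWith.includes(userId);
}
```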

WorkOS published a fantastic guide on role-based vs. resource-based authorization and the factors you’ll have to consider when implementing either.

They talk about real-world authorization systems like Google Zanzibar, Figma’s permissions system, Open Policy Agent and more.

partnership

Tech Snippets