How Dropbox Maintains 3 Nines of Availability

September 22, 2022

Hey Everyone!

Backend Caching Strategies and Techniques
- Benefits of adding a Cache Tier and the Downsides
- Cache Aside and Write Through strategies
- Cache Eviction with FIFO, LRU and LFU
How Dropbox maintains an uptime of 99.95% (< 20 mins down per month)
- Making issue detection easier with Vortex, Dropbox's server-side metrics and alerting system
- Diagnosing issues quickly with Grafana dashboards and exception tracking infrastructure
- Making the recovery process faster by asking engineers one simple question
Tech Snippets
- An organized and updated reading list with concepts, case studies and articles on building reliable large-scale systems.
- How Slack uses Kafka
- API versioning practices at LinkedIn
- A collection of programming tutorials that focus on Project-Based learning

We also have a solution to our last interview question on Binary Trees.

This is an example of content we'll be sending out in the premium version of Quastor. For future premium content, please subscribe here. I'd highly recommend you use your job's Education/L&D budget for Quastor.

If you don't have an L&D budget (or you're a student), reply to this email and I'll be more than happy to give you a discount.

Backend Caching Strategies

Caching is an important part of large scale system design. For many web applications, the database workload will be read intensive (users will send far more database read requests than writes) so finding a way to reduce the read load on your database can be very useful.

If the data being read doesn’t change often, then adding a caching layer can significantly reduce the latency users experience while also reducing the load on your database. The reduction in read requests frees up your database for writes (and reads that were missed by the cache).

The cache tier is a data store that temporarily stores frequently accessed data. It’s not meant for permanent storage and typically you’ll only store a small subset of your data in this layer.

When a client requests data, the backend will first check the caching tier to see if the data is cached. If it is, then the backend can retrieve the data from cache and serve it to the client. This is a cache hit. If the data isn’t in the cache, then the backend will have to query the database. This is a cache miss. Depending on how the cache is set up, the backend might write that data to the cache to avoid future cache misses.

A couple examples of popular data stores for caching tiers are Redis, Memcached, Couchbase and Hazelcast. Redis and Memcached are very popular and they’re offered as options with cloud cache services like AWS’s ElastiCache and Google Cloud’s Memorystore.

Redis and Memcached are in-memory, key-value data stores, so they can serve reads with a lower latency than disk-based data stores like Postgres. They’re in-memory data stores, so RAM is used as the primary method of storing and serving data while the disk is used for backups and logging. This translates to speed improvements as memory is much faster than disk.

Downsides of Caching

If implemented poorly, caching can result in increased latency to clients and add unnecessary load to your backend. Everytime there’s a cache miss, then the backend has just wasted time making a request to the cache tier.

If you have a high cache miss rate, then that means the cache tier is adding more latency than it’s reducing and you’d be faster off by removing it or changing your cache parameters to reduce the miss rate (discussed below).

Another downside of adding a cache tier is dealing with stale data. If the data you’re caching is static, then this isn’t an issue but you’ll frequently want to cache data that is dynamic. You’ll have to have a strategy for cache invalidation to minimize the amount of stale data you're sending to clients. We’ll talk about this below.

Implementing Caching

There are several ways of implementing caching in your application. We’ll go through a few of the main ones.

Cache Aside

The most popular method is a Cache Aside strategy.

Here’s the steps

Client requests data
The server checks the caching tier. If there’s a cache hit, then the data is immediately served.
If there’s a cache miss, then the server checks the database and returns the data.
The server writes the data to the cache.

Here, your cache is being loaded lazily, as data is only being cached after it’s been requested. You usually can’t store your entire dataset in cache, so lazy loading the cache is a good way to make sure the most frequently read data is cached.

However, this also means that the first time data is requested will always result in a cache miss. Developers solve this by cache warming, where you load data into the cache manually.

In order to prevent stale data, you’ll also give a Time-To-Live (TTL) whenever you cache an item. When the TTL expires, then that data is removed from the cache. Setting a very low TTL will reduce the amount of stale data but also result in a higher number of cache misses. You can read more about this tradeoff in this AWS Whitepaper.

An alternative caching method that minimizes stale data is the Write-Through cache.

Write Through

A Write Through cache can be viewed as an eager loading approach. Whenever there’s a change to the data, that change is reflected in the cache.

This helps solve the data consistency issues (avoid stale data) and it also prevents cache misses for when data is requested the first time.

Here’s the steps.

Client writes/modifies data.
Backend writes the change to both the database and also to the cache. You can also do this step asynchronously, where the change is written to the cache and then the database is updated after a delay (a few seconds, minutes, etc.). This is known as a Write Behind cache.
Clients can request data and the backend will try to serve it from the cache.

A Write Through strategy is often combined with a Read Through so that changes are propagated in the cache (Write Through) but missed cache reads are also written to the cache (Read Through).

You can read about more caching patterns in the Oracle Coherence Documentation (Coherence is a Java-based distributed cache).

Cache Eviction

The amount of storage you have available in your caching tier is usually a lot smaller than what you have available for your database.

Therefore, you’ll eventually reach a situation where your cache is full and you can’t add any new data to it.

To solve this, you’ll need a cache replacement policy. The ideal cache replacement policy will remove cold data (data that is not being read frequently) and replace it with hot data (data that is being read frequently).

There are many different possible cache replacement policies.

A few categories are

Queue-Based - Use a FIFO queue and evict data based on the order in which it was added regardless of how frequently/recently it was accessed.
Recency-Based - Discard data based on how recently it was accessed. This requires you to keep track of when each piece of data in your cache was last read. The Least Recently Used (LRU) policy is where you evict the data that was read least recently.
Frequency-Based - Discard data based on how many times it was accessed. You keep track of how many times each piece of data in your cache is being read. The Least Frequently Used (LFU) policy is where you evict data that was read the least.

The type of cache eviction policy you use depends on your use case. Picking the optimal eviction policy can massively improve your cache hit rate.

Want to learn more?

Here’s some helpful links on Caching Strategies. I referred to these sources while writing this article.

How Dropbox maintains 3 Nines of Availability

Dropbox is a file hosting company that stores exabytes of data for their 700 million users. The majority of their users are consumers but many enterprises also use Dropbox’s storage solutions.

It’s extremely important that Dropbox meets various service level objectives (SLOs) around availability, durability, security and more. Failing to meet these objectives means unhappy users (increased churn, bad PR, fewer sign ups) and lost revenue from enterprises who have service level agreements (SLAs) in their contracts with Dropbox.

The availability SLA in contracts is 99.9% uptime but Dropbox sets a higher internal bar of 99.95% uptime. This translates to less than 21 minutes of downtime allowed per month.

In order to meet their objectives, Dropbox has a rigorous process they execute whenever an incident comes up. They’ve also developed a large amount of tooling around this process to ensure that incidents are resolved as soon as possible.

Joey Beyda is a Senior Engineering Manager at Dropbox and Ross Delinger is a Site Reliability Engineer. They wrote a great blog post on incident management at Dropbox and how the company ensures great service for their users.

Here’s a Summary

With any incident, there’s 3 stages that engineers have to go through.

Detection - identify an issue and alert a responder
Diagnosis - the time it takes for responders to root-cause an issue and identify a resolution approach.
Recovery - the time it takes to mitigate the issue for users once a resolution approach is found.

Dropbox went into each of these stages and described the process and tooling they’ve built.

Detection

Whenever there’s an incident around availability, durability, security, etc. it’s important that Dropbox engineers are notified as soon as possible.

To accomplish this, Dropbox built Vortex, their server-side metrics and alerting system.

You can read a detailed blog post about the architecture of Vortex here. It provides an ingestion latency on the order of seconds and has a 10 second sampling rate. This allows engineers to be notified of any potential incident within tens of seconds of its beginning.

However, in order to be useful, Vortex needs well-defined metrics to alert on. These metrics are often use-case specific, so individual teams at Dropbox will need to configure them themselves.

To reduce the burden on service owners, Vortex provides a rich set of service, runtime and host metrics that come baked in for teams.

Noisy alerts can be a big challenge, as they can cause alarm fatigue which will increase the response time.

To address this, Dropbox built an alert dependency system into Vortex where service owners can tie their alerts to other alerts and also silence a page if the problem is in some common dependency. This helps on-call engineers avoid getting paged for issues that are not actionable by them.

Diagnosis

In the diagnosis stage, engineers are trying to root-cause the issue and identify possible resolution approaches.

To make this easier, Dropbox has built a ton of tooling to speed up common workflows and processes.

The on-call engineer will usually have to pull in additional responders to help diagnose the issue, so Dropbox added buttons in their internal service directory to immediately page the on-call engineer for a certain service/team.

They’ve also built dashboards with Grafana that list data points that are valuable to incident responders like

Client and server-side error rates
RPC latency
Exception trends
Queries Per Second

And more. Service owners can then build more nuanced dashboards that list team-specific metrics that are important for diagnosis.

One of the highest signal tools Dropbox has for diagnosing issues is their exception tracking infrastructure. It allows any service at Dropbox to emit stack traces to a central store and tag them with useful metadata.

Developers can then view the exceptions within their services through a dashboard.

Recovery

Once a resolution approach is found, the recovery process consists of executing that approach and resolving the incident.

To make this as fast as possible, Dropbox asked their engineers the following question - “Which incident scenarios for your system would take more than 20 minutes to recover from?”.

They picked the 20 minute mark since their availability targets were no more than 21 minutes of downtime per month.

Asking this question brought up many recovery scenarios, each with a different likelihood of occurring. Dropbox had teams rank the most likely recovery scenarios and then they tried to shorten these scenarios.

Examples of recovery scenarios that had to be shortened were

Promoting standby database replicas could take more than 20 minutes - If Dropbox lost enough primary replicas during a failure, then they might be forced to break their availability target if they had to promote a standby replica to primary. Engineers solved this by improving the tooling that handled database promotions.
Experiments and Feature Gates could be hard to roll back - If there was an experimentation-related issue, this could take longer than 20 minutes to roll back and resolve. To address this, engineers ensured all experiments and feature gates had a clear owner and that they provided rollback capabilities and a playbook to on-call engineers.

For more details, you can read the full blog post here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.

Tech Snippets

Awesome Scalability - This is a great Github repo with a ton of talks, case studies, blog posts and more on designing large-scale, performant systems. It lists resources focused on a specific area of design like performance, availability, scalability, and more. It also lists articles on scaling engineering teams and management.

Project-based Learning Tutorials - An extremely effective way to learn a new programming language is to build a small project using that language. This is an awesome repository that collects articles on building a toy project in a variety of languages. Languages include Rust, Swift, Python, Kotlin and more.

How Slack uses Kafka - This is an awesome blog post from Slack Engineering on how they built and operationalized a Kafka cluster to run at scale.Slack uses Kafka extensively for their Job Queue and for moving around mission critical data (analytics, logging, billing data and more)This blog post covers how (and why) Slack automated the operational aspects of managing a Kafka cluster using Chef and Terraform.You can view their Kafka config on Github.

API versioning practices at LinkedIn - This is an interesting article by LinkedIn engineering on how they built an API versioning system for the LinkedIn Marketing APIs. The system was built in their API Gateway and it makes it easy for engineers to upgrade the API without disrupting existing customers.

Interview Question

You are given a positive integer as input.

Print the next smallest and next largest numbers that have the same number of 1 bits in their binary representation.

Previous Question

As a reminder, here’s our last question

You’re given an infinite complete binary tree.

Every level of the tree is filled with values from 1 to infinity.

However, the ordering of the nodes has been flipped for even numbered rows, where they go from largest to smallest.

Given the label of a node in this tree, return the labels in the path from the root of the tree to that row.

Example

Input: 14

Output: [1, 3, 4, 14]

Can you solve this in O(log n) time?

Here’s the question in LeetCode

Solution

You first instinct might be to think of some kind of tree traversal or binary search algorithm.

But, that’s actually wrong.

You don’t need to interact with a binary tree data structure to solve this question!

Instead, try and work backwards.

We’re at the label node, but can we figure out what the parent node is based off the properties of a complete tree?

We can find what level of the tree we’re in by taking int(log(label)) where it’s log base 2. The first level will be level 0.

If our level is even, then the nodes will be ordered from smallest to largest. If our level is odd, then they will be ordered from largest to smallest.

Therefore, we can calculate the label of the first node in our level. If we’re on an even level, then our first node is 2**(level). Otherwise, the first node is 2**(level + 1) - 1.

Now that we have the first node, we can find our current node’s position in the level by subtracting label and first.

Given the current node’s position, our parent node’s position (in its own level) will be the current node’s position divided by 2 (rounded down). That is a property of complete binary trees.

Now, we have to find the first node in our parent node’s level and then we can add our parent node’s position (or subtract, depending on if the parent node’s level is even or odd) to find the parent node’s label.

Now, we repeat the same exact process until we come to our root node, which has the label 1.

While doing the process, we’ll append our labels on to an array that will keep track of the path.