The Architecture of Dropbox's Asynchronous Task Framework

Hey Everyone!

Today we’ll be talking about

  • The design of Dropbox's Asynchronous Task Framework

    • Dropbox built an Asynchronous Task Framework to handle tasks like sending a user an email

    • It's built with Courier (their RPC framework), EdgeStore (their metadata store built on top of MySQL), AWS Simple Queue Service and more

    • We talk about the architecture and the system guarantees

  • Tech Snippets

    • Stack Overflow's Tech Stack

    • How Reddit uses Large Language Models

    • Functional Programming Jargon Explained

    • How do Chrome Extensions affect Browser Performance


Dropbox's Asynchronous Task Framework

Dropbox is a file hosting and sharing company with over 700 million registered users. Hundreds of thousands of companies also rely on Dropbox for their business needs like storing and sharing documents, team collaboration, etc.

Like many other companies, Dropbox relies on an asynchronous task manager (called Asynchronous Task Framework or ATF) to manage and run async tasks. When you remove a file on your Dropbox account, the UI may show that the file was moved; but behind the scenes, an async task was created to delete the file on all the database replicas.

ATF (Asynchronous Task Framework) serves more than 9000 async tasks scheduled per second, and more than 30 teams at Dropbox make use of the framework.

Arun Sai Krishnan is a Software Engineer at Dropbox, and he wrote a great blog post on design goals of ATF and the architecture behind it.

Here’s a summary

ATF allows Dropbox engineers to schedule async tasks on-demand through a callback-based architecture. Developers can define callback functions and then schedule ATF tasks that execute these callbacks.

The callback functions are called lambdas, and developers can write lambdas to execute async tasks like sending out an email to a user.

When an engineer wants to execute a lambda, they can submit it to the ATF. This creates a task, which is just a unit of execution of a lambda (similar to how a process is a unit of execution of a program).

ATF supports features like

  • Task Scheduling - schedule the task to execute at a specific time.

  • Priority Based Execution - tasks with higher priority get executed before tasks with lower priority.

  • Task Status Querying - clients can query the status of a scheduled task.

ATF Architecture

Here’s the Architecture for ATF

Async Task Framework (ATF) [Fig 1]

ATF consists of the following components

  • Frontend - Clients can schedule tasks using remote procedure calls (RPC). Dropbox uses gRPC with an in-house built RPC framework called Courier.

  • Task Store - The frontend accepts tasks and stores them in the task store. This can be any generic data store that has indexed querying capability. Dropbox uses their in-house metadata store called Edgestore. It's built on top of MySQL.

  • Store Consumer - The store consumer is a service that will periodically poll the task store to find tasks that are ready for execution. It pushes these tasks onto the right queue.

  • Queue - Dropbox uses AWS Simple Queue Service (SQS) to queue the tasks. Worker machines will pull tasks off the SQS queues.

  • Controller - Worker machines consist of a controller and multiple executors. The controller process is responsible for polling tasks from SQS queues and pushing them onto process local buffered queues. Then, it serves these tasks from its local queue as a response to Next Work RPC requests.

  • Executor - The executor is a process with multiple threads that is responsible for the actual task execution. It gets tasks from the Controller by polling for work from the Controller by sending Next Work RPC requests.

  • Heartbeat and Status Controller (HSC) - The HSC serves RPCs for status updates during task execution and setting task status in the task store after execution.

ATF provides the following system guarantees

  • At-least Once Task Execution - tasks will be executed at least once. The ATF will try and retry tasks until they complete execution or reach a fatal failure state. This means that a task may get executed multiple times, so developers have to ensure that their lambda logic is idempotent (can be run multiple times without changing the result).

  • No Concurrent Task Execution - The ATF system guarantees that at most one instance of a task will be actively executing at any given time, so developers can write their callback logic without designing for concurrent execution of the same task from different workers. Before a task starts execution, it will be marked with a state of “Claimed” so it doesn’t get assigned to another worker machine.

  • Delivery Latency - 95% of tasks begin execution within 5 seconds from their scheduled execution time. The store consumer polls for ready tasks once every two seconds. This polling frequency can be configured to change the task delivery latency.

  • 3 Nines Availability - The ATF service is 99.9% available to accept task scheduling requests from any client.

For more details on ATF’s ownership model, task lifecycle and data model, you can read the full article here.

Tech Snippets

  • Stack Overflow’s Tech Stack - This is a great graphic by Stack Overflow on the tech stack they use to power Stack Exchange. They get 1.3 billion page views per month and run the site on 9 web servers that can serve a peak of 450 requests per second. They use 4 SQL servers that are organized as 2 clusters and can serve more than 11,000 queries per second. Check out the link to read about the rest of their tech stack.

  • How Reddit uses Large Language Models - A language model predicts the probability of a sequence of words. They form the basis for Natural Language Processing. BERT is a language model developed by Google and is used widely across the internet (most prominently in Google Search). Reddit extended BERT to create SnooBERT, a language model that has been trained on Reddit’s posts and comments from the past year. They wrote a great blog post on how/why they did this.

  • Functional Programming Jargon Explained - Do you work with a bunch of programming language nerds who constantly throw around terms like Lambda Calculus, Monoid, Referential Transparency, etc.? Well then, this is the repo for you! It gives short explanations on a bunch of terms in Functional Programming along with short code examples in JavaScript.

  • How do Chrome extensions impact browser performance? - This is a great blog post that looks at the top 1000 most popular Chrome extensions and examines how they impact browser performance. Some findings were that extensions like Honey, Evernote Web Clipper, and Loom have significant CPU consumption. On ad heavy websites, ad blocker extensions can greatly improve browser performance.

Interview Question

You are given a sorted array nums.

Write a function that removes any duplicates in nums in-place.

You must do this using O(1) space, you cannot allocate extra space for another array.

Previous Question

As a refresher, here’s the previous question

You are given an array and an integer K where K is less than size of the array.

Return the smallest K numbers in the array.

Solution

The brute force solution is to just sort the array and then return the first K elements.

Using an algorithm like quicksort, this would take N log N where N is the number of items in the array.

But, we can do better.

We only care about the K smallest numbers in the array. We don’t care about the ordering of the rest of the elements in the array. We also don’t care about the relative ordering of the K smallest numbers.

Therefore, we can use a max heap to keep track of the K smallest numbers.

We can check the largest element in our max heap in O(1) time, so we can quickly check if an element is small enough to be one of the smallest K numbers (it should be smaller than the largest element in the heap).

We’ll start by adding the first K elements in our array to our max heap. Then, we’ll iterate through the rest of the array.

For each element in our array, we’ll first check if it is smaller than the largest element in our max heap. If it is, then that means our element is one of the smallest K integers (that we’ve seen so far).

Therefore, we’ll insert it into our max heap. That is a log K operation. We’ll also have to remove the largest element in our max heap (since it should only be K elements). That’ll be another log K operation.

After we iterate through the entire array, then we can just return the elements in our heap.

The time complexity of our solution is N log K where K will always be less than or equal to N. This means it’s an improvement over N log N.

Note: for our Python 3 code, we make use of the heapq library. Unfortunately, heapq’s implementation is a min heap (we need a max heap). Therefore, we use a little “hack” where we multiply each number we insert by -1 (thereby reversing the order).

Here’s the Python 3 code

There’s also an even faster solution using the Quickselect algorithm.

Quickselect is a variant of Quicksort, specifically meant to find the Kth smallest item.

In Quicksort, we select a pivot, reorder the numbers so that all the numbers smaller than the pivot are to the left and all the numbers larger than the pivot are to the right.

Then, we recurse on both sides.

In Quickselect, we only recurse on one side (the side that contains the Kth smallest element).

Once we find a pivot that divides the list into K elements on one side, then we’ve found the K smallest elements!

Quickselect runs in an average of O(n) time complexity, but it is O(n^2) in the worst case.