How Facebook Transfers Exabytes of Data Across The World

Facebook needs to send exabytes of data across their different data centers. We talk about how they do this.

Hey Everyone!

Today we’ll be talking about

  • How Facebook Transfers Exabytes of Data Across Their Data Centers Globally

    • Why Facebook needs a caching system on top of their distributed data store

    • Using a Hierarchical Caching approach and the Bittorrent Protocol

    • Using a hybrid of both approaches with a decentralized data plane and a centralized control plane

  • Tech Snippets

    • The Spelled-Out Intro to Neural Networks and Backprop by Andrej Karpathy

    • Rebuilding the world's most popular Spellchecker

    • A great series of blog posts by GitHub diving into Git's internals

We also have a solution to our last coding interview question on finding the submatrix with the largest possible sum.


How Facebook Transfers Exabytes of Data Across Their Data Centers Globally

In order to serve their 3 billion users, Facebook runs one of the largest private clouds in the world. They have massive data centers spread out across the globe and they’ve invested tens of billions of dollars in their infrastructure.

One big challenge when operating at this scale is distributing data across the system. Executable files, search indexes, AI models and containers are a few examples of objects that Facebook needs to send to many different machines globally.

Each of these files ranges from a couple of megabytes to a few terabytes and they’re split into small chunks. These chunks need to be transferred between Facebook machines with low latency and a very high throughput (millions of machines may need to quickly read a certain object).

These files are stored on Facebook’s distributed data store and client machines can read the files from there. However, having all the client machines read from this data store quickly leads to scalability issues.

There would be far too many machines requesting files, and transfer speeds would be too slow. Instead, there needs to be a caching system built on top of the distributed data store that can facilitate efficient transfer of this data.

To build this system, Facebook tried multiple approaches with varying degrees of centralization. They first tried a highly centralized system with a hierarchical caching layer, but that led to scalability issues. They also tried a decentralized approach with the BitTorrent protocol, but that was too complex to manage.

Eventually, they settled on a balance between these two approaches with Owl, a system for high-fanout distribution of data objects across Meta’s private cloud. Owl distributes over 700 petabytes of data per day to over 10 million unique client machines across Facebook’s data centers.

Despite serving over 700 petabytes of data per day, only ~40 petabytes of data is read on average from the underlying data store. This means that Owl has a ~95% cache hit rate and is able to massively reduce the amount of read traffic sent to the underlying data store.

Engineers at Facebook published a great paper where they talked about their prior data distribution systems (hierarchical caching, BitTorrent and more), lessons learned, and the architecture/implementation details behind Owl.

Here’s a Summary

Facebook engineers needed a way to distribute large objects across their private cloud. The task can be described by 3 dimensions

  • Scale - The same object could be read by anywhere from a handful of client machines to millions of clients around the world.

  • Size - Objects range in size from 1 megabyte to a few terabytes. Objects are split up into chunks and stored in a distributed storage system.

  • Hotness - All the client machines may request the object within a few seconds of each other, or their reads could be spread out over a few hours.

Distributing these files must also be done efficiently and reliably. To be considered reliable, the caching system must successfully complete a large percentage of download requests within a certain latency. It should also not be too burdensome for engineers to maintain the system.

Facebook tried several approaches with varying amounts of centralization in the control plane and the data plane. The data plane consists of the machines where the cached data is stored, while the machines in the control plane determine which files the data plane nodes should cache/delete and how requests should be routed to nodes in the data plane.

Here are a couple of Facebook’s initial approaches.

Hierarchical caching

The first attempt was to add a hierarchical cache system in front of their distributed data stores. This is a pretty standard solution and also relatively simple to implement.

Facebook set aside a dedicated pool of machines to use as the caching layer.

When a client machine needs a certain file, its first request goes to a first-level cache. If there's a cache miss, then the first-level cache will request the data from the next-level caches in the hierarchy (second-level, third-level, and so on). The final layer is the distributed data store itself.

Data would be stored and evicted across these levels so that the first-level cache held the most requested (hottest) data.
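
As a rough illustration (not Facebook’s actual implementation), a multi-level lookup might look something like this sketch, where each cache level falls back to the next level on a miss and the final level is the data store itself:

    # Hypothetical sketch of a hierarchical cache lookup (illustrative only).
    class CacheLevel:
        def __init__(self, name, next_level):
            self.name = name
            self.next_level = next_level    # next cache level, or the data store
            self.store = {}                 # chunk_id -> chunk bytes

        def get(self, chunk_id):
            if chunk_id in self.store:      # cache hit at this level
                return self.store[chunk_id]
            # Cache miss: ask the next level, then cache the result here.
            chunk = self.next_level.get(chunk_id)
            self.store[chunk_id] = chunk
            return chunk

    class DataStore:
        def get(self, chunk_id):
            return f"bytes-for-{chunk_id}"  # stand-in for a slow distributed read

    # Client requests go to the first-level cache and fall back level by level.
    second_level = CacheLevel("L2", next_level=DataStore())
    first_level = CacheLevel("L1", next_level=second_level)
    print(first_level.get("model.bin/chunk-0"))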

The issue with this approach was that it was too difficult for Facebook to handle load spikes for particular pieces of content. Machines in the caching system would get overloaded and start to throttle requests from the clients, and Facebook had trouble provisioning capacity appropriately.

They would either provision for the steady state and miss load spikes or they would provision for load spikes and waste compute/servers.

The centralization of the data plane on the dedicated pool of machines was making the system too slow to scale.

Bittorrent

To address these scaling issues, Facebook built a second solution based on BitTorrent, a very popular peer-to-peer file sharing protocol.

With this system, any client that wants to download data becomes a peer in the system (so there were millions of peers). Clients would dedicate whatever resources they had available to sharing their downloaded files with other peers in the network. Trackers maintained a list of all the peers and which file chunks were stored on which peers.

When a new peer wants a certain data chunk, it can get a list of other peers that are sharing that chunk from the trackers. After downloading the chunk, that new peer can start sharing that chunk as well.
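
As a simplified sketch of that interaction (not the real BitTorrent wire protocol), the tracker is essentially a map from chunks to the peers that currently hold them:

    # Illustrative sketch of a tracker and peer announcements (not real BitTorrent).
    class Tracker:
        def __init__(self):
            self.peers_by_chunk = {}        # chunk_id -> set of peer addresses

        def peers_for(self, chunk_id):
            return list(self.peers_by_chunk.get(chunk_id, set()))

        def announce(self, peer_addr, chunk_id):
            # A peer announces that it now has this chunk and can share it.
            self.peers_by_chunk.setdefault(chunk_id, set()).add(peer_addr)

    tracker = Tracker()
    tracker.announce("10.0.0.5:6881", "file-A/chunk-3")

    # A new peer asks who has the chunk, downloads it from one of those peers,
    # then announces itself so it can serve the chunk to others.
    sources = tracker.peers_for("file-A/chunk-3")
    tracker.announce("10.0.0.9:6881", "file-A/chunk-3")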

This scaled much better than hierarchical caching due to the peer-to-peer nature of the system. When there was a load spike for a particular piece of content, the number of peers sharing that content would automatically increase at a similar rate as the demand.

However, each peer in this system was making its own individual decision on which data to request and share. A machine would only become a peer for a certain file if the machine needed to download that file for its own purposes.

This decentralization of the control plane led to an inefficient allocation of resources, where cold/stale data wasn't getting evicted from the system and hot data wasn't being replicated at the optimal level.

The decentralization also made the system very hard to operate and debug. Engineers could not get a clear picture of its health and status without aggregating data from a large number of peers.

Owl

Facebook then designed Owl, a system that combines the best of both of these approaches.

Owl has a decentralized data plane and a centralized control plane. Data is stored in a decentralized manner (similar to BitTorrent) on all the client machines that are downloading from the system. However, decisions around which client stores what chunk and how data is cached on the various peers are managed more centrally.

To accomplish this, Owl has 3 components

  • Peer - A peer is a library that's linked into a client process that wants to download data from Owl. As the machine is downloading a file, it can share data with other client machines (also peers) that request the file. When a peer wants to download a file, it asks the Owl Tracker (explained below) where to get the data from.

  • Superpeer - Superpeers are dedicated machines that can cache and serve data but aren't linked to a client process. Instead, the entire machine is dedicated to caching and sharing data with other peers/superpeers. Owl Trackers manage what data gets stored on the superpeers.

  • Tracker - The trackers are the brain of the system. They tell the peers and superpeers what data to cache/evict based on the entire state of the system. Trackers have a global view of what data is being requested so they can intelligently manage the system.

When a client machine wants to download a file from Owl, it will use the Owl library to send a remote procedure call for the data to a tracker.

The tracker has a global view of state and it will return information on the optimal peer/superpeer that has the data and can share it with the client. The client can then download the data from that node and become a peer itself.

The selection policy for how the tracker picks the peer/superpeer to share the data depends on a variety of factors, like geographic distance, load, and how much of the file the node has already saved.

If none of the peers have the file, then the tracker can have a superpeer fetch the data from the underlying data store. The superpeer can then share the data with the client machine, making the client machine a peer that can share the file.
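
Here’s a rough sketch of that flow under some assumptions; the class and method names (Tracker.find_source, fetch_from_store, etc.) are made up for illustration and are not Owl’s real API:

    # Hypothetical sketch of Owl's download flow (names are illustrative only).
    class Node:
        def __init__(self, name, load=0):
            self.name = name
            self.load = load
            self.chunks = set()

        def fetch_from_store(self, chunk_id):
            self.chunks.add(chunk_id)       # stand-in for reading the data store

    class Tracker:
        def __init__(self, superpeers):
            self.superpeers = superpeers
            self.holders = {}               # chunk_id -> set of nodes holding it

        def find_source(self, chunk_id):
            candidates = self.holders.get(chunk_id, set())
            if candidates:
                # The real policy weighs distance, load, and how much of the
                # file each candidate already has; we just use load here.
                return min(candidates, key=lambda node: node.load)
            # No peer has the chunk yet, so a superpeer pulls it from the store.
            superpeer = min(self.superpeers, key=lambda node: node.load)
            superpeer.fetch_from_store(chunk_id)
            self.register(superpeer, chunk_id)
            return superpeer

        def register(self, node, chunk_id):
            # Once a node has the chunk, it can serve it to other peers.
            self.holders.setdefault(chunk_id, set()).add(node)

    tracker = Tracker(superpeers=[Node("superpeer-1")])
    client = Node("client-42")
    source = tracker.find_source("model.bin/chunk-0")  # falls back to a superpeer
    tracker.register(client, "model.bin/chunk-0")      # the client now shares it too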

The default cache eviction policy is LRU, where the least recently used files get evicted when a peer's storage is full. However, many nodes also use a least-rare policy, where files are evicted from a peer based on how many other peers have that file cached; a file that is cached on many other peers will be evicted before a file cached on only a few peers.

The system can also be configured to use a hybrid policy of least-rare eviction for hot data and LRU eviction for cold data.
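
As a minimal sketch of the difference between the two policies (the function name and data shapes below are assumptions for illustration, not Owl’s actual code):

    # cached maps chunk_id -> last access time on this peer; replica_counts maps
    # chunk_id -> how many other peers also hold it (as reported by a tracker).
    def pick_eviction_victim(cached, replica_counts, policy="lru"):
        if policy == "lru":
            # Evict the chunk that was used least recently.
            return min(cached, key=lambda chunk: cached[chunk])
        if policy == "least_rare":
            # Evict the chunk that is most widely replicated elsewhere,
            # since other peers can still serve it.
            return max(cached, key=lambda chunk: replica_counts.get(chunk, 0))
        raise ValueError(f"unknown policy: {policy}")

    cached = {"chunk-a": 100, "chunk-b": 250, "chunk-c": 90}
    replicas = {"chunk-a": 40, "chunk-b": 2, "chunk-c": 7}
    print(pick_eviction_victim(cached, replicas, "lru"))         # chunk-c (oldest)
    print(pick_eviction_victim(cached, replicas, "least_rare"))  # chunk-a (most copies)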

Owl has over 10 million peers and approximately 800 superpeers. These are managed by 112 tracker nodes.

This is just a brief overview of how Owl works. You can get way more detail on sharding, security, fault tolerance and much more by reading the full paper.

Results

Facebook started Owl 2 years ago, and since then they've seen 200x growth in the amount of traffic Owl is getting. This growth came from replacing the prior systems and taking on their load, as well as from organic adoption.

Despite this massive increase in traffic, the number of machines needed to run Owl (for the superpeers and trackers) only increased by 4x. The decentralized nature of the data plane (with peer-to-peer distribution) makes the system much easier to scale.

Owl is now handling over 700 petabytes of data per day and has over 10 million client processes using the system. This amounts to a throughput of ~7 - 15 terabytes per second of data that client processes are reading. With Owl, the amount of storage reads that have to be served by the underlying distributed data store is less than 0.7 terabytes per second.

During a typical day, Owl clients will read 700+ petabytes of data per day, but only ~40 petabytes will be read from the underlying data store, equating to a ~95% cache hit rate.

For more details, you can read the full paper here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.


Prototyping An IoT Application with InfluxDB

There are more than 10 billion IoT devices, including fitness trackers, medical sensors, doorbells, and more. The engineering behind these devices is fascinating.

Here’s an awesome, step-by-step tutorial on how to stand up an IoT monitoring app with InfluxDB’s time series data platform.

InfluxDB makes it incredibly easy with a serverless architecture and zero infrastructure to manage.

The app is built on the free tier of InfluxDB, so you can code alongside and learn a new technology at no cost.

sponsored

Tech Snippets

  • Rebuilding the world’s most popular Spellchecker - Hunspell is an open source spell checker used by Google Chrome, Mozilla Firefox, LibreOffice, macOS, Adobe products and a variety of other apps. You can view the source code for Hunspell here. In this great series of blog posts, Victor Shepelev tries to rebuild Hunspell and walks through some of the assumptions and technical choices the project makes. It’s well worth reading if you want to learn more about the complexities behind building a spellchecker.

Interview Question

You’re given an infinite complete binary tree.

The nodes are labeled in row order with values from 1 to infinity.

However, the ordering of the labels is flipped for even-numbered rows, where they go from largest to smallest.

Given the label of a node in this tree, return the labels in the path from the root of the tree to that node.

Example

Input: 14

Output: [1, 3, 4, 14]

Can you solve this in O(log n) time?

Previous Question

As a reminder, here’s our last question

You are given an N x N matrix that contains positive and negative integers.

Write a function that finds the submatrix with the largest possible sum.

Solution

Brute Force Solution

The brute force solution is to look at every possible submatrix and calculate its sum.

We then pick the submatrix with the largest sum.

In order to do this, we’ll have to use 4 nested loops.

The first loop will iterate through the rows in our matrix and use that row as the starting row of our submatrix.

The second loop (nested in the first) will iterate through all the rows from the starting row onward and use this row as the end row of our submatrix.

The third loop (nested in the second) will iterate through all the columns in our matrix and use that column as the start column of our submatrix.

The fourth loop (nested in the third) will iterate through all the columns from the starting column onward and use that column as the end column of our submatrix.

Inside the fourth loop, we’ll have our submatrix and we can calculate the sum of all the elements inside it.

Calculating the sum of the elements in our submatrix will take O(N^2) time.

Since this O(N^2) summation happens inside 4 nested loops, the brute force algorithm takes O(N^6) time overall.
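
Here’s a minimal Python sketch of the brute force approach described above (the variable names are just illustrative):

    def maxSubmatrixBruteForce(matrix):
        n = len(matrix)
        best = float("-inf")
        for startRow in range(n):
            for endRow in range(startRow, n):
                for startCol in range(n):
                    for endCol in range(startCol, n):
                        # Sum every element of the submatrix: O(N^2) per candidate.
                        total = sum(matrix[r][c]
                                    for r in range(startRow, endRow + 1)
                                    for c in range(startCol, endCol + 1))
                        best = max(best, total)
        return best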

Optimized Solution

We can solve this faster by using the solution to the maximum subarray problem (Kadane’s algorithm).

The maximum subarray problem asks that given an integer array, find the contiguous subarray with the largest sum and return that largest sum.

You can use Kadane’s algorithm to solve the maximum subarray problem in O(n) time.
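
Here’s a minimal Python sketch of Kadane’s algorithm; the maxSubArray name is chosen to match the description below rather than taken from the original write-up:

    def maxSubArray(nums):
        # Kadane's algorithm: track the best subarray sum ending at each index.
        best_ending_here = best_overall = nums[0]
        for num in nums[1:]:
            # Either extend the previous subarray or start fresh at this element.
            best_ending_here = max(num, best_ending_here + num)
            best_overall = max(best_overall, best_ending_here)
        return best_overall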

For finding the largest submatrix, we’ll repeat the same brute force process on our rows.

We’ll try every combination of startRow and endRow using a nested for loop.

For the columns, we’ll collect the sums of each column for all the items between startRow and endRow into an array called colSums.

In the nested for loop with endRow, each iteration of the loop will expand our submatrix downward by a single row (endRow increases by 1).

Therefore, we can reuse the colSums array and just add in the items from endRow into their respective column sums.

Then, we’ll use our maxSubArray function to find the maximum contiguous subarray from colSums. That will be the maximum submatrix sum between startRow and endRow.

We’ll compare that to the maximum submatrix sum we’ve seen so far, and if it’s greater, we’ll make it the new maximum.
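
Putting it all together, here’s a sketch of the optimized solution (it reuses the maxSubArray helper from the Kadane’s sketch above, and the variable names are just illustrative):

    def maxSubmatrixSum(matrix):
        n = len(matrix)
        best = float("-inf")
        for startRow in range(n):
            colSums = [0] * n               # column sums for rows startRow..endRow
            for endRow in range(startRow, n):
                # Expanding the submatrix down by one row: fold endRow's values in.
                for col in range(n):
                    colSums[col] += matrix[endRow][col]
                # The best contiguous run of columns is the best submatrix
                # for this (startRow, endRow) pair.
                best = max(best, maxSubArray(colSums))
        return best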

The time complexity is O(N^3): there are O(N^2) startRow/endRow pairs, and each pair takes O(N) work to update colSums and run Kadane’s algorithm.