The Architecture of Facebook's Distributed Key-Value Store

Hey Everyone!

Today we’ll be talking about

  • The Architecture of Facebook's Distributed Key-Value Store

    • Why Facebook built ZippyDB on top of RocksDB

    • The Data Model and Architecture of ZippyDB

    • ZippyDB's Consistency Options: Eventual, Read-Your-Writes and Strong Consistency

  • Tech Snippets

    • Advice for Engineering Managers Who Want to Climb the Career Ladder

    • How IBM Lost the Cloud

    • B-Trees: More Than I Thought I'd Want to Know

    • How to Review Code as a Junior Developer


The Architecture of Facebook's Distributed Key-Value Store

ZippyDB is a strongly consistent, distributed key-value store built at Facebook. It was first deployed in 2013 and it serves many use cases like storing product data, keeping track of events and storing metadata for a distributed file system.

Because it serves a variety of different use cases, ZippyDB offers users a lot of flexibility with the option to tune durability, consistency, availability and latency to fit the application’s needs.

Sarang Masti is a software engineer at Facebook and he wrote a great blog post about the design choices and trade-offs made in building the ZippyDB service.

Here’s a summary

Before ZippyDB, various teams at Facebook used RocksDB to manage their data. RocksDB is a fork of Google’s LevelDB with the goal of improving performance for server workloads.

However, the teams using RocksDB were each facing similar challenges around consistency, fault tolerance, failure recovery, replication, etc. They were building their own custom solutions, which meant an unnecessary duplication of effort.

ZippyDB was created to address the issues for all these teams. It provides a highly durable and consistent key-value data store with RocksDB as the underlying storage engine.

Data Model

ZippyDB supports a simple key-value data model with APIs to get, put and delete keys. It supports iterating over key prefixes and deleting a range of keys, similar to what you get with RocksDB.

ZippyDB also has TTL (time-to-live) support for ephemeral data, where clients can specify an expiry time for a key-value pair.
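
To make the data model concrete, here's a rough sketch of what a client using a ZippyDB-style API might look like. The class and method names are purely illustrative (this is not Facebook's actual client library), and the in-memory dict stands in for the remote store.

```python
class KeyValueClient:
    """Illustrative ZippyDB-style key-value API; a dict stands in for the remote store."""

    def __init__(self):
        self._store = {}

    def put(self, key, value, ttl_seconds=None):
        # TTL support for ephemeral data: clients can specify an expiry time.
        self._store[key] = (value, ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        return entry[0] if entry else None

    def delete(self, key):
        self._store.pop(key, None)

    def iterate_prefix(self, prefix):
        # Iterate over all keys sharing a prefix, similar to RocksDB iterators.
        for key in sorted(self._store):
            if key.startswith(prefix):
                yield key, self._store[key][0]

    def delete_range(self, start, end):
        # Delete every key in the half-open range [start, end).
        for key in [k for k in self._store if start <= k < end]:
            del self._store[key]


db = KeyValueClient()
db.put("user:1:name", "alice", ttl_seconds=3600)
print(db.get("user:1:name"))              # alice
print(list(db.iterate_prefix("user:1")))  # [('user:1:name', 'alice')]
```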

Architecture

The basic unit of data management for ZippyDB is a shard, where each shard consists of multiple replicas that are spread across geographic regions for fault tolerance. The replication is done with either Paxos or async replication (depending on the configuration).

Within a shard, a subset of the replicas are configured to be part of the Paxos quorum group, where data is synchronously replicated between those nodes. A write involves persisting the data in a majority of the Paxos replicas' logs (so a quorum read will return the new write) and also writing the data to RocksDB on the primary. Once that's done, the write gets confirmed to the client, providing highly durable writes.

The remaining replicas in the shard are configured as followers, which receive data through asynchronous replication. These follower replicas serve low-latency reads, with the tradeoff being weaker consistency.

The split between quorum replicas and follower replicas is configurable, letting users strike their preferred balance between durability, write performance, read performance and consistency. We'll talk about ZippyDB's consistency options in the next section.
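
As a rough illustration of that trade-off, here's a hypothetical shard configuration sketch. The field names and numbers are made up for this example; the post doesn't describe ZippyDB's actual configuration format.

```python
from dataclasses import dataclass

@dataclass
class ShardConfig:
    # Hypothetical knobs, for illustration only.
    paxos_replicas: int     # synchronously replicated; form the Paxos quorum group
    follower_replicas: int  # asynchronously replicated; serve low-latency reads

    def write_quorum(self) -> int:
        # A write is confirmed once a majority of Paxos replicas persist it in their logs.
        return self.paxos_replicas // 2 + 1

# More Paxos replicas -> higher durability, higher write latency.
# More followers -> cheaper reads close to the client, but they may lag the primary.
config = ShardConfig(paxos_replicas=5, follower_replicas=2)
print(config.write_quorum())  # 3
```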

The optimal assignment of shards to servers depends on shard load, user constraints, etc., and is handled by another Facebook service called ShardManager, which also monitors shards for load balancing, failure recovery and more.

Each shard is 50 - 100 gigabytes in size and is split into several thousand microshards, which are then stored on different physical servers. This additional layer of abstraction allows ZippyDB to reshard the data without any client-side changes.

ZippyDB maps microshards to shards using two types of mapping: compact mapping and Akkio mapping.

Compact mapping is used when the assignment is fairly static and mapping is only changed when there is a need to split shards that have become too large or hot.

Akkio mapping is more involved and tries to optimize microshard placement to minimize latency. You can read about how Akkio mapping works here.
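
Here's a minimal sketch of the indirection that makes resharding transparent to clients. The mapping table and IDs are hypothetical (not ZippyDB's actual implementation), but the idea is the same: clients address microshards, so the service can move a microshard to a different physical shard just by updating the mapping.

```python
# Hypothetical microshard -> shard routing table.
microshard_to_shard = {
    "microshard-001": "shard-A",
    "microshard-002": "shard-A",
    "microshard-003": "shard-B",
}

def route(microshard_id):
    """Resolve which physical shard currently holds a microshard."""
    return microshard_to_shard[microshard_id]

# Resharding just rewrites entries in the mapping; clients keep addressing
# microshards, so nothing changes on their side.
microshard_to_shard["microshard-002"] = "shard-C"
print(route("microshard-002"))  # shard-C
```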

Consistency

ZippyDB provides configurable consistency and durability levels as options in the read/write APIs. This allows users to make durability, consistency and performance trade-offs dynamically on a per-request level.

By default, a write involves persisting the data on a majority of the Paxos replicas’ logs and also writing the data to RocksDB on the primary before confirming the write to the client. Persisting the write on a majority of the Paxos replicas means that the Paxos Quorum will return the new value.

However, some applications need lower latency writes so ZippyDB also supports a fast-acknowledge mode where writes are confirmed as soon as they are enqueued on the primary for replication. This means lower durability.
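
Here's a toy sketch of what choosing durability per write request could look like. The names and the replication loop are assumptions for illustration, not ZippyDB's actual API; the two acknowledgement points are the ones described above.

```python
from enum import Enum

class WriteMode(Enum):
    QUORUM_ACK = 1  # default: ack after a majority of Paxos replicas persist the write
    FAST_ACK = 2    # ack once the write is enqueued on the primary (lower durability)

def write(replica_logs, key, value, mode=WriteMode.QUORUM_ACK):
    """Toy model of the two acknowledgement points."""
    replica_logs[0].append((key, value))  # the primary always receives the write first
    if mode is WriteMode.FAST_ACK:
        return "acked after enqueueing on the primary"
    majority = len(replica_logs) // 2 + 1
    for log in replica_logs[1:majority]:  # replicate until a majority have persisted it
        log.append((key, value))
    return "acked after a majority persisted the write"

logs = [[], [], [], [], []]  # one log per Paxos replica
print(write(logs, "k1", "v1"))                           # quorum-ack (default)
print(write(logs, "k2", "v2", mode=WriteMode.FAST_ACK))  # fast-ack
```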

For reads, the three most popular consistency levels for ZippyDB are

  • Eventual

  • Read-your-writes

  • Strong

Eventual - This is a much stronger consistency level than what’s typically described as eventual consistency. ZippyDB ensures that reads that are served by follower replicas aren’t lagging behind the primary/quorum beyond a certain configurable threshold. Therefore, it’s similar to something like Bounded Staleness that you might see in Azure’s CosmosDB.

Read-Your-Writes - The client will always get a replica that is current enough to have any previous writes made by this client. In order to implement this, ZippyDB assigns a monotonically increasing sequence number to each write and it’ll return this number in response to a client’s write request. The client can use their latest sequence number when sending read requests to ensure that it gets a replica that’s up to date on all the client’s past writes.

Strong - The client will see the effects of the most recent writes. This is done by routing the read requests to the primary.
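
Below is a minimal sketch of these read paths, with a toy in-memory stand-in for the service. The class and method names are hypothetical; the parts taken from the article are the monotonically increasing sequence number returned on writes, read-your-writes reads that require a replica caught up to that sequence number, and strong reads routed to the primary.

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.applied_seq = 0  # highest write sequence number this replica has applied

class FakeService:
    """Toy stand-in for the store: one primary plus one lagging follower."""

    def __init__(self):
        self.primary = Replica()
        self.follower = Replica()  # async replication; may be behind the primary
        self.seq = 0

    def write(self, key, value):
        # The primary applies every write and assigns a monotonically increasing sequence number.
        self.seq += 1
        self.primary.data[key] = value
        self.primary.applied_seq = self.seq
        return self.seq  # returned to the client with the acknowledgement

    def read(self, key, min_seq=0):
        # Serve from any replica that has applied at least min_seq.
        replica = self.follower if self.follower.applied_seq >= min_seq else self.primary
        return replica.data.get(key)

    def read_strong(self, key):
        return self.primary.data.get(key)  # strong reads are routed to the primary

class Client:
    def __init__(self, service):
        self.service = service
        self.last_seq = 0  # highest sequence number from this client's own writes

    def put(self, key, value):
        self.last_seq = self.service.write(key, value)

    def get(self, key, consistency="eventual"):
        if consistency == "strong":
            return self.service.read_strong(key)
        if consistency == "read_your_writes":
            return self.service.read(key, min_seq=self.last_seq)
        return self.service.read(key)  # eventual: any replica within the staleness bound

svc = FakeService()
client = Client(svc)
client.put("profile:42", "v1")
print(client.get("profile:42", consistency="read_your_writes"))  # v1
```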

For more details on how ZippyDB implements transactions and conditional writes, you can read the full article here.

Tech Snippets

  • Advice for Engineering Managers who want to Climb the Ladder - Charity Majors was previously an Engineering Manager at Facebook and is now the CEO of Honeycomb (a VC-backed startup). She wrote a great post for EMs on how they can climb the career ladder and get to positions in senior leadership.

  • How IBM lost the Cloud - When you talk about the largest Cloud providers, the first company you think of probably isn't IBM. This is a great article that delves into why IBM was so behind on cloud computing despite being a tech behemoth. It's based on more than a dozen interviews with current and former IBM executives/engineers.

  • B-Trees: More Than I Thought I'd Want to Know - This is a great blog post that delves into B-Trees and why they're useful. B-Trees are self-balancing tree data structures that are used extensively for lookups in databases and file systems.

  • How to Review Code as a Junior Developer - This is an awesome blog post on the Pinterest Engineering blog about how junior developers should do code reviews. Many junior devs may think their feedback (especially when given to developers who are more senior) is a waste of time, but that’s not the case! Several benefits of reviewing code as a junior dev are:

    • You learn the code base more quickly

    • You build a feedback circle

    • You take more ownership over the codebase

Interview Question

You are given a math equation that consists of positive integers and +, -, * and / operators (addition, subtraction, multiplication and division).

Write a function that computes the result.

Be sure to follow order of operations (you will not be given parentheses).

Input - “2 * 3 + 5 / 6 * 3 + 15”

Output - 23.5

Previous Question

As a refresher, here’s the previous question

You are given a sorted array nums.

Write a function that removes duplicates from nums in-place such that each element appears at most twice.

You must do this using O(1) space; you cannot allocate extra space for another array.

Solution

We can solve this question using the Two Pointer Method.

We use a slow and a fast pointer, where the slow pointer never moves ahead of the fast pointer.

Since elements only need to be removed when they appear more than twice, the first two elements can always stay, so we start both slow and fast at index 2.

We iterate through the sorted array nums and check if nums[slow - 2] == nums[fast].

If this is the case, then nums[fast] is part of a group of three or more equal elements, so it should not be kept; the slot at the slow pointer will later be overwritten with a future value (thus removing the element from nums). We keep the slow pointer where it is and only increment the fast pointer.

If it is not the case, then we write the value at the fast pointer into the slot at the slow pointer. Then, we increment both the slow and fast pointers.
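
Here's a short Python implementation of this approach:

```python
def remove_duplicates(nums):
    """Remove duplicates in-place so each value appears at most twice.

    Returns the length of the valid portion of nums. Uses O(1) extra space.
    """
    if len(nums) <= 2:
        return len(nums)

    slow = 2  # next position to write a kept element
    for fast in range(2, len(nums)):
        # If nums[slow - 2] == nums[fast], keeping nums[fast] would create a
        # third copy of the same value, so we skip it.
        if nums[slow - 2] != nums[fast]:
            nums[slow] = nums[fast]
            slow += 1
    return slow

# Example: [1, 1, 1, 2, 2, 3] -> first 5 elements become [1, 1, 2, 2, 3]
print(remove_duplicates([1, 1, 1, 2, 2, 3]))  # 5
```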