How Distributed Databases work

Hey Everyone,

Today we’ll be talking about

  • How Distributed Databases handle Replication - Today’s email will be a brief summary of Chapter 5 from Designing Data Intensive Applications by Martin Kleppmann

    • Why replicate your data across multiple nodes - reducing latency, increasing availability, increasing read throughput

    • 3 popular strategies for writing changes to your replicas - Single leader replication, Multi-leader replication, Leaderless replication

    • Common problems that arise from Replication Lag and how to solve them - Read Your Own Writes, Monotonic Reads, Consistent Prefix Reads

  • Plus, a couple awesome tech snippets on

    • An amazing (free) textbook on Machine Learning interviews - Contains hundreds of fully solved job interview questions from a wide range of key topics in ML

    • Anatomy of Programming Languages - An open source textbook that teaches the theory of programming languages with a practical focus

    • Your API is Bad - A great article that goes through the best practices you should employ when building an API.

Plus, we have a solution to our last interview question and a new question from Google!

Questions? Please contact me at [email protected].

Quastor Daily is a free Software Engineering newsletter that sends out Technical Deep Dives, summaries of Engineering Blog Posts, and FAANG Interview questions (with detailed solutions).

How Distributed Databases Work - Replication

Designing Data Intensive Applications (DDIA) is a must-read if you’re interested in backend engineering. The data engineering world is full of buzzwords and hype, but Martin Kleppmann does an amazing job of breaking down all the core technologies.

Here’s a summary of Chapter 5 from DDIA on Replication

Replication is where you keep a copy of your data on multiple different machines. These machines are connected via a network, so they’re all accessible to your backend server(s).

Instead of having a single machine serve as your database, you are now using a distributed database consisting of multiple machines.

There are several reasons why you’d want to replicate your data across several computers

  1. Reduce latency - Users in India can send requests to the nodes located in Delhi while users in America can send requests to the nodes located in New York.

  2. Increase Availability - If one of the nodes goes down for some reason you’ll have another node that can take over and respond to requests for data.

  3. Increase Read Throughput - Several nodes have the ability to respond to read queries instead of just having 1 machine doing all the work for read requests. Many workloads are read-scaling (consist of mostly reads and a small percentage of writes) so increasing read throughput is extremely helpful.

The difficult part about replication lies in handling changes to replicated data.

When you get a write request that modifies your database, how do you make sure that all the replicas reflect this write request?

How do you stop replicas that haven’t updated from responding with stale data to read requests?

There are 3 popular strategies for writing changes to all your replicas

  1. Single Leader Replication - One replica node is designated as the leader. The other nodes are followers. Write requests go to the leader node, who will then propagate the changes to the followers. This is the replication strategy used by many databases like PostgreSQL, MongoDB, MySQL, and more.

  2. Multi Leader Replication - This is similar to Single Leader, but now multiple nodes can act as a leader and process write requests. Multi-Leader Replication is usually implemented with external tools, such as Tungsten Replicator for MySQL, BDR for PostgreSQL, and GoldenGate for Oracle.

  3. Leaderless Replication - All replica nodes can accept write requests from clients, so there is no leader node. Riak and Cassandra are examples of databases that use leaderless replication. Amazon used leaderless replication for its in-house Dynamo system, which inspired these designs, so Riak and Cassandra are also known as Dynamo-style.

Note - Amazon’s Dynamo system is different from Amazon’s DynamoDB. DynamoDB is based on many principles of Dynamo but has a different implementation. DynamoDB uses single-leader replication.

Almost all distributed databases use one of these three approaches and they all have their pros and cons.

However, Single Leader Replication is the most popular replication strategy for distributed databases. Therefore, we’ll dive further into single leader. If you’re interested in learning more about multi-leader and leaderless strategies, check out the book.

Single Leader Replication

Single Leader Replication works as follows (a minimal code sketch follows the steps)

  1. One of the replicas is designated as the leader. Write requests from clients will be sent to the leader, who will write the new data to its local storage.

  2. The other replicas are known as followers. Whenever the leader writes new data to its local storage, it also sends the data changes to all of the followers.

  3. Each follower takes the data change log from the leader and updates its local copy of the database by applying all the new writes.

  4. When a client wants to read from the database, the read request can be sent to any of the nodes in the database - leader or follower.
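
The sketch below is purely illustrative - the Leader and Follower classes stand in for real nodes, and a real database would replicate a write-ahead log over the network and handle many failure cases.

# Illustrative single-leader replication: the leader applies writes locally,
# appends them to a change log, and pushes each change to every follower.

class Follower:
    def __init__(self):
        self.data = {}
        self.last_applied = 0   # index of the last log entry applied

    def apply(self, entry_index, key, value):
        self.data[key] = value
        self.last_applied = entry_index

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []           # ordered change log
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value            # 1. apply to local storage
        self.log.append((key, value))
        for f in self.followers:          # 2. propagate to followers
            f.apply(len(self.log), key, value)
        return "ok"

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:1", "Ada")
print(followers[0].data)   # reads can be served by the leader or any follower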

Writes to the database can be asynchronous, synchronous, or semi-synchronous.

For an asynchronous write, the leader will get the client’s write request and update its own local storage. Then, it will respond saying that the write was successful. After it responds, the leader will send a message to all the follower nodes with the data change from the client’s write request.

With a synchronous write, the leader will first make sure every follower node has written the data change to their local database. Once the leader node has received confirmation from all the followers, it will respond with a message that the write was successful.

For a semi-synchronous write, the leader will wait for write confirmation from a specific number of follower nodes (this parameter can be configured) before it responds with a message that the write was successful.
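
To make the three modes concrete, here's a toy sketch of when the leader acknowledges the client under each mode (the function and helper names are made up for illustration, not a real database API):

import threading
import time

def send_to_follower(follower_log, change):
    # Pretend network call: the follower persists the change and confirms.
    time.sleep(0.01)                      # simulated network + disk latency
    follower_log.append(change)
    return True

def handle_write(leader_log, follower_logs, change, mode="semi-sync", min_acks=1):
    leader_log.append(change)             # leader always applies locally first

    if mode == "async":
        # Acknowledge immediately; replicate in the background.
        threading.Thread(
            target=lambda: [send_to_follower(f, change) for f in follower_logs]
        ).start()
        return "ok"

    acks = 0
    for i, follower_log in enumerate(follower_logs):
        send_to_follower(follower_log, change)
        acks += 1
        if mode == "semi-sync" and acks >= min_acks:
            # Enough confirmations; replicate to the rest in the background.
            rest = follower_logs[i + 1:]
            threading.Thread(
                target=lambda: [send_to_follower(f, change) for f in rest]
            ).start()
            return "ok"

    return "ok"                           # sync: every follower confirmed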

In practice, synchronous writes are rarely used. With a synchronous write strategy, write requests will take an extremely long time (since you have to wait for every follower to respond) and will frequently fail (any time one or more follower nodes are not responsive).

Therefore, engineers typically use a semi-synchronous strategy or an asynchronous strategy.

The tradeoff between semi-synchronous and asynchronous write strategies comes down to how fast you want your write requests processed (asynchronous writes are faster) and how durable you want your write requests to be (asynchronous write strategies have a greater chance of losing write data if the leader node crashes before sending write changes to the followers).

Two issues that come up frequently with Single Leader replication are

  • Handling Node Outages

  • Replication Lag

Handling Node Outages

Node outages are inevitable, especially if you’re using a large distributed database with many follower nodes.

There are two types of node outages: follower outages and leader outages.

Follower Failure: Catch-up recovery

If a follower node fails, then it can recover quite easily. Followers keep a log of all the data changes received from the leader in local, nonvolatile storage. Therefore, the follower knows the last transaction it processed.

The follower will query the leader for all the changes that have happened since that last transaction, and then update its local state to match the present.
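
As a rough sketch of the idea, with the leader's change log as a plain list:

# Catch-up recovery: the follower remembers how far into the leader's change
# log it got before the outage, and asks for everything after that point.

leader_log = [("user:1", "Ada"), ("user:2", "Grace"), ("user:3", "Edsger")]
follower = {"data": {"user:1": "Ada"}, "last_applied": 1}   # crashed after entry 1

def catch_up(follower, leader_log):
    for key, value in leader_log[follower["last_applied"]:]:   # missed entries
        follower["data"][key] = value
    follower["last_applied"] = len(leader_log)

catch_up(follower, leader_log)
print(follower["data"])   # {'user:1': 'Ada', 'user:2': 'Grace', 'user:3': 'Edsger'}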

Leader Failure: Failover

Handling a failure of the leader is trickier. One of the follower nodes needs to be promoted to the new leader and clients have to be reconfigured to send their writes to the new leader. The other followers also have to start consuming data changes from the new leader.

This process is called failover.
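
One small piece of failover is deciding which follower to promote; a common choice is the follower with the most up-to-date change log. Here's a sketch of just that piece (real failover also needs failure detection via timeouts, agreement on the new leader, and reconfiguration of clients and the remaining followers):

# Promote the follower that has applied the most entries from the old
# leader's change log (this minimizes lost writes under async replication).

def pick_new_leader(followers):
    return max(followers, key=lambda f: f["last_applied"])

followers = [
    {"id": "node-2", "last_applied": 1041},
    {"id": "node-3", "last_applied": 1038},
]
print(pick_new_leader(followers)["id"])   # node-2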

Failover processes have tons of things that can go wrong

  • If asynchronous replication is used, then the new leader may not have received all the writes from the old leader before it failed. This means weaker durability guarantees.

  • When the original leader comes back online, it may still believe it is the leader, so two nodes end up accepting writes. This is a common failure mode and is often called split brain.

  • Load issues can arise since your database can’t accept new writes while the failover process is happening. If the leader node fails often, then this can clog up the database.

Replication Lag

When you’re using a single-leader replication strategy with semi-synchronous or asynchronous writes, you’ll frequently run into consistency issues where a client will read stale data from a follower node that hasn’t been fully updated.

This inconsistency is a temporary state, and if you wait for a bit then all the followers will eventually catch up. Therefore, this effect is known as eventual consistency.

However, eventual consistency is a vague term, and doesn’t specify how long the replication lag is. It could be a few seconds or even a few minutes.

Therefore, even with “eventual consistency”, the replication lag can be a big problem for your users.

There are several approaches you can use to reduce the most common issues that users face.

We’ll go through some of these approaches and the issues that they solve.

Read Your Own Writes

Let’s say you’re building a twitter clone. The user can post a tweet from their computer, which will send a write request to your distributed database.

This write request is asynchronously replicated, so the leader will respond that the write was successful after changing its local state. Then, it will send the change to all the follower nodes.

If the user refreshes their page right after tweeting and tries to reload, the new read request for the user’s previous tweets might go to a follower node that hasn’t been informed of the new tweet.

Therefore, the user’s twitter profile will not show their new tweet after they refresh.

Obviously, this can be super frustrating to users.

Read Your Own Writes consistency is a solution that guarantees that if the user reloads the page, they will always see any updates they submitted themselves. It makes no promises about other users.

Read Your Own Writes consistency can be implemented in many ways. One possible way is to track the last time a user submitted an update. If they submitted an update within the past minute, then their read requests should be handled by the leader node.
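
Here's a rough sketch of that approach (the in-memory dict and the one-minute window are illustrative; a real system might track the timestamp in the user's session or in the request itself):

import time

last_write_at = {}              # user_id -> timestamp of their last write
RECENT_WINDOW_SECONDS = 60

def record_write(user_id):
    last_write_at[user_id] = time.time()

def pick_node_for_read(user_id, leader, followers):
    wrote_recently = time.time() - last_write_at.get(user_id, 0) < RECENT_WINDOW_SECONDS
    if wrote_recently:
        return leader           # the leader definitely has the user's own writes
    return followers[0]         # otherwise any follower will do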

Monotonic Reads

Going back to the twitter clone example, asynchronous replication will result in some follower nodes lagging other follower nodes in terms of updates.

Therefore, a user might visit the website and get tweets from a follower node that is up to date. Afterwards, they might reload the page and get tweets from a follower node that is lagging behind.

This will result in their twitter feed “moving back in time” as the data they’re getting is stale. This makes for a bad user experience.

Monotonic reads is a guarantee that this kind of anomaly does not happen.

One way of achieving this guarantee is by making sure that each user always reads from the same follower node (different users can read from different replicas).

The replica can be chosen based on a hash of the user ID, rather than randomly.
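
A small sketch of that idea (the hashing scheme here is just one way to do it):

import hashlib

def replica_for_user(user_id, followers):
    # Hash the user ID so the same user always lands on the same follower.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return followers[int(digest, 16) % len(followers)]

followers = ["follower-1", "follower-2", "follower-3"]
print(replica_for_user("user-42", followers))   # same follower on every call

One caveat: if that replica fails or the set of replicas changes, the user's reads get rerouted to a different node, so the routing needs to handle that case.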

Consistent Prefix Reads

Let’s say you have user A and user B on your twitter clone app. User A tweets out a picture of his dog. User B reply tweets to that picture with a compliment for the dog.

There is a causal dependency between the two tweets where user B’s reply tweet doesn’t make any sense if you can’t see user A’s tweet.

User C follows both user A and user B. If user A’s tweet goes through more replication lag than user B’s tweet, then user C might see user B’s reply tweet without getting user A’s tweet.

Consistent Prefix Reads is a guarantee that solves this anomaly. This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.

This can be solved if the database always applies the writes in the same order, but there are complications that arise when your database is sharded. Check out DDIA for more details on this.
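
One way to picture the guarantee (purely illustrative): give every write an increasing sequence number, and only surface the longest unbroken prefix a replica has received, so a reply never shows up without the write it depends on.

def consistent_prefix(received_writes):
    # received_writes: {sequence_number: write}
    prefix, seq = [], 1
    while seq in received_writes:       # stop at the first gap
        prefix.append(received_writes[seq])
        seq += 1
    return prefix

# The replica has received B's reply (write 2) but not A's tweet (write 1) yet:
received = {2: "B: what a cute dog!"}
print(consistent_prefix(received))      # [] - the reply is held back until write 1 arrives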


Tech Snippets

  • Machine Learning Interview Questions and Solutions - Amir Ivry is a PhD candidate who’s currently doing AI Research at Microsoft. Shlomo is a Quantum Computing Scientist who was previously Chief Data Scientist at multiple startups. They wrote an awesome book on Machine Learning Interviews that goes through hundreds of fully solved problems on all areas of AI. The book is openly published on arXiv.

  • Anatomy of Programming Languages - William Cook is an Associate Professor of Computer Science at UT Austin. He published an awesome textbook on the anatomy of programming languages that covers the material for his UT Austin course. The book uses Haskell, but only a basic understanding of the language is required.

  • Your API is Bad - Paddy Carver is an engineer at LaunchDarkly. Previously, he was a senior software engineer at HashiCorp. He wrote an awesome article on the best practices you should be following if you want to build an API that’s easy to use.

How Notion Sharded Their Postgres Database

Notion is an app that is meant to serve as a personal (or corporate) workspace.

You can store notes, tasks, wikis, kanban boards and other things in a Notion workspace and you can easily share it with other users.

If you’ve been a Notion user for a while, you probably noticed that the app got extremely slow in late 2019 and 2020.

Last year, Notion sharded their Postgres monolith into a fleet of horizontally scalable databases. The resulting performance boost was pretty big.

Sharding a database means partitioning your data across multiple database instances.

This allows you to run your database on multiple computers and scale horizontally instead of vertically.

When to Shard?

Sharding your database prematurely can be a big mistake. It can result in an increased maintenance burden, new constraints in application code and little to no performance improvement (so a waste of engineering time).

However, Notion was growing extremely quickly, so they knew they’d have to implement sharding at some point.

The breaking point came when the Postgres VACUUM process began to stall consistently.

The VACUUM process clears storage occupied by dead tuples in your database.

When you update data in Postgres, the existing data is not modified. Instead, a new (updated) version of that data is added to the database.

This is because it’s not safe to directly modify existing data, as other transactions could be reading it.

At a later point, you can run the VACUUM process to delete the old, outdated data and reclaim disk space.

If you don’t regularly vacuum your database (or have Postgres run autovacuum, where it does this for you), you’ll eventually reach a transaction ID wraparound failure.

So, you must vacuum your database or it will eventually fail.

Having the VACUUM process consistently stall is not an issue that can be ignored.

Application-Level vs. Managed

Sharding can be divided into two approaches

  • Application-Level Sharding - You implement the data partitioning scheme in your application code. You might direct all American users to one database and all Asian users to another database.

  • Third-Party Sharding - You rely on a third party to handle the sharding for you. An example is Citus, an open source extension for Postgres.

Notion decided to go with Application-Level sharding.

They didn’t want to go with a third-party solution because they felt its sharding logic would be opaque and hard to debug.

Shard Key

In order to shard a database, you have to pick a shard key. This determines how your data will be split up amongst the shards.

You want to pick a shard key that will equally distribute loads amongst all the shards.

If one shard is getting a lot more reads/writes than the others, that can make scaling very difficult.

Notion decided to partition their database by workspace. Workspaces are the folders that contain all the pages, tasks, notes, etc.

So, if you’re a student using Notion, you might have separate Workspaces for all your classes.

Each workspace is assigned a UUID upon creation, and that UUID space is partitioned into uniform buckets.

Each bucket goes to a different shard.

How many Shards?

Notion ended up going with 480 logical shards distributed across 32 physical databases (with 15 logical shards per database).

This allows them to handle their existing data and scale for the next two years (based on their projected growth).
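
Here's a rough sketch of routing a workspace to a shard with those numbers (the modulo-on-the-UUID scheme is illustrative; Notion partitions the UUID space into uniform buckets, but the workspace -> logical shard -> physical database idea is the same):

import uuid

LOGICAL_SHARDS = 480
PHYSICAL_DATABASES = 32
SHARDS_PER_DATABASE = LOGICAL_SHARDS // PHYSICAL_DATABASES   # 15

def shard_for_workspace(workspace_id: uuid.UUID):
    logical_shard = workspace_id.int % LOGICAL_SHARDS
    physical_db = logical_shard // SHARDS_PER_DATABASE
    return logical_shard, physical_db

workspace_id = uuid.uuid4()
print(shard_for_workspace(workspace_id))   # e.g. (217, 14)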

Database Migration

After establishing how the sharded database works, you still have to migrate from the old database to the new distributed database.

This is the 4-step process Notion used for the migration (a sketch of the double-write step follows the list).

  1. Double-write: Incoming writes are applied to both the old and new databases.

  2. Backfill: Migrate the old data to the new database.

  3. Verification: Ensure the integrity of data in the new database.

  4. Switch-over: Actually switch to the new database. This can be done incrementally, e.g. double-reads, then migrate all reads.
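
As a rough sketch of the double-write step (the objects and routing helper here are hypothetical; a real implementation also has to decide what happens when one of the two writes fails):

def save_block(block, old_db, sharded_dbs):
    # Existing write path: the monolith is still the source of truth.
    old_db.write(block)
    # New write path: route by workspace so the sharded fleet stays in sync
    # while the backfill and verification are in progress.
    shard = sharded_dbs[hash(block["workspace_id"]) % len(sharded_dbs)]
    shard.write(block)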


Interview Question

Given an integer array nums and an integer k, return the number of good subarrays of nums.

A good array is an array where the number of different integers in that array is exactly k.

  • For example, [1,2,3,1,2] has 3 different integers: 1, 2, and 3.

A subarray is a contiguous part of an array.

Previous Solution

As a reminder, here’s our last question

Given an integer, convert it to a Roman Numeral.

Example

Input: 3

Output: “III”

Input: 4

Output: “IV”

Input: 1994

Output: “MCMXCIV”

Solution

This question actually has a pretty elegant solution.

We first put all the roman numeral symbols (including subtractive pairs like CM and IV) and their integer values in a dictionary (dictionaries preserve insertion order in Python 3.7+; in 3.6 this was an implementation detail).

We put them in order from largest to smallest.

Then, we iterate through our dictionary and for each integer value, we divide the number by that integer value.

That tells us how many of those roman numerals we will need.

For example, 4000 will be represented as MMMM.

4000 / 1000 = 4, which means 4 Ms.

Then, we can add that number of roman numeral symbols to our character array (that represents the final roman numeral).

After, we can set our number to the remainder that’s left.

Then, we continue on to the next roman numeral, integer value pair in our dictionary.

Here’s one way to write the Python 3 code, following the approach described above
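
def int_to_roman(num: int) -> str:
    # Symbol/value pairs ordered from largest to smallest value.
    values = {
        "M": 1000, "CM": 900, "D": 500, "CD": 400,
        "C": 100, "XC": 90, "L": 50, "XL": 40,
        "X": 10, "IX": 9, "V": 5, "IV": 4, "I": 1,
    }
    result = []
    for symbol, value in values.items():
        count, num = divmod(num, value)   # how many of this symbol we need, plus the remainder
        result.append(symbol * count)
    return "".join(result)

print(int_to_roman(3))      # III
print(int_to_roman(4))      # IV
print(int_to_roman(1994))   # MCMXCIV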
