☕ Distributed Databases Explained

Hey Everyone,

Hope you’re all having a fantastic day!

Today we’ll be talking about

  • The CAP Theorem

  • Database Replication

  • Database Sharding

Today’s interview question is on String Matching.

System Design Question

As a refresher, here’s the last question

What is Database Replication and Database Sharding?

Why is Replication necessary?

What are the tradeoffs between Asynchronous vs. Synchronous replication?

Why is Sharding necessary?

What are some ways of implementing Database Sharding?

Solution

Most applications today are considered data-driven, meaning application data is an integral part of the system. This makes databases a frequent bottleneck in System Design. If your database fails (or is just too slow), then your system will fail.

Database Replication and Sharding are methods used to scale the number of reads/writes your database can handle.

Database Replication

Replication means creating replicas of your database.

Rather than just having one main database that handles all the reads/writes, you might have one main database and several read replicas.

Whenever you need to write data, you write it to the main database, and the change is then propagated to all the read replicas.

When you need to read data, you can read from any one of the read replicas.

This helps you scale an application where your users read more data than they write (this is typical for many applications - users will consume far more than they produce).
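Here’s a rough sketch of that read/write split in Python (the database connection objects and their execute method are hypothetical placeholders):

    import random

    class ReplicatedDatabase:
        # Sends writes to the main database and reads to a random
        # read replica. A sketch only - real systems usually do this
        # in the database driver or a proxy layer.
        def __init__(self, primary, replicas):
            self.primary = primary    # connection to the main database
            self.replicas = replicas  # connections to the read replicas

        def write(self, statement):
            # All writes go to the main database.
            return self.primary.execute(statement)

        def read(self, query):
            # Any replica can serve a read, which spreads out the load.
            return random.choice(self.replicas).execute(query)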

Replication improves scalability, and it also improves redundancy. If your main database fails, one of the read replicas can take over as the main database, and there’s a much smaller chance of data loss.

There are several ways of implementing database replication.

  • Asynchronous replication - When a user creates a post or sends a message on your application, you’ll first write the data to the main database. Then, you’ll let the user know that their data was successfully written. Afterward, you asynchronously update the read replicas with the data that the user has written. This results in a faster user experience for writes.

  • Synchronous replication - When a user creates a post or sends a message on your application, you’ll first write the data to the main database. Then, you update the read replicas, and only after that do you tell the user that their data has been written. This way, writes are applied synchronously across your main database and all the replicas. This results in a better user experience for reads, since your database system has higher consistency (all the replicas are up to date).
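Here’s a rough sketch of the difference between the two (the primary and replica objects and their write methods are hypothetical placeholders):

    from concurrent.futures import ThreadPoolExecutor

    executor = ThreadPoolExecutor()

    def synchronous_write(primary, replicas, data):
        primary.write(data)
        for replica in replicas:
            replica.write(data)   # block until every replica confirms
        return "acknowledged"     # replicas are guaranteed up to date

    def asynchronous_write(primary, replicas, data):
        primary.write(data)
        for replica in replicas:
            executor.submit(replica.write, data)  # replicas catch up in the background
        return "acknowledged"     # faster ack, but replicas may briefly lag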

So far, we’ve only discussed having a single main database. However, as your write load grows, a single main database can become a bottleneck, so there are even more options for you to choose from.

  • Single-Leader Architecture - you only have one main database that interfaces with all your read replicas.

  • Multi-Leader Architecture - multiple servers work as main databases. They all receive writes and serve as a model for your replicas.

  • No-Leader Architecture - every server can receive writes and serve as a model for your replicas.

Database Sharding

One issue that comes up with database replication is that every single database in your system contains all the data. They’re all replicas of the main database(s)!

However, what happens when you have an immense amount of user data? It might not be technically possible to have all your user data in every single database.

This is where Database Sharding comes in.

Database sharding is where you break your data into smaller chunks called shards. Each database will contain a shard of your data rather than the whole pie.

There are multiple ways of implementing database sharding (several ways of splitting up your data).

Key Based Sharding

You take a column from your data and plug the values from that column into a hash function to determine which shard the data should go to. The mechanics are similar to a hash table.

You might take something like a customer ID or ZIP code and hash that value. This column will be the shard key. It should be static (meaning that a row's shard key should never change).
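As a minimal sketch of the idea (the shard count and key are arbitrary here):

    import hashlib

    NUM_SHARDS = 4

    def shard_for(shard_key: str) -> int:
        # Use a stable hash (not Python's built-in hash(), which is
        # randomized per process) so a key always maps to the same shard.
        digest = hashlib.md5(shard_key.encode()).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    print(shard_for("customer_12345"))  # same key -> same shard, every time

One caveat with this modulo scheme: changing NUM_SHARDS remaps almost every key, which is why systems that expect to reshard often use consistent hashing instead.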

Range Based Sharding

You select a column from your data that can be categorized in some way. Then you create ranges from that column and make each range a shard.

An example might be location (each country is a shard), date (each month is a shard), price (every $100 increment is a shard), etc.
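A sketch using the price example (the $100 boundaries are arbitrary):

    import bisect

    # Upper bound of each range; shard i holds prices below BOUNDARIES[i].
    BOUNDARIES = [100, 200, 300, 400]

    def shard_for_price(price: float) -> int:
        # bisect finds which range the price falls into.
        return bisect.bisect_right(BOUNDARIES, price)

    print(shard_for_price(250))  # -> 2, the $200-$300 shard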

Lookup Table Based Sharding

A third way of implementing sharding is by using a lookup table (or hash table).

As with key-based sharding, you'll set one column of your data as the shard key. Then, you can randomly assign rows to different shards and record which shard contains which row in a lookup table.
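A minimal sketch (in a real system, the lookup table would live in its own highly available datastore, not in process memory):

    import random

    NUM_SHARDS = 4
    lookup_table = {}  # shard key -> shard number

    def shard_for(shard_key: str) -> int:
        # New keys get a random shard; the assignment is remembered so
        # the same key always routes to the same shard.
        if shard_key not in lookup_table:
            lookup_table[shard_key] = random.randrange(NUM_SHARDS)
        return lookup_table[shard_key]

The upside is flexibility - you can rebalance by updating table entries (and moving the corresponding rows). The downside is that every query pays for the extra lookup, and the lookup table itself has to be kept available.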

One thing to watch out for is unbalanced shards, where one shard gets far more reads/writes than the others. An example might be if you’re using Range Based sharding and you create one shard for users in the United States and another for users in Belgium. If your app has more US-based users, then the US shard will get a lot more reads/writes than your Belgium shard! That will cause scalability issues.

The CAP Theorem

When you’re working with a distributed system, you now have the possibility of network failures.

Network outages are impossible to avoid with distributed databases and with that comes the CAP Theorem.

The CAP Theorem states that it is impossible for a distributed database to simultaneously provide all three of the following guarantees.

  • Consistency - every request sent to the database will always return the most recent data.

  • Availability - every request sent to the database will receive a non-error response. The data may be stale (not consistent), but it should never be an error or timeout.

  • Partition Tolerance - the distributed database should continue to work despite network partitions, where nodes in the system can’t communicate with each other.

Your distributed database must be partition tolerant, since network failures are impossible to avoid in a distributed system. So, in the event of a network partition, you have to trade off either consistency or availability.

Interview Question

You are given a list of strings called words and a string pattern.

Return a list of the strings inside words that match pattern.

A word matches the pattern if the letters in the word can be mapped one-to-one to the letters in the pattern (the mapping must be a bijection - no two letters can map to the same pattern letter).

Example

Input: words = ["abc","deq","mee","aqq","dkd","ccc"], pattern = "abb"

Output: ["mee","aqq"]

Explanation: “mee” matches “abb” since m can be mapped to a and e can be mapped to b. The same is true for “aqq”. “ccc” does not match because a and b would both have to map to c, which isn’t one-to-one.

We’ll send a detailed solution tomorrow, so make sure you move our emails to primary so you don’t miss them!

Gmail users—move us to your primary inbox

  • On your phone? Hit the 3 dots at the top right corner, click "Move to" then "Primary"

  • On desktop? Back out of this email then drag and drop this email into the "Primary" tab near the top left of your screen

Apple mail users—tap on our email address at the top of this email (next to "From:" on mobile) and click “Add to VIPs”

Previous Tech Snippets

How to Review Code as a Junior Developer - a terrific blog post by Emma Catlin at Pinterest. Here’s a quick summary.

  • Junior Devs may lack confidence when asked to conduct code reviews. “Why would a senior engineer want to have a junior engineer review their code?”

  • However, there are several benefits to reviewing code as a junior dev

    • Helps you learn the codebase - reading well-written code by senior engineers helps you quickly learn the codebase. It also helps you know who to ask when you need help with a section of the codebase.

    • Helps you build a feedback circle - when you give other engineers feedback on their code, it becomes easier for you to ask them for feedback.

    • Being an owner - Reviewing code helps you take partial ownership over the team’s codebase. Many companies encourage an ownership mentality for developers.

  • How do you develop the ability to review code?

    • Ask Questions - If something’s not clear to you, then it probably isn’t clear to everyone. Ask questions on why a piece of logic was added in a specific file vs. further downstream, what a specific section of code does, or what a particular comment means.

    • Calibrate feedback - Identify what your team members care about and calibrate feedback off that. Does your team have code guidelines that you can refer to?

    • Emulate others - Identify someone on your team who reviews code well and watch what they do. Observe what they look for and ask about, and also observe how they ask.

Google AI also published a post on H01, a petabyte-scale browsable reconstruction of human brain tissue. Here’s a quick summary.

  • A connectome is a map of all the neural connections in an organism’s brain. It’s useful for understanding the organization of neural interactions inside the brain.

  • Releasing a full mapping of all the neurons and synapses in a brain is incredibly complicated. In January 2020, Google Research released a “hemibrain” connectome of a fruit fly - an online database with the structure and synaptic connectivity of roughly half of a fruit fly’s brain.

  • The connectome for the fruit fly has completely transformed neuroscience, with Larry Abbott, a theoretical neuroscientist at Columbia, saying “the field can now be divided into two epochs: B.C. and A.C. — Before Connectome and After Connectome”.

    • You can read more about the fruit fly connectome’s influence here.

  • Google Research is now releasing the H01 dataset, a 1.4 petabyte (a petabyte is 1024 terabytes) rendering of a small sample of human brain tissue.

    • The sample covers one cubic millimeter of human brain tissue, and it includes tens of thousands of reconstructed neurons, millions of neuron fragments and 130 million annotated synapses.

  • The initial brain imaging generated 225 million individual 2D images. The Google AI team then computationally stitched and aligned that data to produce a single 3D volume.

    • Google did this using a recurrent convolutional neural network. You can read more about how this is done here.

  • You can view the results of H01 (the imaging data and the 3D model) here.

  • The 3D visualization tool linked above was written with WebGL and is completely open source. You can view the source code here.

  • H01 is a petabyte-scale dataset but is only one-millionth the volume of an entire human brain. The next challenge is a synapse-level mapping of an entire mouse brain (500x bigger than H01), but serious technical challenges remain.

    • One challenge is data storage - a mouse brain could generate an exabyte of data, so Google AI is working on image compression techniques for connectomics with negligible loss of reconstruction accuracy.

    • Another challenge is that the imaging process (collecting images of the slices of the mouse brain) is not perfect. There is image noise that has to be dealt with.

    • Google AI addressed the imaging noise by imaging the same piece of tissue in both a “fast” acquisition regime (leading to higher amounts of noise) and a “slow” acquisition regime (leading to low amounts of noise). They then trained a neural network to infer the “slow” scans from the “fast” scans, and now use that network as part of the connectomics process.

Previous Coding Problem Solution

As a refresher, here’s the last coding question

You are given a linked list with n nodes.

The nodes in the linked list have a next and prev pointer, but they also have a random pointer.

The random pointer points to a randomly selected node in the linked list (or it could point to null).

Construct and return a deep copy of the given linked list.

Solution

One way of solving this question would be with a hash table.

You create a hash table mapping the nodes from the old linked list to the nodes of the deep copy linked list. The keys in the hash table would be the nodes of the old linked list and the values would be the respective nodes in the deep copy.

You iterate through the given linked list, and for each node you:

  1. Check if the current node has a mapping in the deep copy. If not, then create a node in the deep copy that is a copy of the current node. Create the mapping in the hash table.

  2. Check if the next node has a mapping in the deep copy. If not, then create a node in the deep copy that is a copy of the next node. Create the mapping in the hash table.

  3. Check if the random node has a mapping in the deep copy. If not, then create a node in the deep copy that is a copy of the random node, and create the mapping in the hash table.

  4. Set the copy’s next, prev, and random pointers to the corresponding copies from the hash table.

After iterating through the entire linked list, you can check the hash table for the copy of the original head node and return that.

Here’s the Python 3 code - a minimal sketch of this approach, assuming a Node class with val, prev, next, and random fields…
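    class Node:
        def __init__(self, val):
            self.val = val
            self.prev = None
            self.next = None
            self.random = None

    def deep_copy(head):
        if not head:
            return None

        old_to_copy = {}  # original node -> its copy

        def copy_of(node):
            # Steps 1-3: make sure this node has a mapping in the deep copy.
            if node not in old_to_copy:
                old_to_copy[node] = Node(node.val)
            return old_to_copy[node]

        cur = head
        while cur:
            copy = copy_of(cur)
            # Step 4: wire the copy's pointers to the corresponding copies.
            if cur.next:
                copy.next = copy_of(cur.next)
                copy.next.prev = copy
            if cur.random:
                copy.random = copy_of(cur.random)
            cur = cur.next

        return old_to_copy[head]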

The time and space complexity are both linear.

However, this isn’t the most optimal way of solving the question. We can actually do it without using any additional data structures (other than the new deep copy linked list we’re constructing).

The way we do it is by augmenting the original linked list.

We’ll first iterate through the original linked list and create a new node in between the cur node and the cur.next node.

This new node has the same value as cur and is meant to be cur’s copy in the deep copy linked list. We’ll ignore the random pointer for now.

We go through the entire linked list and create the new nodes.

So, if we had a linked list that looked like

1 -> 2 -> 3 -> 4

It will now look like

1 -> 1 -> 2 -> 2 -> 3 -> 3 -> 4 -> 4

Now, we iterate through the augmented linked list and set the random pointers for the new nodes we created.

Since the deep copy node will be immediately after the original node, we know that the deep copy node for the node pointed at by our random pointer will also be immediately after that node.

After setting all the random pointers, we’ll have to iterate through the augmented linked list again.

Now, we’ll be breaking the linked list up into two linked lists. One linked list will be all the original linked list nodes and the other linked list will be the deep copy nodes.

Now, we can return the deep copy linked list.

Here’s the Python 3 code - again a minimal sketch, reusing the same Node class…
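    def deep_copy(head):
        if not head:
            return None

        # Pass 1: interleave a copy right after each original node,
        # turning 1 -> 2 -> 3 into 1 -> 1' -> 2 -> 2' -> 3 -> 3'.
        cur = head
        while cur:
            copy = Node(cur.val)
            copy.next = cur.next
            cur.next = copy
            cur = copy.next

        # Pass 2: each copy sits right after its original, so the copy
        # of cur.random is just cur.random.next. Same trick for prev.
        cur = head
        while cur:
            copy = cur.next
            if cur.random:
                copy.random = cur.random.next
            if cur.prev:
                copy.prev = cur.prev.next
            cur = copy.next

        # Pass 3: split the interleaved list back into the original
        # list and the deep copy.
        new_head = head.next
        cur = head
        while cur:
            copy = cur.next
            cur.next = copy.next
            copy.next = cur.next.next if cur.next else None
            cur = cur.next

        return new_head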

If you want to practice more questions like this with top engineers at Google, Facebook, etc., then check out Interviewing.io.

You can book realistic mock interviews with senior FAANG engineers who will give you detailed and actionable feedback on exactly what you need to work on.

You don’t pay anything until you’re hired.

Check them out here.