☕ Database Replication and Sharding

A quick intro to Database Replication and Sharding! Plus an interview question on binary search trees.

Hi Everyone!

We’re back! Really sorry for missing the last two emails! I was having an issue with my email service provider, but looks like it’s resolved now!

Our previous solution is pretty long, so no industry news for today! Scroll down to see today’s interview problem!

Previous Solution

As a refresher, here’s the previous question

What is Database Replication and Database Sharding?

Why is Replication necessary?

Why is Sharding necessary?

Solution

Most applications today are considered data-driven, meaning application data is an integral part of the system. This makes databases a frequent bottleneck in System Design. If your database fails (or is just too slow), then your system will fail.

Database Replication and Sharding are methods used to help scale up the amount of reads/writes you can perform.

Replication means creating replicas of your database.

Rather than just having one main database that handles all the reads/writes, you might have one main database and several read replicas.

Whenever you need to write data, you write it to the main database and then to all the read replicas.

When you need to read data, you can read from any one of the read replicas.

This helps you scale an application where your users read more data than they write (this is typical for many applications - users will consume far more than they produce).

Replication improves scalability and it also improves redundancy. If your main database fails, one of the read replicas can takeover as the main database and there’s a much smaller chance of data loss.

With database replication, there are several ways of implementing it.

  • Asynchronous replication - When a user creates a post or sends a message on your application, you’ll first write the data to the main database. Then, you’ll let the user know that his data was successfully written. After, you asynchronously update the read replicas with the data that the user has written. This results in a faster user experience for writes.

  • Synchronous replication - When a user creates a post or sends a message on your application, you’ll first write the data to the main database. Then, you update the read replicas first and after that, you tell the user that his data has been written. This way, the writes are being done synchronously across all your database and replicas. This results in a better user experience for reads since your database system has higher consistency (all the replicas are up to date).

So far, we’ve only discussed having a single main database. However, as you scale your database up more and more, you’ll need to scale that up as well. Therefore, there are even more options for you to choose from.

  • Single-Leader Architecture - you only have one main database that interfaces with all your read replicas.

  • Multi-Leader Architecture - multiple servers work as main databases. They all receive writes and serve as a model for your replicas.

  • No-leader Architecture - Every server can receive writes and serve as a model for your replicas.

Database Sharding

One issue that comes up with Database replicas is every single database in your system will contain all the data. They’re all replicas of the main database(s)!

However, what happens when you have an immense amount of user data? It might not be technically possible to have all your user data in every single database.

This is where Database Sharding comes in.

Database sharding is where you break your data into smaller chunks called shards. Each database will contain a shard of your data rather than the whole pie.

There are multiple ways of implementing database sharding (several ways of splitting up your data).

Key Based Sharding

You take a column from your data and plug the values from that column into a hash function to determine which shard the data should go to. The mechanics of how this works is similar to a hash table.

You might take something like a customer ID, ZIP code, or something and hash that value. This column will be the shard key. It should be static (meaning that a row's shard key should never change).

Range Based Sharding

You select a column from your data that can be categorized in some way. Then you create ranges from that column and make each range a shard.

An example might be location (each country is a shard), date (each month is a shard), price (every $100 increment is a shard), etc.

Lookup Table Based Sharding

A third way of implementing sharding is by using a lookup table (or hash table).

Like we did in key-based sharding, you'll set one column of your data as the shard key. Then, you can randomly assign rows to different shards and remember which shard contains which row with a lookup table.

One thing to watch out for is unbalanced shards, where one shard is getting far more reads/writes than the other shards. An example might be if you’re using Range Based sharding and you create one shard for users in the United States and another for users in Belgium. If your app has more US-based users, then that shard will get a lot more read/writes than your Belgium shard! That will cause scalability issues.

Interview Question

Code an in-order traversal for a Binary Search Tree.

Can you write your function both recursively and iteratively?

We’ll send a detailed solution tomorrow, so make sure you move our emails to primary, so you don’t miss them!

Gmail users—move us to your primary inbox

  • On your phone? Hit the 3 dots at the top right corner, click "Move to" then "Primary"

  • On desktop? Back out of this email then drag and drop this email into the "Primary" tab near the top left of your screen

Apple mail users—tap on our email address at the top of this email (next to "From:" on mobile) and click “Add to VIPs”

Best,

Arpan