How Quora Shards their MySQL databases

Arpan KG
November 17, 2022

Hey Everyone!

Today we’ll be talking about

How Quora shards their MySQL databases
- Scaling issues Quora had with MySQL
- Splitting up their database by tables
- Splitting up tables into shards
Tech Snippets
- The Evolution of Scalable CSS
- Software Licenses Explained In Plain English
- What Every Programmer Should Know About SEO
- The Distributed Computing Manifesto at Amazon
- Ask HN: Books/Courses That Finally Made Topics Click

Plus, we have a solution to our last coding interview question on string matching.

Sharding Databases at Quora

Quora is a social platform where users can post and answer questions on anything. The website receives more than 600 million visits per month.

Quora relies on MySQL to store critical data like questions, answers, upvotes, comments, etc. The size of the data is on the order of tens of terabytes (without counting replicas) and the database gets hundreds of thousands of queries per second.

Vamsi Ponnekanti is a software engineer at Quora, and he wrote a great blog post about why Quora decided to shard their MySQL database.

MySQL at Quora

Over the years, Quora’s MySQL usage has grown in the number of tables, size of each table, read queries per second, write queries per second, etc.

In order to handle the increase in read QPS (queries per second), Quora implemented caching using Memcache and Redis.

However, the growth of write QPS and growth of the size of the data made it necessary to shard their MySQL database.

At first, Quora engineers split the database up by tables and moved tables to different machines in their database cluster.

Afterwards, individual tables grew too large and they had to split up each logical table into multiple physical tables and put the physical tables on different machines.

We’ll talk about how they implemented both strategies.

Splitting by Table

As the read/write query load grew, engineers had to scale the database horizontally (add more machines).

They did this by splitting up the database tables into different partitions. If a certain table was getting very large or had lots of traffic, they create a new partition for that table. Each partition consists of a master node and replica nodes.

The mapping from a partition to the list of tables in that partition is stored in ZooKeeper.

The process for creating a new partition is

Use mysqldump (a tool to generate a backup of a MySQL database) to dump the table in a single transaction along with the current binary log position (the binary log or binlog is a set of log files that contains all the data modifications made to the database)
Restore the dump on the new partition
Replay binary logs from the position noted to the present. This will transfer over any writes that happened after the initial dump during the restore process (step 2).
When the replay is almost caught up, the database will cutover to the new partition and direct queries to it. Also, the location of the table will be set to the new partition in ZooKeeper.

A pro of this approach is that it’s very easy to undo if anything goes wrong. Engineers can just switch the table location in ZooKeeper back to the original partition.

Some shortcomings of this approach are

Replication lag - For large tables, there can be some lag where the replica nodes aren’t fully updated.
No joins - If two tables need to be joined then they need to live in the same partition. Therefore, joins were strongly discouraged in the Quora codebase so that engineers could have more freedom in choosing which tables to move to a new partition.

Splitting Individual Tables

Splitting large/high-traffic tables onto new partitions worked well, but there were still issues around tables that became very large (even if they were on their own partition).

Schema changes became very difficult with large tables as they needed a huge amount of space and took several hours (they would also have to frequently be aborted due to load spikes).

There were unknown risks involved as few companies have individual tables as large as what Quora was operating with.

MySQL would sometimes choose the wrong index when reading or writing. Choosing the wrong index on a 1 terabyte table is much more expensive than choosing the wrong index on a 100 gigabyte table.

Therefore, engineers at Quora looked into sharding strategies, where large tables could be split up into smaller tables and then put on new partitions.

Key Decisions around Sharding

When implementing sharding, engineers at Quora had to make quite a few decisions. We’ll go through a couple of the interesting ones here. Read the full article for more.

Build vs. Buy

Quora decided to build an in-house solution rather than use a third-party MySQL sharding solution (Vitess for example).

They only had to shard 10 tables, so they felt implementing their own solution would be faster than having to develop expertise in the third party solution.

Also, they could reuse a lot of their infrastructure from splitting by table.

Range-based sharding vs. Hash-based sharding

There are different partitioning criteria you can use for splitting up the rows in your database table.

You can do range-based sharding, where you split up the table rows based on whether the partition key is in a certain range. For example, if your partition key is a 5 digit zip code, then all the rows with a partition key between 7000 and 79999 can go into one shard and so on.

You can also do hash-based sharding, where you apply a hash function to an attribute of the row. Then, you use the hash function’s output to determine which shard the row goes to.

Quora makes frequent use of range queries so they decided to use range-based sharding. Hash-based sharding performs poorly for range queries.

How Quora Shards Tables

So, when Quora has a table that is extremely large, they’ll split it up into smaller tables and create new partitions that hold each of the smaller tables.

Here are the steps they follow for doing this

Data copy phase - Read from the original table and copy to all the shards. Quora engineers set up N threads for the N shards and each thread copies data to one shard. Also, they take note of the current binary log position.
Binary log replay phase - Once the initial data copy is done, they replay the binary log from the position noted in step 1. This copies over all the writes that happened during the data copy phase that were missed.
Dark read testing phase - They send shadow read traffic to the sharded table in order to compare the results with the original table.
Dark write testing phase - They start doing dark writes on the sharded table for testing. Database writes will go to both the unsharded table and the sharded table and engineers will compare.

If Quora engineers are satisfied with the results from the dark traffic testing, they’ll restart the process from step 1 with a fresh copy of the data. They do this because the data may have diverged between the sharded and unsharded tables during the dark write testing.

They will repeat all the steps from the process until step 3, the dark read testing phase. They’ll do a short dark read testing as a sanity check.

Then, they’ll proceed to the cutover phase where they update ZooKeeper to indicate that the sharded table is the source of truth. The sharded table will now serve read/write traffic.

However, Quora engineers will still propagate all changes back to the original, unsharded table. This is done just in case they need to switch back to the old table.

For more details, you can read the full article here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.

Tech Snippets

The Evolution of Scalable CSS - How do you manage and organize your CSS when working on a large project? This is a great blog post that delves into why CSS is hard to manage at scale and some of the tools/techniques that have popped up to help solve this. Areas discussed include CSS in JS, Inline Styles, TailwindCSS, Sass and more.

The Distributed Computing Manifesto - Werner Vogels is the CTO of Amazon and he has a great blog called All Things Distributed. He just published the Distributed Computing Manifest, a document from the early days of Amazon (late 90s) that sketches out how Amazon will shift their backend from a two tier architecture (a monolith and a bunch of databases) to a service-based architecture with three tiers: a presentation, business logic and data layers.

Ask HN: Do You Recall Any Book Or Course That Finally Made A Topic Click? - An interesting discussion on Hacker News about great videos/courses/books on various topics in programming (and other topics). Some recommendations include Crafting Interpreters by Robert Nystrom, Peter Norvig's Design of Computer Programs, The Story of Asynchronous JavaScript and more.

What Every Programmer Should Know About SEO - SEO can often seem like a black box, but there are certain best practices that you should definitely follow if you want to rank high on Google. This is a great blog post that gives an intro to the basics. You should make sure your site is crawlable, use the right keywords in the right places, avoid duplicate content, use smart meta descriptions and more.

Software Licenses In Plain English - This is an awesome site that lets you enter in a software license (like GPL-3 or MIT License) and it'll tell you clearly what that license lets you do with the code.

Interview Question

Write a function that sorts the elements in a stack so that the smallest elements are on top.

You can use an additional temporary stack, but you cannot copy the elements into any other data structure.

The stack supports the following operations

push
pop
peek
isEmpty

Previous Question

As a reminder, here’s our last question

Implement a BST Iterator class that represents an iterator over the in-order traversal of a Binary Search Tree.

Implement the following methods

BSTIterator - constructor. The root of the BST will be passed in as a parameter. The pointer should be initialized to a non-existent number small than any number in the BST.
hasNext - Returns true if there exists a number in the traversal to the right of the pointer, otherwise returns false.
next - Moves the pointer to the right, then returns the number at the pointer.

Solution

We can solve this question by doing an iterative in-order traversal (NeetCode is a great YouTube channel for coding interviews).

We’ll use an array to keep track of our stack and use the variable cur to keep track of our current node.

Constructor

When we instantiate our class, we’ll first create cur and point that to None.

Then, we’ll create an empty array that represents our stack.

The smallest object in a BST is the left-most object in the tree.

Therefore, we’ll keep moving down to the left child until we reach None. While we do this, we add each parent node to our stack.

We do this part in _getLeftMost.

When the user calls the next method, we’ll do some basic case analysis.

if cur is not pointing to anything and our stack has a length of greater than 1 -Then, we can pop off an item from our stack and return it. We also set cur to that item.
if cur is not None and has a right child -Then we’ll run our in-order traversal on the tree under the right child.We first find the smallest node in that tree by moving down the left children and add each parent node to our stack. We do this with our _getLeftMost function.We return the smallest node.
if cur is not None and we don’t have a right child -Then we will first check if our stack is empty.If the stack is not empty, then we can pop a node off the stack and return that.Otherwise, that means we’ve traversed through the entire tree and we’ll just return None.

hasNext

For hasNext, we can just check if the length of our stack is greater than 0 and if the node at cur has a right child.

If either of those are true, then we can return true.

Otherwise, we return False.

Here’s the Python 3 code.