Partitioning Relational Databases at GitHub

October 25, 2021

Hey Everyone,

Today we’ll be talking about

Tools GitHub built to make (vertically) partitioning their relational database easier
- GitHub’s tech stack and why they wanted to vertically partition
- How GitHub enforces schema domains with SQL linters
- How GitHub uses Vitess for vertical sharding
Plus, a couple awesome tech snippets on
- Rewriting the sync engine at Dropbox - The sync engine is responsible for keeping local files synced with Dropbox’s cloud. Dropbox Engineering wrote a great article detailing the process of rewriting it.
- Modern JavaScript: Everything you missed over the last 10 years - A great cheat sheet that lists all the features that have been added to JavaScript over the past decade.
- A gentle intro to Assembly with Rust - An awesome blog post that looks at assembly code generated by some simple Rust functions and breaks it down.

We have a solution to our last coding interview question on linked lists, plus a new question from Amazon.

Quastor Daily is a free Software Engineering newsletter sends out FAANG Interview questions (with detailed solutions), Technical Deep Dives and summaries of Engineering Blog Posts.

Partitioning GitHub’s Relational Database

In our last summary (included below), we talked about GitHub’s transition from a monolith to microservices.

This summary is on GitHub’s blog post on how they built tooling to make partitioning their database easier.

Until recently, GitHub was built around one MySQL database cluster that housed a large portion of the data used for repositories, issues, pull requests, user profiles, etc.

This MySQL cluster was called mysql1.

This created challenges around scaling and issues around reliability since all of GitHub’s core features would stop working if the database cluster went down.

In 2019, GitHub set up a plan to improve their ability to partition their relational databases. The goal was to create better tooling that improved GitHub’s ability to partition relational databases.

The tooling GitHub built made partitioning much easier by allowing for

Virtual Partitions - Engineers can separate database tables virtually in the application layer so that the physical separation process is easier.
Move data without downtime - After the database is virtually partitioned, Engineers can easily move partitioned database tables to a different database cluster without downtime.

Since implementing these changes, GitHub has seen a significant decrease in load on the main database cluster.

In 2019, mysql1 answered 950,000 queries per second on average, 900,000 queries per second on replicas, and 50,000 queries per second on the primary.

In 2021, the same database tables were partitioned over several database clusters. The average load on each host halved despite the total queries per second increasing to 1,200,000 queries per second.

Here’s a rundown on the technical changes GitHub made to implement Virtual Partitions and moving data without downtime.

Virtual Partitions

Before database tables are physically partitioned, they need to be virtually partitioned in the application layer. You can’t have SQL queries that span partitioned (or soon to be partitioned) database tables.

Schema Domains

In order to implement Virtual Partitioning, GitHub first created schema domains, where a schema domain describes a tightly coupled set of database tables that are frequently used together in queries.

An example of a schema domain is the gists schema domain, which consists of the tables gists, gist_comments, and starred_gists. These tables would remain together after a partition.

GitHub stored a list of all the schema domains in a YAML configuration file. Here’s an example of a YAML file with the gists, repositories and users schema domains and their respective tables.

Now, GitHub needed to enforce these schema domains. They want to make sure that application code doesn’t have SQL queries or transactions that span schema domains.

They enforce this with SQL Linters.

SQL Linters

GitHub has two SQL linters (a Query linter and a Transaction linter) that enforce virtual boundaries between the schema domains.

They identify any violating queries and transactions that span schema domains and throw an exception with a helpful message for the developer.

Transactions aren’t allowed to span multiple schema domains because after the partition, those transactions will no longer be able to guarantee consistency.

MySQL transactions guarantee consistency across tables within a database, but partitioned schema domains will be in different database clusters.

Moving Data without Downtime

Now that GitHub has virtually isolated schema domains, they can physically move their schema domains to separate database clusters.

In order to do this on the fly, GitHub uses Vitess. Vitess is an open source database clustering system for MySQL that was originally developed at YouTube.

Vitess was serving all YouTube database traffic from 2011 to 2019, so it’s battle-tested.

GitHub uses Vitess’ vertical sharding feature to move sets of tables together in production without downtime.

To do that, GitHub uses Vitess’ VTGate proxies as the endpoint for applications to connect to instead of direct connections to MySQL.

Vitess handles the rest.

For more details, you can read the full blog post here.

Quastor Daily is a free Software Engineering newsletter sends out FAANG Interview questions (with detailed solutions), Technical Deep Dives and summaries of Engineering Blog Posts.

Tech Snippets

Rewriting the heart of our sync engine - One of the most important pieces of Dropbox’s tech stack is their sync engine.When you download the Dropbox application, you can create a folder that will be synced with a folder in Dropbox’s cloud. The sync engine is responsible for reading local file changes, making those changes on the server and also reading server changes (if you make changes via Dropbox’s web app) and writing those changes locally.Dropbox recently rewrote the sync engine and this is a great blog post that delves into why and the challenges involved in the rewrite.
Modern Javascript: Everything you missed over the last 10 years - This is a great cheat sheet that lists all the features that have been added to JavaScript over the past decade.
A Gentle Intro to Assembly with Rust - This is a great blog post that looks at the assembly code generated by some simple Rust functions and breaks them down. You can get an idea of what kind of optimizations the compiler is doing and how things work under the hood.

Interview Question

You are given an array of integers sorted in non-decreasing order.

Write a function that removes any duplicates in the array so that each element appears only once.

Your function must use constant space, so your space complexity must be O(1).

The function should return k, where k is the number of unique items in the array. The first k items in the array should be the unique integers in sorted order from least to greatest.

The elements in the array after the first k elements can be whatever you choose.

Your function should run in O(n) or better.

Previous Solution

As a reminder, here’s our last question

Given a linked list, swap every two adjacent nodes and return its head.

You must solve the problem without modifying the values in the list’s nodes. Only the nodes themselves can be changed.

Here’s the question in LeetCode

Solution

We can solve this question quite elegantly by using recursion.

Our base case will be if either head or head.next is None.

If so, then we can just return head.

Otherwise, we’ll recursively call our swapPairs function on head.next.next, so that it swaps the nodes for the rest of our linked list (the nodes after the first 2 nodes).

Now, we have to swap the first 2 nodes, head and head.next. We do this using a temporary variable and then we return the new head (which was previously head.next).

Here’s the Python 3 code. What’s the time and space complexity of our solution? Reply back with your answer and we’ll tell you if you’re right/wrong.