How Notion Sharded Their Postgres Database

Hey Everyone!

Today we’ll be talking about

  • How Notion Sharded their Postgres Database

    • How the Notion team knew it was time to break up their Postgres database

    • Deciding to go with application-level sharding, using workspace ID as the partition key

    • How Notion migrated with double writes, backfill, verification and the switch-over

  • Tech Snippets

    • Comparing the web development experience with 10 different programming languages

    • How PNG works

    • How Netflix tests for Performance Regressions


How Notion Sharded Their Database

Notion is a web/mobile app for creating your personal workspace. You can store notes, tasks, wikis, kanban boards and other things in a Notion workspace and you can easily share it with other users.

Previously, Notion stored all of their users' workspace data on a Postgres monolith, which worked extremely well as the company scaled (over four orders of magnitude of growth in data and IOPS).

However, by mid-2020, it became clear that product usage was surpassing the abilities of their monolith.

To address this, Notion sharded their Postgres monolith into 480 logical shards evenly distributed across 32 physical databases.

Garrett Fidalgo is an Engineering Manager at Notion, and he wrote a great blog post on what made them shard their database, how they did it, and the final results.

Here’s a Summary

Deciding When to Shard

Although the Notion team knew that their Postgres monolith setup would eventually reach its limit, the engineers wanted to delay sharding for as long as possible.

Sharding adds a heavier maintenance burden, new constraints in the application code and much more architectural complexity. That engineering time/effort was better allocated towards product features and other more pressing concerns.

The inflection point arrived when the Postgres VACUUM process began to stall consistently. Although disk capacity could still be increased, failing to vacuum the database would eventually lead to a TXID wraparound failure, where Postgres runs out of transaction IDs (they're 32-bit values, so there are only around 4 billion of them). We'll explain this below.

This became an issue Notion had to address.

Brief Description of the VACUUM process

Postgres gives ACID guarantees for transactions; one of those guarantees is Isolation (the “I” in ACID). This means you can run multiple transactions concurrently, but the result will be as if those transactions had run sequentially. This makes it much easier to run (and reason about) concurrent transactions on Postgres, so you can handle a higher read/write load.

The method used to implement this is called Multiversion Concurrency Control (MVCC). One challenge that MVCC has to solve is how to handle transactions that delete/update a row without affecting other transactions that are reading that same row. Updating/deleting that data immediately could result in the other transaction reading inconsistent data.

MVCC solves this by not updating/deleting data immediately. Instead, it marks the old tuples with a deletion marker and stores a new version of the same row with the update.

Postgres (and other databases that use MVCC) provides a command called VACUUM, which analyzes the stored tuple versions and deletes the ones that are no longer needed, reclaiming disk space. VACUUM also “freezes” old row versions so their 32-bit transaction IDs can be safely reused.

If you don’t vacuum, you’ll waste a lot of disk space and, because those transaction IDs are never reclaimed, eventually hit a transaction ID wraparound failure.
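
To make this concrete, here's a minimal monitoring sketch showing how an operator might watch dead tuple buildup and transaction ID age. It assumes the psycopg2 driver and placeholder connection details; the system views it queries (pg_stat_user_tables and pg_database) are standard Postgres.

    import psycopg2  # assumes the psycopg2 driver is installed

    # Connection details are placeholders.
    conn = psycopg2.connect("dbname=app user=postgres")

    with conn.cursor() as cur:
        # Dead tuples are old row versions that VACUUM hasn't reclaimed yet.
        cur.execute("""
            SELECT relname, n_live_tup, n_dead_tup
            FROM pg_stat_user_tables
            ORDER BY n_dead_tup DESC
            LIMIT 10
        """)
        for table, live, dead in cur.fetchall():
            print(f"{table}: {live} live tuples, {dead} dead tuples")

        # age(datfrozenxid) is how far the oldest unfrozen transaction ID
        # in each database lags behind the current one. Postgres stops
        # accepting writes well before this reaches ~2 billion, so VACUUM
        # has to be able to keep up.
        cur.execute("SELECT datname, age(datfrozenxid) FROM pg_database")
        for db, xid_age in cur.fetchall():
            print(f"{db}: oldest unfrozen transaction ID is {xid_age} old")

    conn.close()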

Sharding Scheme

Due to the VACUUM process stalling, it became a necessity for Notion to implement sharding. They started considering their options.

They could implement sharding at the application-level, within the application logic itself. Or, they could use a managed sharding solution like Citus, an open source extension for sharding Postgres.

The team wanted complete control over the distribution of their data and they didn't want to deal with the opaque clustering logic of a managed solution. Therefore, they decided to implement sharding in the application logic.

With that, the Notion team had to answer the following questions

  • How should the data be partitioned?

  • How many logical shards spread out across how many physical machines?

When you use the Notion app, you do so within a Workspace. You might have separate workspaces for your job, personal life, side-business, etc.

Every document in Notion belongs to a workspace and users will typically query data within a single workspace at a time.

Therefore, the Notion team decided to partition by workspace and use the workspace ID as the partition key. This would determine which logical shard the data would go to.

Users will typically work in a single workspace at a time, so this limits the number of cross-shard joins the system would have to do.

To determine the number of logical shards and physical machines, the Notion team looked at the hardware constraints they were dealing with. Notion runs on AWS, so they looked at the disk I/O throughput of the various instance types and the costs around that.

They decided to go with 480 logical shards that were evenly distributed across 32 physical databases.

[Diagram: the single “Monolith” database, containing the block, space and comment tables, is split into 32 physical databases, each holding 15 logical shards of those tables.]
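
The post doesn't include Notion's routing code, but the idea can be sketched in a few lines of Python. The hash choice and names below are illustrative, not Notion's actual implementation:

    import uuid

    NUM_LOGICAL_SHARDS = 480  # logical shards (Postgres schemas)
    NUM_PHYSICAL_DBS = 32     # physical Postgres instances
    SHARDS_PER_DB = NUM_LOGICAL_SHARDS // NUM_PHYSICAL_DBS  # 15

    def logical_shard(workspace_id: uuid.UUID) -> int:
        # Map a workspace ID to one of the 480 logical shards.
        # (Illustrative; Notion's actual hash function isn't public.)
        return workspace_id.int % NUM_LOGICAL_SHARDS

    def physical_db(shard: int) -> int:
        # Pack 15 consecutive logical shards onto each physical database.
        return shard // SHARDS_PER_DB

    ws = uuid.uuid4()
    shard = logical_shard(ws)
    print(f"workspace {ws} -> shard {shard} -> database {physical_db(shard)}")

Having many more logical shards than machines leaves room to grow: a logical shard can later be moved to a new machine without changing which shard any workspace maps to.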

Migration

The migration process consisted of 4 steps

  1. Double-writes - Incoming writes get applied to both the old monolith and the new sharded system.

  2. Backfill - Once double-writing has begun, migrate the old data to the new database system

  3. Verification - Ensure the data in the new system is correct

  4. Switch-over - Switch over to the new system after ensuring everything is functioning correctly.

Double Writes

There are several ways to implement double writes

  • Write directly to both databases - For each incoming write, execute the write on both systems. The downside is that this increases write latency, and a failed write on either system leaves the two inconsistent, with those inconsistencies compounding as future writes are applied on top of diverging data.

  • Logical Replication - Use Postgres’ Logical Replication functionality to send data changes on one system to the other.

  • Audit Log and Catch-up Script - Create an audit log that keeps track of all the writes to one of the systems. A catch-up script iterates through the audit log and applies each update to the other system.

Notion went with the audit log strategy.
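
The blog post doesn't show the audit log implementation, but the pattern looks roughly like this sketch. The audit_log table, the block schema and the helper names are all hypothetical, and logical_shard() is the routing function sketched earlier:

    def write_with_audit(monolith, block_id, data):
        """Write path: update the monolith and append an audit log entry
        in the same transaction so no write can be missed."""
        with monolith, monolith.cursor() as cur:
            cur.execute("UPDATE block SET data = %s WHERE id = %s",
                        (data, block_id))
            cur.execute("INSERT INTO audit_log (block_id) VALUES (%s)",
                        (block_id,))

    def catch_up(monolith, shards, batch_size=1000):
        """Catch-up script: replay unapplied audit log entries against the
        sharded system. `shards` maps a logical shard number to a
        connection for that shard's physical database."""
        with monolith.cursor() as cur:
            cur.execute("SELECT id, block_id, workspace_id FROM audit_log"
                        " WHERE applied = false ORDER BY id LIMIT %s",
                        (batch_size,))
            entries = cur.fetchall()

        for log_id, block_id, workspace_id in entries:
            # Read the latest state of the row from the monolith...
            with monolith.cursor() as cur:
                cur.execute("SELECT data FROM block WHERE id = %s",
                            (block_id,))
                (data,) = cur.fetchone()

            # ...upsert it into the shard that owns this workspace...
            shard = shards[logical_shard(workspace_id)]
            with shard, shard.cursor() as cur:
                cur.execute("INSERT INTO block (id, data) VALUES (%s, %s)"
                            " ON CONFLICT (id) DO UPDATE"
                            " SET data = EXCLUDED.data",
                            (block_id, data))

            # ...and mark the audit log entry as applied.
            with monolith, monolith.cursor() as cur:
                cur.execute("UPDATE audit_log SET applied = true"
                            " WHERE id = %s", (log_id,))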

Backfilling Old Data

Once incoming writes were propagating to both systems, Notion started the data backfill process where old data was copied over to the new system.

They provisioned an m5.24xlarge AWS instance to handle the replication; the backfill took around three days.

Verifying Data Integrity

Now that data was backfilled and incoming writes were being propagated to both systems, Notion had to verify the data integrity of the new system.

They did this with

  • Verification Script - A script verified contiguous ranges of data starting from randomly selected UUID values, comparing each record on the monolith to the corresponding shard.

  • Dark Reads - Read queries would execute on both the old and new databases, compare the results, and log any discrepancies. This increased API latency, but it gave the Notion team confidence that the switch-over would be seamless.
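
As an illustration, the dark read pattern can be as simple as the following sketch (the query helpers here are hypothetical stand-ins for the two database clients):

    import logging

    def dark_read(query, params, read_monolith, read_sharded):
        """Serve the read from the monolith while comparing against the
        sharded system. read_monolith/read_sharded are hypothetical
        query helpers for the two systems."""
        old_result = read_monolith(query, params)
        new_result = read_sharded(query, params)
        if old_result != new_result:
            logging.warning("dark read mismatch for %s %s:"
                            " monolith=%r sharded=%r",
                            query, params, old_result, new_result)
        # The monolith stays the source of truth until switch-over.
        return old_result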

After verification, the Notion team switched over to the new system.

For more details, you can read the full blog post here.


Tech Snippets

  • I built 10 web apps… with 10 different languages - Jeff Delaney provides a great overview of 10 different web frameworks from 10 different languages. He talks about Ruby on Rails, Django (Python), Laravel (PHP), NextJS (JavaScript), Spring (Java), Phoenix (Elixir), Rocket (Rust), Gin (Go), Vapor (Swift) and Ktor (Kotlin). His opinion was that Ruby on Rails gave the best development experience.

  • How PNG works - Reducible runs an awesome YouTube channel where he explains computer science topics in an interactive and intuitive way. He’s like the 3Blue1Brown for CS. This is an awesome video where he explains how the PNG file format works and the tradeoffs made in its design.

  • How Netflix tests for performance regressions - Before releasing to production, Netflix does significant performance testing for their app. They check responsiveness, latency, startup time, memory usage and more. This blog post provides a detailed look into how Netflix runs these tests and what factors they’re checking.

  • Get a look into what CTOs are reading - If you find Quastor useful, you should check out Pointer.io. It's a reading club for software developers that sends out super high quality engineering-related content. It's read by CTOs, engineering managers and senior engineers so you should definitely sign up if you're on that path (or if you want to go down that path in the future). It's completely free! (cross promo)

Interview Question

Given an integer array nums sorted in non-decreasing order, return an array of the squares of each number sorted in non-decreasing order.

Previous Question

As a reminder, here’s our last question

Given the head of a linked list, rotate the list to the right by k places.

Solution

We can solve this question in three parts.

First, we count the number of nodes in the linked list and then set the next pointer at the tail of the linked list to the head. This makes our linked list a circular linked list.

Now, we want to break the circle in our linked list at the new tail. We use k and the length of the list to find it: rotating by the list's full length changes nothing, so the new tail sits length - (k % length) - 1 nodes from the head.

Then, we go to that node where our new tail is and set the next pointer to None, breaking the circular aspect of our linked list.

Now, we can return the new head of the linked list.
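
Here’s that approach written out in Python:

    class ListNode:
        def __init__(self, val=0, next=None):
            self.val = val
            self.next = next

    def rotate_right(head, k):
        if not head or not head.next or k == 0:
            return head

        # 1. Count the nodes and link the tail back to the head,
        #    turning the list into a circle.
        length, tail = 1, head
        while tail.next:
            tail = tail.next
            length += 1
        tail.next = head

        # 2. The new tail sits length - (k % length) - 1 nodes from
        #    the old head; the node after it becomes the new head.
        new_tail = head
        for _ in range(length - (k % length) - 1):
            new_tail = new_tail.next

        # 3. Break the circle at the new tail and return the new head.
        new_head = new_tail.next
        new_tail.next = None
        return new_head

    # Example: 1 -> 2 -> 3 -> 4 -> 5 rotated right by 2
    # becomes 4 -> 5 -> 1 -> 2 -> 3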