How Reddit Built a Metadata Store That Handles 100k Reads per Second

We'll talk about the design of Reddit's Metadata Store and the tech behind it. Plus, how Python's Asyncio works, how GitLab automates engineering management and more.

Hey Everyone!

Today we’ll be talking about

  • How Reddit built a metadata store that handles 100k reads per second

    • High-level goals of Reddit’s metadata store

    • Picking Sharded Postgres vs. Cassandra

    • Data Migration Process

    • Scaling with sharding and denormalization

  • Tech Snippets

    • How Python Asyncio Works

    • Writing a File System from Scratch in Rust

    • How GitLab automates engineering management

A small improvement to the recommendation algorithm at Meta or YouTube results in billions of dollars of additional revenue. It’s one of the highest leveraged areas of work for engineers. 

Even more exciting, the field of recommendation systems is rapidly changing with the incorporation of LLMs into retrieval and ranking, an emphasis on deep learning with two-tower neural network architectures, and the widespread introduction of multi-modal data to enhance relevance.

At the Index Conference next week, there will be a session dedicated to discussing the newest innovations in recommendation systems with speakers:

  • Vishal Kathuria, former Technical Director in AI Research at Meta

  • Jaya Kawale, Vice President of Engineering and ML at Tubi

  • Shu Zhang, Director of Engineering at Pinterest

  • Yexi Jiang, Principal Engineer at Roblox

  • Nikhil Garg, CEO of Fennel and ex-Head of Platform and Infrastructure at Quora

It’s completely free to attend and will take place on May 16th.

You can join the conference virtually via Zoom or in-person at the Computer History Museum in Mountain View, CA.

sponsored

How Reddit Built a Metadata Store That Handles 100k Reads per Second

Over the past few years, Reddit’s user base has roughly doubled in size, going from 430 million monthly active users in 2020 to 850 million in 2023.

The good news with all this growth is that they could finally IPO and let employees cash in on their stock options. The bad news is that the engineering team had to deal with a bunch of headaches.

One issue Reddit faced was with their media metadata store. Reddit is built on AWS and GCP, and any media uploaded to the site (images, videos, GIFs, etc.) is stored on AWS S3.

Every piece of uploaded media also comes with metadata: video thumbnails, playback URLs, S3 file locations, image resolutions, etc.

Previously, Reddit’s media metadata was distributed across different storage systems. To make this easier to manage, the engineering team wanted to create a unified system for managing all this data.

The high-level design goals were

  • Single System - They needed a single system that could store all of Reddit’s media metadata. Reddit is growing quickly, so this system needs to be highly scalable; at the current rate of growth, Reddit expects their media metadata to reach 50 terabytes by 2030.

  • Read-Heavy Workload - The data store will have a very read-heavy workload. It needs to handle over 100k reads per second with latency under 50 ms.

  • Support Writes - The data store should also support data creation and updates. However, these requests have significantly lower traffic than reads, and Reddit can tolerate higher latency for them.

Reddit wrote a fantastic article delving into their process for creating a unified media metadata store.

We’ll be summarizing the article and adding some extra context.

We’ll cover a ton of concepts on Postgres, Cassandra and database scaling strategies in this article.

If you’d like Spaced Repetition Flashcards (Anki) on all the concepts discussed in Quastor, check out Quastor Pro.

When you join, you’ll also get an up-to-date PDF with our past articles.

Picking the Database

To build this media metadata store, Reddit considered two choices: Sharded Postgres or Cassandra.

Postgres

Postgres is one of the most popular relational databases in the world and is consistently voted the most loved database in Stack Overflow’s developer surveys.

Some of the pros for Postgres are

  • Battle Tested - Tens of thousands of companies use Postgres, and there have been countless benchmarks and tests of its performance and scalability. Postgres is used (or has been used) at companies like Uber, Skype, Spotify, etc.

    With this comes a massive wealth of knowledge around potential issues, common bugs, pitfalls, etc. on forums like Stack Overflow, Slack/IRC, mailing lists and more.

  • Open Source & Community - Postgres has been open source since 1995 with a liberal license similar to the BSD and MIT licenses. There’s a vibrant community of developers who teach the database and provide support to people running into issues. Postgres also has outstanding documentation.

  • Extensibility & Interoperability - One of the initial design goals of Postgres was extensibility. Over its 30-year history, countless extensions have been developed to make Postgres more powerful. We’ll talk about pg_partman, a Postgres extension that Reddit uses for table partitioning, later in this article.

We did a detailed tech dive on Postgres that you can check out here.

Cassandra

Cassandra is a distributed NoSQL database created at Facebook in 2007. The initial project was heavily inspired by Google’s Bigtable and also took many ideas from Amazon’s Dynamo.

Here are some characteristics of Cassandra:

  • Large Scale - Cassandra is completely distributed and can scale to a massive size. It has out-of-the-box support for distributing and replicating data across different locations.

    Additionally, Cassandra is designed with a decentralized architecture to minimize central points of failure.

  • Highly Tunable - Cassandra is highly customizable, so you can configure it based on your exact workload. You can change how communication between nodes happens (gossip protocol), how data is read from disk (LSM trees), how many replicas must acknowledge a read or write (consistency level) and much more (see the short sketch after this list).

  • Wide Column - Cassandra uses a wide-column storage model, which allows for flexible storage. Data is organized into column families, where each family can have many rows with varying numbers of columns. You can read/write large amounts of data quickly and also add new columns without having to do a heavyweight schema migration.
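
To make the tunable consistency point concrete, here’s a minimal sketch using the DataStax Python driver, where the consistency level is set per query. The contact points, keyspace, table and columns are hypothetical placeholders, not Reddit’s actual setup.

```python
# A minimal sketch of per-query tunable consistency with the DataStax Python
# driver (pip install cassandra-driver). Contact points, keyspace, table and
# columns are hypothetical placeholders, not Reddit's actual setup.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("media")  # hypothetical keyspace

# Write path: wait for a majority of replicas to acknowledge the write.
insert = SimpleStatement(
    "INSERT INTO media_metadata (post_id, thumbnail_url) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, (12345, "https://example.com/thumb.jpg"))

# Read path: trade stronger consistency for lower latency by reading from a
# single replica.
select = SimpleStatement(
    "SELECT thumbnail_url FROM media_metadata WHERE post_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(select, (12345,)).one()
```

QUORUM writes with ONE reads is just one possible trade-off; the right levels depend entirely on the workload.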

We did a much more detailed dive on Cassandra that you can read here.

Picking Postgres

After evaluating both choices extensively, Reddit decided to go with Postgres.

The reasons why included:

  • Challenges with Managing Cassandra - Managing Cassandra came with operational challenges; ad-hoc queries for debugging and visibility were far more difficult than with Postgres.

  • Data Denormalization Issues with Cassandra - In Cassandra, data is typically denormalized and stored in a way to optimize specific queries (based on your application). However, this can lead to challenges when creating new queries that your data hasn’t been specifically modeled for.

Reddit uses AWS, so they went with AWS Aurora Postgres. For more on AWS RDS, you can check out a detailed tech dive we did here.

Data Migration

Migrating to Postgres was a big challenge for the team. They had to transfer terabytes of data from the different systems to Postgres while ensuring that the legacy systems could continue serving over 100k reads per second.

Here are the steps they went through for the migration (a rough sketch of the dual-write/dual-read pattern follows the list):

  1. Dual Writes - Write any new media metadata to both the old systems and to Postgres.

  2. Backfill Data - Backfill data from the older systems into Postgres.

  3. Dual Reads - Once Postgres has enough data, enable dual reads so that read requests are served by both Postgres and the old systems.

  4. Monitoring and Ramp Up - Compare the results from the dual reads and fix any data gaps. Slowly ramp up traffic to Postgres until the full cutover.
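
Here’s a rough sketch of how the dual-write and dual-read steps can look at the application layer. The store classes, field names and mismatch logging are hypothetical; the article doesn’t describe Reddit’s actual implementation.

```python
# A rough sketch of the dual-write / dual-read migration pattern. The
# old_store / new_store clients and the mismatch logging are hypothetical
# stand-ins, not Reddit's actual code.
import logging

logger = logging.getLogger("migration")


class DualMetadataStore:
    def __init__(self, old_store, new_store, read_from_new: bool = False):
        self.old_store = old_store    # legacy systems, still the source of truth
        self.new_store = new_store    # Postgres
        self.read_from_new = read_from_new

    def write(self, post_id: int, metadata: dict) -> None:
        # Step 1: dual writes - every new write goes to both systems.
        self.old_store.write(post_id, metadata)
        self.new_store.write(post_id, metadata)

    def read(self, post_id: int) -> dict:
        # Step 3: dual reads - serve the request from both systems.
        old_value = self.old_store.read(post_id)
        new_value = self.new_store.read(post_id)
        if old_value != new_value:
            # Step 4: monitoring - surface data gaps so they can be backfilled.
            logger.warning("metadata mismatch for post_id=%s", post_id)
        # Ramp up: flip this flag (or use a percentage rollout) to cut over.
        return new_value if self.read_from_new else old_value
```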

Scaling Strategies

With that strategy, Reddit was able to successfully migrate over to Postgres.

Currently, they’re seeing peak loads of ~100k reads per second. At that load, the latency numbers they’re seeing with Postgres are

  • 2.6 ms P50 - 50% of requests have a latency lower than 2.6 milliseconds

  • 4.7 ms P90 - 90% of requests have a latency lower than 4.7 milliseconds

  • 17 ms P99 - 99% of requests have a latency lower than 17 milliseconds

They’re able to achieve this without needing a read-through cache.

We’ll talk about some of the strategies they’re using to scale.

Table Partitioning

At the current pace of media content creation, Reddit expects their media metadata to reach roughly 50 terabytes by 2030. This means they need to implement sharding and partition their tables across multiple Postgres instances.

Reddit shards their tables on post_id using range-based partitioning: all posts with a post_id in a certain range are put on the same database shard.

post_id increases monotonically, so the table is effectively partitioned by time period.

Many of their read requests involve batch queries on multiple IDs from the same time period, so this design helps minimize cross-shard joins.

Reddit uses the pg_partman Postgres extension to manage the table partitioning.
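
As a rough illustration of the approach, here’s what range partitioning on post_id could look like with Postgres’s native declarative partitioning, issued through psycopg2. The table, columns and ranges are made up; in practice, pg_partman automates creating and maintaining partition sets like this, and routing ranges of post_id to different Postgres instances is the sharding layer on top.

```python
# A rough sketch of range partitioning on post_id with Postgres's native
# declarative partitioning. Table/column names and ranges are illustrative;
# Reddit automates partition management with the pg_partman extension.
import psycopg2

conn = psycopg2.connect("dbname=media")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Parent table, partitioned by ranges of post_id.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS media_metadata (
            post_id  BIGINT NOT NULL,
            metadata JSONB  NOT NULL,
            PRIMARY KEY (post_id)
        ) PARTITION BY RANGE (post_id);
    """)
    # Each partition covers a contiguous range of post_ids. Since post_id
    # grows monotonically, each partition roughly maps to a time period, so
    # batch reads for posts from the same period hit a single partition.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS media_metadata_p0
            PARTITION OF media_metadata
            FOR VALUES FROM (0) TO (100000000);
    """)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS media_metadata_p1
            PARTITION OF media_metadata
            FOR VALUES FROM (100000000) TO (200000000);
    """)
```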

Denormalization

Another way Reddit minimizes joins is by using denormalization.

They took all the metadata fields required for displaying an image post and put them together into a single JSONB field. Instead of fetching different fields and combining them, they can just fetch that single JSONB field.

This made it much more efficient to fetch all the data needed to render a post.

[Image: all the metadata needed to render an image post]

It also simplified the querying logic, especially across different media types. Instead of worrying about exactly which data fields you need, you just fetch the single JSONB value.
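
Here’s a small sketch of what that read path could look like, assuming a hypothetical media_metadata table with a single JSONB metadata column: one batch query returns everything needed to render each post, with no joins.

```python
# A small sketch of the denormalized JSONB read path. Table, column and field
# names are hypothetical; the point is that one fetch per post returns all the
# metadata needed to render it, with no joins across metadata tables.
import psycopg2

conn = psycopg2.connect("dbname=media")  # hypothetical connection string


def get_post_metadata(post_ids: list[int]) -> dict[int, dict]:
    """Batch-fetch the single denormalized JSONB blob for each post."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT post_id, metadata FROM media_metadata WHERE post_id = ANY(%s)",
            (post_ids,),
        )
        # psycopg2 returns JSONB columns as already-parsed Python dicts.
        return {post_id: metadata for post_id, metadata in cur.fetchall()}


# e.g. {101: {"thumbnail_url": ..., "s3_key": ..., "resolution": ...}, ...}
posts = get_post_metadata([101, 102, 103])
```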

Index is a conference for backend engineers who want to learn about building search, analytics, and AI applications at scale.

Some of the talks this year will cover:

  • How Meta Built FAISS (an extremely popular vector search library) by Matthijs Douze, Research Scientist at Meta AI Research and co-creator of FAISS

  • How DoorDash’s Shopping Recommendation System Works by Sudeep Das, Head of Machine Learning and AI at DoorDash

  • The Tech Behind the Online Data Systems Netflix Uses to Serve the Homepage by Shriya Arora, Engineering Manager at Netflix

  • The Architecture of the Uber Eats Recommendation System by Bo Ling, a Staff Software ML Engineer at Uber

You can join the conference virtually through Zoom or you can attend in-person at the Computer History Museum in Mountain View, CA.

It’ll be a fantastic learning experience if you’re a backend engineer and also a great networking opportunity.

It’s completely free to join!

sponsored

Tech Snippets