How Reddit built a Metadata Store that Handles 100k Reads per Second
We'll talk about the design of Reddit's Metadata Store and the tech behind it. Plus, how Python's Asyncio works, how GitLab automates engineering management and more.
Hey Everyone!
Today we’ll be talking about
How Reddit built a Metadata store that handles 100k reads per second
High level goals of Reddit’s metadata store
Picking Sharded Postgres vs. Cassandra
Data Migration Process
Scaling with sharding and denormalization
Tech Snippets
How Python Asyncio Works
Writing a File System from Scratch in Rust
How GitLab automates engineering management
How Reddit built a Metadata Store that Handles 100k Reads per Second
Over the past few years, Reddit has seen their user-base double in size. They went from 430 million monthly active users in 2020 to 850 million in 2023.
The good news with all this growth is that they could finally IPO and let employees cash in on their stock options. The bad news is that the engineering team had to deal with a bunch of headaches.
One issue Reddit faced was with their media metadata store. Reddit runs on AWS and GCP, and any media uploaded to the site (images, videos, GIFs, etc.) is stored in AWS S3.
Every piece of media uploaded also comes with metadata: video thumbnails, playback URLs, S3 file locations, image resolution, and so on.
Previously, Reddit’s media metadata was distributed across different storage systems. To make this easier to manage, the engineering team wanted to create a unified system for managing all this data.
The high-level design goals were:
Single System - they needed a single system that could store all of Reddit’s media metadata. Reddit’s growing quickly, so this system needs to be highly scalable. At the current rate of growth, Reddit expects the size of their media metadata to be 50 terabytes by 2030.
Read Heavy Workload - This data store will have a very read-heavy workload. It needs to handle over 100k reads per second with latency less than 50 ms.
Support Writes - The data store should also support data creation/updates. However, these requests have significantly lower traffic than reads, and Reddit can tolerate higher latency for them.
Reddit wrote a fantastic article delving into their process for creating a unified media metadata store.
We’ll be summarizing the article and adding some extra context.
We’ll cover a ton of concepts on Postgres, Cassandra and database scaling strategies in this article.
If you’d like Spaced Repetition Flashcards (Anki) on all the concepts discussed in Quastor, check out Quastor Pro.
When you join, you’ll also get an up-to-date PDF with our past articles.
Picking the Database
To build this media metadata store, Reddit considered two choices: Sharded Postgres or Cassandra.
Postgres
Postgres is one of the most popular relational databases in the world and is consistently voted most loved database in Stack Overflow’s developer surveys.
Some of the pros for Postgres are:
Battle Tested - Tens of thousands of companies use Postgres, and there have been countless benchmarks and scalability tests over the years. Postgres is used (or has been used) at companies like Uber, Skype, Spotify, etc.
With this, there’s a massive wealth of knowledge around potential issues, common bugs, pitfalls and more on forums like Stack Overflow, Slack/IRC channels and mailing lists.
Open Source & Community - Postgres has been open source since 1995 with a liberal license similar to the BSD and MIT licenses. There’s a vibrant community of developers who help others learn the database and provide support when issues come up. Postgres also has outstanding documentation.
Extensibility & Interoperability - One of the initial design goals of Postgres was extensibility. Over it’s 30 year history, there’s been a countless number of extensions that have been developed to make Postgres more powerful. We’ll talk about a couple Postgres extensions that Reddit uses for sharding.
We did a detailed tech dive on Postgres that you can check out here.
Cassandra
Cassandra is a NoSQL, distributed database created at Facebook in 2007. The initial project was heavily inspired by Google Bigtable and also took many ideas from Amazon’s Dynamo.
Here are some characteristics of Cassandra:
Large Scale - Cassandra is completely distributed and can scale to a massive size. It has out-of-the-box support for distributing and replicating data across different locations.
Additionally, Cassandra is designed with a decentralized architecture to minimize any central points of failure.
Highly Tunable - Cassandra is highly customizable, so you can configure it based on your exact workload. You can change how communication between nodes happens (gossip protocol), how data is stored and read on disk (LSM trees), the consensus required between nodes for writes (consistency level) and much more.
Wide Column - Cassandra uses a wide-column storage model, which allows for flexible storage. Data is organized into column families, where each family can have multiple rows with varying numbers of columns. You can read/write large amounts of data quickly and add new columns without having to do a schema migration (see the sketch below).
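To make the wide-column model concrete, here’s a minimal sketch using the Python cassandra-driver. The keyspace, table, and column names are hypothetical (this is not Reddit’s schema); the point is that a partition key plus a clustering key lets one partition hold an arbitrary number of attribute cells, so posts can carry different sets of attributes.

```python
# Minimal sketch of the wide-row pattern with the Python cassandra-driver.
# All names here are hypothetical, for illustration only.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # assumes a local Cassandra node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS media
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# One partition per post; each attribute becomes another cell in that
# partition, so different posts can store different sets of attributes.
session.execute("""
    CREATE TABLE IF NOT EXISTS media.post_metadata (
        post_id   text,
        attribute text,
        value     text,
        PRIMARY KEY (post_id, attribute)
    )
""")

session.execute(
    "INSERT INTO media.post_metadata (post_id, attribute, value) VALUES (%s, %s, %s)",
    ("t3_abc123", "thumbnail_url", "https://example.com/thumb.jpg"),
)
```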
We did a much more detailed dive on Cassandra that you can read here.
Picking Postgres
After evaluating both choices extensively, Reddit decided to go with Postgres.
The reasons why included:
Challenges with Managing Cassandra - The team found Cassandra harder to operate. Ad-hoc queries for debugging and visibility were far more difficult compared to Postgres.
Data Denormalization Issues with Cassandra - In Cassandra, data is typically denormalized and modeled to optimize specific, known queries. This leads to challenges when new queries come up that the data hasn’t been modeled for (a hypothetical example follows).
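To make that denormalization point concrete, here’s a hypothetical sketch (again with the Python cassandra-driver; none of these table or column names are Reddit’s actual schema). Each access pattern gets its own table, so supporting a new query pattern typically means creating and backfilling another denormalized copy of the data rather than just writing a new query.

```python
# Hypothetical illustration of query-driven modeling in Cassandra:
# one denormalized table per access pattern.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("media")  # assumes the keyspace exists

# Table modeled for "fetch metadata by post_id".
session.execute("""
    CREATE TABLE IF NOT EXISTS metadata_by_post (
        post_id       text PRIMARY KEY,
        thumbnail_url text,
        playback_url  text
    )
""")

# A later feature that needs "recent posts in a subreddit" can't be served
# efficiently by the table above. The usual fix is a second table that
# duplicates the same data, keyed the way the new query reads it.
session.execute("""
    CREATE TABLE IF NOT EXISTS metadata_by_subreddit (
        subreddit     text,
        created_at    timestamp,
        post_id       text,
        thumbnail_url text,
        playback_url  text,
        PRIMARY KEY (subreddit, created_at, post_id)
    ) WITH CLUSTERING ORDER BY (created_at DESC, post_id ASC)
""")
```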
Reddit uses AWS, so they went with AWS Aurora Postgres. For more on AWS RDS, you can check out a detailed tech dive we did here.
Data Migration
Migrating to Postgres was a big challenge for the team. They had to transfer terabytes of data from the different systems to Postgres while ensuring that the legacy systems could continue serving over 100k reads per second.
Here are the steps they went through for the migration (a sketch of the dual-write/dual-read pattern follows the list):
Dual Writes - Any new media metadata would be written to both the old systems and to Postgres.
Backfill Data - Data from the older systems would be backfilled into Postgres.
Dual Reads - Once Postgres had enough data, dual reads were enabled so that read requests were served by both Postgres and the old systems.
Monitoring and Ramp Up - Results from the dual reads were compared and any data gaps were fixed. Traffic to Postgres was slowly ramped up until they could fully cut over.
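Below is a minimal sketch of the dual-write / dual-read pattern in Python. The legacy_store and postgres_store objects and their get/put methods are assumptions for illustration, not Reddit’s actual code; what matters is the shape of the logic: write to both stores, shadow-read both, log mismatches so backfills can repair them, and gradually ramp read traffic over to Postgres.

```python
import logging
import random

logger = logging.getLogger("metadata-migration")

# Fraction of reads served from Postgres; raised gradually toward 1.0
# as confidence in the new store grows.
POSTGRES_READ_RATIO = 0.10


def write_metadata(post_id, metadata, legacy_store, postgres_store):
    """Dual write: new media metadata goes to both the old systems and Postgres."""
    legacy_store.put(post_id, metadata)
    postgres_store.put(post_id, metadata)


def read_metadata(post_id, legacy_store, postgres_store):
    """Dual read: query both stores, compare results, and ramp traffic to Postgres."""
    legacy_result = legacy_store.get(post_id)
    postgres_result = postgres_store.get(post_id)

    if legacy_result != postgres_result:
        # Data gap: surface it so a backfill job can repair Postgres.
        logger.warning("metadata mismatch for post %s", post_id)

    # Slowly ramp up the share of reads served by Postgres until full cutover.
    if random.random() < POSTGRES_READ_RATIO:
        return postgres_result
    return legacy_result
```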
Scaling Strategies
With that strategy, Reddit was able to successfully migrate over to Postgres.
Currently, they’re seeing peak loads of ~100k reads per second. At that load, the latency numbers with Postgres are:
2.6 ms P50 - 50% of requests have a latency lower than 2.6 milliseconds
4.7 ms P90 - 90% of requests have a latency lower than 4.7 milliseconds
17 ms P99 - 99% of requests have a latency lower than 17 milliseconds
They’re able to achieve this without needing a read-through cache.
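As a quick illustration of what those percentile figures mean, here’s a tiny sketch (with made-up latency samples, not Reddit’s data) that computes P50/P90/P99 over a set of request latencies.

```python
import numpy as np

# Hypothetical request latencies in milliseconds, for illustration only.
latencies_ms = np.array([1.8, 2.1, 2.6, 3.0, 3.9, 4.7, 5.5, 8.2, 12.0, 17.0])

# P50 is the value 50% of requests fall under, and likewise for P90 and P99.
p50, p90, p99 = np.percentile(latencies_ms, [50, 90, 99])
print(f"P50={p50:.1f} ms  P90={p90:.1f} ms  P99={p99:.1f} ms")
```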
We’ll talk about some of the strategies they’re using to scale.
Table Partitioning
At the current pace of media content creation, Reddit expects their media metadata to reach roughly 50 terabytes by 2030. This means they need to implement sharding and partition their tables across multiple Postgres instances.
Reddit shards their tables based on post_id using range-based partitioning: all posts with a post_id in a certain range are placed on the same database shard. Since post_id increases monotonically, their tables are effectively partitioned by time period.
Many of their read requests involve batch queries on multiple IDs from the same time period, so this design helps minimize cross-shard joins.
Reddit uses the pg_partman Postgres extension to manage the table partitioning.
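Here’s a rough sketch of what range partitioning on post_id could look like using native Postgres declarative partitioning, run from Python with psycopg2. The table name, schema, and range boundaries are hypothetical, not Reddit’s actual DDL; in practice, pg_partman automates creating partitions like these, and the data is also sharded across multiple Postgres instances.

```python
import psycopg2

# Placeholder connection string, for illustration only.
conn = psycopg2.connect("dbname=media_metadata user=app")
cur = conn.cursor()

# Parent table partitioned by ranges of post_id. Since post_id increases
# monotonically, each partition roughly maps to a time period.
cur.execute("""
    CREATE TABLE IF NOT EXISTS post_media_metadata (
        post_id  bigint NOT NULL,
        metadata jsonb,
        PRIMARY KEY (post_id)
    ) PARTITION BY RANGE (post_id)
""")

# Individual partitions, each covering a range of post_ids
# (pg_partman can automate creating these ahead of time).
cur.execute("""
    CREATE TABLE IF NOT EXISTS post_media_metadata_p0
        PARTITION OF post_media_metadata
        FOR VALUES FROM (0) TO (100000000)
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS post_media_metadata_p1
        PARTITION OF post_media_metadata
        FOR VALUES FROM (100000000) TO (200000000)
""")

conn.commit()
```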
Denormalization
Another way Reddit minimizes joins is by using denormalization.
They took all the metadata fields required for displaying an image post and put them together into a single JSONB field. Instead of fetching different fields and combining them, they can just fetch that single JSONB field.
This made it much more efficient to fetch all the data needed to render a post.
[Image: all the metadata needed to render an image post]
It also simplified the querying logic, especially across different media types. Instead of worrying about exactly which data fields are needed for each type, you just fetch the single JSONB value.
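As a rough sketch of that access pattern (reusing the hypothetical post_media_metadata table from the partitioning example above, not Reddit’s actual schema): all the fields needed to render a post are collapsed into one jsonb value, so a single-row lookup returns everything, and psycopg2 hands the jsonb column back as a Python dict.

```python
import psycopg2

conn = psycopg2.connect("dbname=media_metadata user=app")  # placeholder DSN
cur = conn.cursor()

# Write: every metadata field for the post lives in one jsonb value.
cur.execute(
    "INSERT INTO post_media_metadata (post_id, metadata) VALUES (%s, %s::jsonb)",
    (
        123456789,
        '{"type": "image", "url": "s3://bucket/key.jpg", "width": 1280,'
        ' "height": 720, "thumbnail_url": "s3://bucket/thumb.jpg"}',
    ),
)
conn.commit()

# Read: one indexed lookup returns everything needed to render the post.
cur.execute("SELECT metadata FROM post_media_metadata WHERE post_id = %s", (123456789,))
(metadata,) = cur.fetchone()
print(metadata["width"], metadata["thumbnail_url"])  # jsonb comes back as a dict
```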