How Reddit Built a Metadata Store That Handles 100k Reads per Second

We'll talk about the design of Reddit's Metadata Store and the tech behind it. Plus, how Python's Asyncio works, how GitLab automates engineering management and more.

Hey Everyone!

Today we’ll be talking about

  • How Reddit built a metadata store that handles 100k reads per second

    • High-level goals of Reddit’s metadata store

    • Picking Sharded Postgres vs. Cassandra

    • Data Migration Process

    • Scaling with sharding and denormalization

  • Tech Snippets

    • How Python Asyncio Works

    • Writing a File System from Scratch in Rust

    • How GitLab automates engineering management

A small improvement to the recommendation algorithm at Meta or YouTube results in billions of dollars of additional revenue. It’s one of the highest leveraged areas of work for engineers. 

Even more exciting, the field of recommendation systems is rapidly changing with the incorporation of LLMs into retrieval and ranking, an emphasis on deep learning with two-tower neural network architectures, and the widespread introduction of multi-modal data to enhance relevance.

At the Index Conference next week, there will be a session dedicated to discussing the newest innovations in recommendation systems with speakers:

  • Vishal Kathuria, former Technical Director in AI Research at Meta

  • Jaya Kawale, Vice President of Engineering and ML at Tubi

  • Shu Zhang, Director of Engineering at Pinterest

  • Yexi Jiang, Principal Engineer at Roblox

  • Nikhil Garg, CEO of Fennel and ex-Head of Platform and Infrastructure at Quora

It’s completely free to attend and will take place on May 16th.

You can join the conference virtually via Zoom or in-person at the Computer History Museum in Mountain View, CA.

sponsored

How Reddit Built a Metadata Store That Handles 100k Reads per Second

Over the past few years, Reddit’s user base has roughly doubled in size, going from 430 million monthly active users in 2020 to 850 million in 2023.

The good news with all this growth is that they could finally IPO and let employees cash in on their stock options. The bad news is that the engineering team had to deal with a bunch of headaches.

One issue Reddit faced was with their media metadata store. Reddit is built on AWS and GCP, and any media uploaded to the site (images, videos, GIFs, etc.) is stored on AWS S3.

Every piece of uploaded media also comes with metadata: video thumbnails, playback URLs, S3 file locations, image resolutions, etc.

Previously, Reddit’s media metadata was distributed across different storage systems. To make this easier to manage, the engineering team wanted to create a unified system for managing all this data.

The high-level design goals were

  • Single System - They needed a single system that could store all of Reddit’s media metadata. Reddit is growing quickly, so this system needs to be highly scalable; at the current rate of growth, Reddit expects their media metadata to reach 50 terabytes by 2030.

  • Read-Heavy Workload - The data store will have a very read-heavy workload. It needs to handle over 100k reads per second with latency under 50 ms.

  • Support Writes - The data store should also support data creation and updates. However, these requests have significantly lower traffic than reads, and Reddit can tolerate higher latency for them.

Reddit wrote a fantastic article delving into their process for creating a unified media metadata store.

We’ll be summarizing the article and adding some extra context.

We’ll cover a ton of concepts on Postgres, Cassandra and database scaling strategies in this article.

If you’d like Spaced Repetition Flashcards (Anki) on all the concepts discussed in Quastor, check out Quastor Pro.

When you join, you’ll also get an up-to-date PDF with our past articles.

Picking the Database

To build this media metadata store, Reddit considered two choices: Sharded Postgres or Cassandra.

Postgres

Postgres is one of the most popular relational databases in the world and is consistently voted the most loved database in Stack Overflow’s developer surveys.

Some of the pros for Postgres are

  • Battle Tested - Tens of thousands of companies use Postgres, and there have been countless benchmarks and tests of its performance and scalability. Postgres is used (or has been used) at companies like Uber, Skype, Spotify, etc.

    With this comes a massive wealth of knowledge around potential issues, common bugs, pitfalls, etc. on forums like Stack Overflow, Slack/IRC, mailing lists and more.

  • Open Source & Community - Postgres has been open source since 1995 with a liberal license similar to the BSD and MIT licenses. There’s a vibrant community of developers who teach the database and provide support to people running into issues. Postgres also has outstanding documentation.

  • Extensibility & Interoperability - One of the initial design goals of Postgres was extensibility. Over its 30-year history, countless extensions have been developed to make Postgres more powerful. We’ll talk about pg_partman, a Postgres extension that Reddit uses for table partitioning, later in this article.

We did a detailed tech dive on Postgres that you can check out here.

Cassandra

Cassandra is a distributed NoSQL database created at Facebook in 2007. The initial project was heavily inspired by Google’s Bigtable and also took many ideas from Amazon’s Dynamo.

Here are some characteristics of Cassandra:

  • Large Scale - Cassandra is completely distributed and can scale to a massive size. It has out-of-the-box support for distributing and replicating data across different locations.

    Additionally, Cassandra is designed with a decentralized architecture to minimize central points of failure.

  • Highly Tunable - Cassandra is highly customizable, so you can configure it based on your exact workload. You can change how communication between nodes happens (gossip protocol), how data is read from disk (LSM trees), how many replicas must acknowledge a read or write (consistency level) and much more (see the short sketch after this list).

  • Wide Column - Cassandra uses a wide-column storage model, which allows for flexible storage. Data is organized into column families, where each family can have many rows with varying numbers of columns. You can read/write large amounts of data quickly and also add new columns without having to do a heavyweight schema migration.
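
To make the tunable consistency point concrete, here’s a minimal sketch using the DataStax Python driver, where the consistency level is set per query. The contact points, keyspace, table and columns are hypothetical placeholders, not Reddit’s actual setup.

```python
# A minimal sketch of per-query tunable consistency with the DataStax Python
# driver (pip install cassandra-driver). Contact points, keyspace, table and
# columns are hypothetical placeholders, not Reddit's actual setup.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("media")  # hypothetical keyspace

# Write path: wait for a majority of replicas to acknowledge the write.
insert = SimpleStatement(
    "INSERT INTO media_metadata (post_id, thumbnail_url) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, (12345, "https://example.com/thumb.jpg"))

# Read path: trade stronger consistency for lower latency by reading from a
# single replica.
select = SimpleStatement(
    "SELECT thumbnail_url FROM media_metadata WHERE post_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(select, (12345,)).one()
```

QUORUM writes with ONE reads is just one possible trade-off; the right levels depend entirely on the workload.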

We did a much more detailed dive on Cassandra that you can read here.

Picking Postgres

After evaluating both choices extensively, Reddit decided to go with Postgres.

The reasons why included:

  • Challenges with Managing Cassandra - Managing Cassandra came with operational challenges; ad-hoc queries for debugging and visibility were far more difficult than with Postgres.

  • Data Denormalization Issues with Cassandra - In Cassandra, data is typically denormalized and stored in a way to optimize specific queries (based on your application). However, this can lead to challenges when creating new queries that your data hasn’t been specifically modeled for.

Reddit uses AWS, so they went with AWS Aurora Postgres. For more on AWS RDS, you can check out a detailed tech dive we did here.

Data Migration

Migrating to Postgres was a big challenge for the team. They had to transfer terabytes of data from the different systems to Postgres while ensuring that the legacy systems could continue serving over 100k reads per second.

Here are the steps they went through for the migration (a rough sketch of the dual-write/dual-read pattern follows the list):

  1. Dual Writes - Write any new media metadata to both the old systems and to Postgres.

  2. Backfill Data - Backfill data from the older systems into Postgres.

  3. Dual Reads - Once Postgres has enough data, enable dual reads so that read requests are served by both Postgres and the old systems.

  4. Monitoring and Ramp Up - Compare the results from the dual reads and fix any data gaps. Slowly ramp up traffic to Postgres until the full cutover.
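
Here’s a rough sketch of how the dual-write and dual-read steps can look at the application layer. The store classes, field names and mismatch logging are hypothetical; the article doesn’t describe Reddit’s actual implementation.

```python
# A rough sketch of the dual-write / dual-read migration pattern. The
# old_store / new_store clients and the mismatch logging are hypothetical
# stand-ins, not Reddit's actual code.
import logging

logger = logging.getLogger("migration")


class DualMetadataStore:
    def __init__(self, old_store, new_store, read_from_new: bool = False):
        self.old_store = old_store    # legacy systems, still the source of truth
        self.new_store = new_store    # Postgres
        self.read_from_new = read_from_new

    def write(self, post_id: int, metadata: dict) -> None:
        # Step 1: dual writes - every new write goes to both systems.
        self.old_store.write(post_id, metadata)
        self.new_store.write(post_id, metadata)

    def read(self, post_id: int) -> dict:
        # Step 3: dual reads - serve the request from both systems.
        old_value = self.old_store.read(post_id)
        new_value = self.new_store.read(post_id)
        if old_value != new_value:
            # Step 4: monitoring - surface data gaps so they can be backfilled.
            logger.warning("metadata mismatch for post_id=%s", post_id)
        # Ramp up: flip this flag (or use a percentage rollout) to cut over.
        return new_value if self.read_from_new else old_value
```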

Scaling Strategies

With that strategy, Reddit was able to successfully migrate over to Postgres.

Currently, they’re seeing peak loads of ~100k reads per second. At that load, the latency numbers they’re seeing with Postgres are

  • 2.6 ms P50 - 50% of requests have a latency lower than 2.6 milliseconds

  • 4.7 ms P90 - 90% of requests have a latency lower than 4.7 milliseconds

  • 17 ms P99 - 99% of requests have a latency lower than 17 milliseconds

They’re able to achieve this without needing a read-through cache.

We’ll talk about some of the strategies they’re using to scale.

Table Partitioning

At the current pace of media content creation, Reddit expects their media metadata to reach roughly 50 terabytes by 2030. This means they need to implement sharding and partition their tables across multiple Postgres instances.

Reddit shards their tables on post_id using range-based partitioning: all posts with a post_id in a certain range are put on the same database shard.

post_id increases monotonically, so the table is effectively partitioned by time period.

Many of their read requests involve batch queries on multiple IDs from the same time period, so this design helps minimize cross-shard joins.

Reddit uses the pg_partman Postgres extension to manage the table partitioning.
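
As a rough illustration of the approach, here’s what range partitioning on post_id could look like with Postgres’s native declarative partitioning, issued through psycopg2. The table, columns and ranges are made up; in practice, pg_partman automates creating and maintaining partition sets like this, and routing ranges of post_id to different Postgres instances is the sharding layer on top.

```python
# A rough sketch of range partitioning on post_id with Postgres's native
# declarative partitioning. Table/column names and ranges are illustrative;
# Reddit automates partition management with the pg_partman extension.
import psycopg2

conn = psycopg2.connect("dbname=media")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Parent table, partitioned by ranges of post_id.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS media_metadata (
            post_id  BIGINT NOT NULL,
            metadata JSONB  NOT NULL,
            PRIMARY KEY (post_id)
        ) PARTITION BY RANGE (post_id);
    """)
    # Each partition covers a contiguous range of post_ids. Since post_id
    # grows monotonically, each partition roughly maps to a time period, so
    # batch reads for posts from the same period hit a single partition.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS media_metadata_p0
            PARTITION OF media_metadata
            FOR VALUES FROM (0) TO (100000000);
    """)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS media_metadata_p1
            PARTITION OF media_metadata
            FOR VALUES FROM (100000000) TO (200000000);
    """)
```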

Denormalization

Another way Reddit minimizes joins is by using denormalization.

They took all the metadata fields required for displaying an image post and put them together into a single JSONB field. Instead of fetching different fields and combining them, they can just fetch that single JSONB field.

This made it much more efficient to fetch all the data needed to render a post.

[Image: all the metadata needed to render an image post]

It also simplified the querying logic, especially across different media types. Instead of worrying about exactly which data fields you need, you just fetch the single JSONB value.
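
Here’s a small sketch of what that read path could look like, assuming a hypothetical media_metadata table with a single JSONB metadata column: one batch query returns everything needed to render each post, with no joins.

```python
# A small sketch of the denormalized JSONB read path. Table, column and field
# names are hypothetical; the point is that one fetch per post returns all the
# metadata needed to render it, with no joins across metadata tables.
import psycopg2

conn = psycopg2.connect("dbname=media")  # hypothetical connection string


def get_post_metadata(post_ids: list[int]) -> dict[int, dict]:
    """Batch-fetch the single denormalized JSONB blob for each post."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT post_id, metadata FROM media_metadata WHERE post_id = ANY(%s)",
            (post_ids,),
        )
        # psycopg2 returns JSONB columns as already-parsed Python dicts.
        return {post_id: metadata for post_id, metadata in cur.fetchall()}


# e.g. {101: {"thumbnail_url": ..., "s3_key": ..., "resolution": ...}, ...}
posts = get_post_metadata([101, 102, 103])
```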

Index is a conference for backend engineers who want to learn about building search, analytics, and AI applications at scale.

Some of the talks this year will cover:

  • How Meta Built FAISS (an extremely popular vector search library) by Matthijs Douze, Research Scientist at Meta AI Research and co-creator of FAISS

  • How DoorDash’s Shopping Recommendation System Works by Sudeep Das, Head of Machine Learning and AI at DoorDash

  • The Tech Behind the Online Data Systems Netflix Uses to Serve the Homepage by Shriya Arora, Engineering Manager at Netflix

  • The Architecture of the Uber Eats Recommendation System by Bo Ling, a Staff Software ML Engineer at Uber

You can join the conference virtually through Zoom or you can attend in-person at the Computer History Museum in Mountain View, CA.

It’ll be a fantastic learning experience if you’re a backend engineer and also a great networking opportunity.

It’s completely free to join!

sponsored

Tech Snippets