How Pinterest Stores and Transfers Hundreds of Terabytes of Data Daily
We'll talk about Change Data Capture and the internal system Pinterest built to handle CDC for all their databases. Plus, a practical guide on how you can improve your coding skills with cognitive psychology, a visual guide to memory allocation and more.
Hey Everyone!
Today we’ll be talking about
How Pinterest Stores and Transfers Hundreds of Terabytes of Data Daily
Introduction to Change Data Capture
The Design Goals for CDC at Pinterest
Architecture of Pinterest’s CDC System
Tech Snippets
The Programmer’s Brain
A Curated List of Cryptography Resources and Links
Memory Allocation Visualized
Getting Things Done in a Chaotic Environment
One of the hardest decisions you’ll have to make is around what technologies your team adopts. A wrong decision can be extremely costly and take years to reverse. On the other hand, not making a decision can be just as costly (lost revenue, poor developer productivity, etc.)
Product for Engineers wrote a fantastic blog post on their advice for choosing technologies to adopt.
Some of their tips include
Prioritize based on set Criteria - There will always be some shiny new toy that your team can adopt. Instead, prioritize based on problems your team is facing. This can be excessive costs, scaling challenges, or a customer need.
Mimic the Real World when Evaluating - The engineers who will be using the technology should have significant sway in the decision. They should be able to test the technology in production (safely) and build proof of concepts before deciding.
Ensure you consider technical AND business factors - You should talk to all stakeholders and clarify what the set of evaluation criteria are. Some potential criteria include performance, cost, reliability, support, flexibility and more.
Subscribe to Product for Engineers for the rest of their tips on picking technologies. It’s free!
sponsored
How Pinterest built a Change Data Capture System
Pinterest is a social media platform that lets users share images/links for things they’re interested in. You can use the site to get low-carb pasta recipes, find destinations for a wedding or get inspiration for a DIY project (and learn that DIY projects are way harder than they look).
Pinterest first launched in 2010 and they’ve scaled to hundreds of millions of users with billions of monthly visits to the platform.
As you might imagine, Pinterest handles a massive amount of data. Real-time data processing is crucial for delivering personalized recommendations, detecting fraudulent accounts, reporting results to advertisers and more.
Change Data Capture (CDC) is a critical tool that Pinterest uses to power their data infrastructure. The engineering team published a fantastic blog post talking about how they built a generic CDC system for all the online databases at Pinterest. This system handles millions of database queries per second and transfers hundreds of terabytes per day.
Introduction to Change Data Capture (CDC)
Change Data Capture is a set of software design patterns that lets you track database changes in real-time. It captures data modifications like inserts, updates and deletes as they occur. Typically, you’ll set up CDC on your OLTP database (Postgres, MySQL, MongoDB, etc.) and transfer the modifications over to your analytics platform, audit/compliance system, data warehouse, etc.
This ensures that all your systems are kept up-to-date with the most recent data (compared to running nightly batch jobs for syncing changes). It can also reduce the load on your OLTP database since you’re only transferring the changes instead of doing full data loads.
All the commonly used database systems offer a mechanism for tracking database changes.
Examples are
Postgres - provides a write-ahead log (WAL) where every change to the database is written to the WAL first. Plugins (like pgoutput) can decode the WAL entries into JSON for a CDC tool to consume.
MySQL - provides a binary log (binlog) that records all data modifications. CDC tools can tap into the binlog to capture changes in real-time.
MongoDB - uses an operations log (oplog) to store a rolling record of all write operations. CDC systems can tail the oplog to capture changes. MongoDB also provides change streams that you can subscribe to for real-time data changes (change streams are built on top of the oplog).
DynamoDB - offers DynamoDB streams for a time-ordered sequence of records that captures all the data modifications in your tables.
Microsoft SQL Server - provides a built-in feature called Change Data capture (CDC) that captures insert, update and delete activity and makes the details available in an easily consumed relational format.
Couchbase - uses the Database Change Protocol (DCP) to stream any mutations that happen within the database. Applications can connect to DCP and get a real-time feed of changes.
Cassandra - provides a feature called CDC on Apache Cassandra that lets you capture changes on a per-table basis and write them to the local filesystem.
Debezium is the most popular open source platform for CDC and it supports databases like Postgres, MySQL, MongoDB and more.
Architecture of Pinterest’s Generic CDC System
Before Pinterest’s Generic CDC System, individual teams at the company were building their own CDC solutions. This led to issues around wasted engineering-time, unclear ownership, reliability issues and more.
The Pinterest team decided to solve this by building a Generic CDC solution based on Debezium.
Their goals were
Distributed - Pinterest has many distributed databases with some having 10,000+ shards. The CDC system should connect to all these shards and transfer the changes.
Reliability - data should be transferred reliably with a guarantee of at least once processing.
Scalability - the CDC system should scale to hundreds of terabytes of throughput. Databases at Pinterest receive millions of queries per second.
Configurability - the CDC system should allow configurability around connectors, failure recovery, load balancing and more.
Here’s the architecture of the system they built
Here’s the components
Control Plane - manages the configuration and the coordination of the CDC system. It handles things like creating new connectors for new database shards, fixing failed connectors, updating configuration of existing connectors, etc. It runs on a single host and executes its logic on a scheduled basis (typically every minute).
Data Plane - the machines in the data plane are responsible for capturing changes from the Pinterest database shards and sending them to Apache Kafka. Each host runs Kafka Connect and runs multiple Debezium connectors (with each connector responsible for a single database shard).
Kafka - Kafka is used as the message broker to store and transport the change events. The actual CDC data is stored in preconfigured topics. Consumers can subscribe to these topics for updates.
Product for Engineers is a fantastic newsletter by PostHog that helps developers learn how to find product-market fit and build apps that users love.
A/B testing and experimentation are crucial for building a feature roadmap, improving conversion rates and accelerating growth. However, many engineers don’t understand the ins-and-outs of how to run these tests effectively (they just leave it to the data scientists).
This edition of Product for Engineers delves into A/B tests and discusses
The 5 traits of good A/B tests
How to think about statistical significance and p-values
Avoiding false positives
And more.
To hone your product skills and read more articles like this, check out Product for Engineers below.
sponsored