How Pinterest Load Tests Their Database

Pinterest uses Apache Druid for Ad Reporting metrics. Here's how they load test Druid for the holidays.

Hey Everyone!

Today we’ll be talking about

  • How Pinterest Load Tests Their Database

    • Pinterest uses Apache Druid as a database to store and serve their ad reporting metrics. Druid is an open source, column-oriented, distributed database that's written in Java.

    • Architecture of Druid and the structure of its components and services.

    • How Pinterest tests Druid's ability to handle increased queries per second, higher data ingestion and larger storage requirements.

  • Tech Snippets

    • A Curated List of Competitive Programming Resources

    • Master Technologies By Recreating Them from Scratch

    • Breaking Down The PIP from Tech Career Growth

    • Learnings from 5 years of Tech Startup Code Audits

    • A Staff Engineer's Guide to Career Development

    • Design Docs at Google


How Pinterest Load Tests Their Database

Pinterest is a social media service with over 430 million monthly active users. The company relies on advertising for revenue, showing users promoted posts that are served based on a real-time ad auction.

In order to store and serve all of their reporting metrics, Pinterest relies on Apache Druid - an open source, column-oriented, distributed data store that’s written in Java.

Druid is commonly used for OLAP (analytics) workloads; it's designed to ingest massive amounts of event data and then serve low-latency analytics queries on the ingested data. Druid is also used at Netflix, Twitter, Walmart, Airbnb and many other companies.

To get an idea of how Druid works, you can split its architecture into three components: the query nodes, data nodes and deep storage.

Deep storage is where the company stores all their data permanently, in a system like AWS S3 or HDFS.

Druid connects with the company’s deep storage and indexes the company’s data into Druid data nodes for fast analytical queries. New data can also be ingested through the data nodes and Druid will then write it to deep storage.

In order to make analytical queries performant, the data in Druid's data nodes is stored in a columnar format. You can read more about this here.

Clients can send their queries (in SQL or JSON) to Druid through the query nodes.
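To make the query path concrete, here's a minimal sketch of sending a SQL query to a Druid broker over HTTP. The broker address, table name and column names are all hypothetical; the `/druid/v2/sql` endpoint is Druid's standard SQL API.

```python
import json
import urllib.request

# Hypothetical broker address -- in a real deployment this would be a
# query-node (broker) host, typically behind a load balancer.
BROKER_URL = "http://localhost:8082/druid/v2/sql"

def build_sql_request(sql: str) -> urllib.request.Request:
    """Package a Druid SQL query as an HTTP POST against the broker."""
    payload = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        BROKER_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

# Example: a daily ad-impressions rollup (table/column names are made up)
req = build_sql_request(
    "SELECT TIME_FLOOR(__time, 'P1D') AS day, SUM(impressions) AS imps "
    "FROM ad_metrics GROUP BY 1"
)
# urllib.request.urlopen(req) would execute it against a live cluster
```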

These components are broken down into different services that can be configured and scaled independently.

During the holiday months, Pinterest typically gets a big spike in traffic. In order to deal with this, teams at the company perform extensive load testing in the months prior.

Engineers on the Real-Time Analytics team at Pinterest wrote a great blog post on the process they go through for load testing Druid.

Here’s a summary.

When load testing Druid, engineers are looking to verify several areas:

  1. Queries - The service should be able to handle the expected increase in queries per second and do so within the latency requirements specified in the service level agreement (SLA).

  2. Ingestion - The real-time ingestion capabilities should be able to handle the increase in data. Druid should be able to take in all the data and write it to the data nodes and deep storage with low ingestion lag and a low number of failed writes.

  3. Data Size - The storage system should have sufficient capacity to handle the increased data volume.
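The three checks above can be sketched as a simple pass/fail evaluation. The threshold values here are invented placeholders, not Pinterest's actual SLA numbers:

```python
# Toy pass/fail evaluation over the three load-test areas.
# All thresholds are made-up placeholders, not real SLA values.
def evaluate_load_test(p99_latency_ms: float,
                       ingestion_lag_s: float,
                       failed_write_ratio: float,
                       projected_data_tb: float,
                       capacity_tb: float) -> dict:
    return {
        # 1. Queries: stay within the SLA latency budget
        "queries_ok": p99_latency_ms <= 500,
        # 2. Ingestion: low lag and few rejected writes
        "ingestion_ok": ingestion_lag_s <= 30 and failed_write_ratio <= 0.001,
        # 3. Data size: projected volume fits in available storage
        "data_size_ok": projected_data_tb <= capacity_tb,
    }

result = evaluate_load_test(420, 12, 0.0002, 80, 120)
```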

We’ll go through each of these and talk about how Pinterest tests them.

Setting up the Testing Environment

When load testing their Druid system, Pinterest can either do so with generated queries or with real production queries.

With generated queries, queries are created based on the current data set in Druid. This is fairly simple to run and does not require any preparation. However, it may not accurately show how the system will behave in production scenarios since the generated queries might not be representative of a real world workload (in terms of which data is accessed, query types, edge cases).

Another option is to capture real production queries and re-run these queries during testing. This is more involved as queries need to be captured and then updated for the changes in the dataset/timeframe. However, this is more reflective of what Druid will experience.

Pinterest moved ahead with using real production queries and implemented query capture using Druid’s request-logging feature, which automatically logs every query sent to a Druid broker host (you send your query to a query server, which contains a broker host).
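A query-capture step might look like the sketch below: scan the broker's request logs and pull out the query text for later replay. This assumes each log line is a JSON object with a `sql` field, which is only one of the shapes Druid's request logger can be configured to emit; adjust for your own logging config.

```python
import json

def extract_queries(log_lines):
    """Pull query text out of broker request-log lines.

    Assumes each line is a JSON object with a "sql" field -- an
    illustrative format, not the only one Druid can emit.
    """
    queries = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (startup banners, etc.)
        if "sql" in record:
            queries.append(record["sql"])
    return queries

sample_log = [
    '{"timestamp": "2022-11-01T00:00:00Z", "sql": "SELECT COUNT(*) FROM ad_metrics"}',
    'not-a-json-line',
]
captured = extract_queries(sample_log)
```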

Engineers don’t conduct testing on the production environment, as that could adversely affect users. Instead, they create a test environment that’s as close to production as possible.

They replicate the Druid setup of brokers, coordinators, and more and also make sure to use the same host machine types, configurations, pool size, etc.

Druid relies on an external database for metadata storage (data on configuration, audit, usage information, etc.) and it supports Derby, MySQL and Postgres. Pinterest uses MySQL.

Therefore, they use a MySQL dump to create a copy of all the metadata stored in the production environment and add that to a MySQL instance in the test environment.

They spin up data nodes in the test environment that read from deep storage and index data from the past few weeks/months.

[Figure: the test environment is set up with the same configuration and host types as the prod environment. Historical nodes and broker hosts are mirrored in the test environment, while deep storage is shared between both environments.]

The testing service loads historical production queries from the log files in the production environment and sends them to the brokers in the test environment for execution. They ramp up the queries per second to test what they expect for holiday traffic.
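The ramp-up can be sketched as a schedule of (queries-per-second, duration) stages that a replay loop walks through. The `send` function stands in for whatever actually executes a query against the test brokers; the pacing here is deliberately naive (real load-test tools use worker pools), and all numbers are illustrative.

```python
import time

def ramp_schedule(start_qps, peak_qps, step_qps, seconds_per_step):
    """Yield (qps, duration) stages ramping from start_qps up to peak_qps."""
    qps = start_qps
    while qps <= peak_qps:
        yield qps, seconds_per_step
        qps += step_qps

def replay(queries, schedule, send):
    """Replay captured queries at each stage's target rate.

    `send` is a stand-in for whatever executes a query against the
    test brokers (hypothetical here).
    """
    i = 0
    for qps, duration in schedule:
        deadline = time.monotonic() + duration
        while time.monotonic() < deadline:
            send(queries[i % len(queries)])  # cycle through captured queries
            i += 1
            time.sleep(1.0 / qps)  # naive pacing toward the target rate

# e.g. ramp from 100 QPS to an expected holiday peak of 400 QPS
stages = list(ramp_schedule(100, 400, 100, 60))
```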

Evaluating Query Load

Pinterest runs the real production queries on the test environment and looks at several metrics, such as:

  • 99th percentile latency of the queries

  • CPU usage of the brokers

  • Peak queries per second

Using these tests, they can find and address bottlenecks in Druid around the query and data services and adjust how much compute is dedicated for these components.

Some changes can be done quickly while others can take hours. Increasing the number of machines in the query services can be done quickly, whereas increasing the number of data replicas takes time since data needs to be indexed and loaded from deep storage.
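For reference, the 99th-percentile latency mentioned above can be computed from a batch of query timings with a simple nearest-rank percentile, as sketched here (the latency samples are made up):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile -- e.g. pct=99 gives the p99 latency."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Illustrative query latencies in milliseconds
latencies_ms = [12, 15, 18, 22, 25, 30, 45, 60, 120, 900]
p99 = percentile(latencies_ms, 99)
```

Note how a single slow outlier dominates the p99 even when the median is low, which is why tail latency (rather than the average) is what SLAs typically constrain.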

Handling Increase in Data Ingestion

Testing Data Ingestion is quite similar to testing queries per second. Pinterest sets up a test environment with the same capacity, configuration, etc. as the production environment.

The main difference is that the Real-Time Analytics team now needs some help from client teams who generate the ingested data to also send additional events that mimic production traffic.

When reviewing ingestion traffic, the Pinterest team looks at

  • Ingestion lag

  • Number of successful/rejected events

  • General system health

And more.

They also make sure to validate the ingested data and make sure it’s being written correctly.
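An ingestion check along these lines could compare what the client teams sent against what Druid actually ingested, per topic. The function names, metric shapes and numbers below are all illustrative, not Druid's actual metrics API:

```python
def ingestion_report(sent_events, ingested_counts, lag_seconds):
    """Compare events sent by client teams with events Druid ingested.

    `sent_events` / `ingested_counts` are per-topic counts and
    `lag_seconds` is per-topic lag samples -- illustrative shapes only.
    """
    report = {}
    for topic, sent in sent_events.items():
        ingested = ingested_counts.get(topic, 0)
        report[topic] = {
            "rejected": sent - ingested,                      # failed/dropped events
            "success_rate": ingested / sent if sent else 1.0,
            "max_lag_s": max(lag_seconds.get(topic, [0])),    # worst ingestion lag
        }
    return report

report = ingestion_report(
    {"ad_events": 1_000_000},
    {"ad_events": 999_400},
    {"ad_events": [2, 5, 9]},
)
```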

Handling Increase in Data Volume

Evaluating if the system can handle the increase in data volume is the simplest and quickest check.

For this, they look at the Druid Web Console, where they can see all the data nodes and current capacity. They estimate the amount of additional data that will be stored over the holiday period and adjust for that.
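This kind of capacity check is back-of-the-envelope arithmetic: project the extra data the holiday spike will add and compare it against what the Web Console reports as available. Every number below is invented for illustration:

```python
def holiday_headroom(current_used_tb, daily_growth_tb, holiday_days,
                     traffic_multiplier, total_capacity_tb):
    """Back-of-the-envelope check: will the data nodes run out of disk?

    `traffic_multiplier` models the holiday spike in ingested data;
    all inputs here are made-up illustrative figures.
    """
    projected = current_used_tb + daily_growth_tb * traffic_multiplier * holiday_days
    return {
        "projected_tb": projected,
        "fits": projected <= total_capacity_tb,
        "spare_tb": total_capacity_tb - projected,
    }

# 60 TB used today, 0.5 TB/day normal growth, 45-day holiday window,
# 1.8x traffic spike, 120 TB total capacity
estimate = holiday_headroom(60, 0.5, 45, 1.8, 120)
```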

Results

From the testing, Pinterest found that they were able to handle the additional traffic expected during the holiday period. They saw that the broker pool may need additional hosts if traffic meets a certain threshold, so they made a note that the pool size may need to be increased.

For more details, you can read the full blog post here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.


It’s About Time! Use InfluxDB’s Time Series Database As A Service

InfluxDB is a developer favorite and the #1 time series database according to DB-Engines.

With serverless InfluxDB Cloud, you can build software on InfluxDB without provisioning infrastructure and managing clusters, and easily collect metrics with Telegraf or pre-configured templates.

Try it for free on AWS, Azure, or GCP!

sponsored

Tech Snippets

  • A curated list of Competitive Programming resources - Competitive Programming is a fantastic hobby to take up if you want to hone your data structures, algorithms & problem solving skills. This is a great github repo with all the resources you’ll need to get started with competitive programming like what books/courses to read, sites for practicing, algorithm implementations and more.

  • Master tech by recreating it from scratch - A great way to deeply understand a technology is to build your own toy version from scratch. This is a fantastic repo that collects blog posts where someone builds something and writes about the process. You can see the process of building a bittorrent client, virtual machine, a shell and more.

  • Breaking Down the PIP (Performance Improvement Plan) - A Performance Improvement Plan (PIP) is a formal document that tech companies use to tell employees that they're not meeting performance expectations. Rahul Pandey was a manager at Meta and he runs the Tech Career Growth community. He made an awesome video on what PIPs are, what they mean, and what you should do if you get PIP'd.

  • Learnings from 5 years of tech startup code audits - Ken Kantzer is a VP of Engineering and was previously the Head of Security at FiscalNote. He wrote a great blog post on some lessons he’s learned while conducting security audits for early stage startups. He’s found fuzz testing to be surprisingly effective and also saw how many startups weren’t properly authenticating incoming requests with webhooks.

  • A Staff Engineer’s Guide to Career Development - Andrew Hao is a Staff Software Engineer at Lyft. He wrote a fantastic blog post on how senior engineers can get promoted to the staff level and things you should be doing as a developer to advance your career. He talks about building a network/influence at your company, finding the right problems to tackle and how to lead.

  • Design Docs at Google - One key part of Google’s culture is the use of Design Docs for defining software design. The engineers in charge of a project/application will first write a design doc that outlines the context and scope of the new system, goals and non-goals, the architecture, API, data storage, and more. This is a great blog post that goes into detail on how design docs are written and what is covered inside one.

Interview Question

You are given an N x N matrix that contains positive and negative integers.

Write a function that finds the submatrix with the largest possible sum.

Previous Question

As a reminder, here’s our last question

Write a function to calculate pow(x, n), which is x raised to the power n.

Solution

This question requires a bit of math knowledge.

You should have a solid understanding on how exponents work.

To solve this, we use the property that x^n = x^(n/2) * x^(n/2) when n is even.

Using this property, we can reason that if n is even, we calculate pow(x, n/2) and then return the square of that (finding the square is an O(1) operation).

If n is odd, then we can calculate pow(x, n - 1) and then return that times x. Remember that n - 1 will always be even if n is odd.

We keep repeating this until our base case, where n == 0.

This makes our time complexity logarithmic: every time n is even we halve it, and every time n is odd we do a constant amount of work before halving on the next step.

Modifying our function to work for negative exponents is trivial: we can recursively call our function with the reciprocal of x and the negated exponent.
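Putting the whole approach together, a direct implementation of the recursion described above looks like this:

```python
def fast_pow(x: float, n: int) -> float:
    """Compute x**n in O(log n) multiplications by halving the exponent."""
    if n < 0:
        return fast_pow(1 / x, -n)   # negative exponent: reciprocal, negate n
    if n == 0:
        return 1.0                   # base case: x**0 == 1
    if n % 2 == 0:
        half = fast_pow(x, n // 2)   # even: compute the half-power once...
        return half * half           # ...and square it (O(1))
    return x * fast_pow(x, n - 1)    # odd: peel off one factor; n-1 is even
```

Note that the even branch computes `fast_pow(x, n // 2)` once and squares the result; calling it twice would bring the runtime back up to O(n).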