How LinkedIn scaled their Distributed Key-Value Store

Hey Everyone!

Today we'll be talking about

  • How LinkedIn scaled their Distributed Key-Value Store

    • The architecture of Venice, LinkedIn's distributed key-value store for derived data

    • Using RocksDB as the storage engine and Fast-Avro for serialization

    • Adding the ability to run computations on Venice servers to reduce network load

  • A Dive into Chaos Engineering (Quastor Pro)

    • The Motivation, Philosophy and Principles behind Chaos Engineering

    • How to Implement Chaos Engineering in your System with AWS FIS, Azure Chaos Studio or open source tools

    • How Facebook, LinkedIn, Audible, Twitch and Target run Chaos Experiments

    • Note - This section is part of Quastor Pro. Here we delve into the system design concepts that are mentioned in past Quastor summaries. You can upgrade to Quastor Pro here. I'd highly recommend using your job's Education / Learning & Development budget. If you don't have one, just shoot me a quick reply and I'll be happy to give you a significant discount.

  • Tech Snippets

    • Comparing Lossless Image Formats

    • Building a WebAssembly-Powered Serverless Platform

    • Programming Languages Endorsed for Server-side Use at Meta

    • Network Troubleshooting Explained

If you find Quastor useful, you should check out Pointer.io

It's a reading club for software developers that sends out super high quality engineering-related content.

It's read by CTOs, engineering managers, and senior engineers so you should definitely sign up if you're on that path (or if you want to go down that path in the future).

It's completely free!

sponsored (but I'm a big fan of Pointer)

How LinkedIn scaled their Distributed Key-Value Store

LinkedIn is a career-focused social media platform with over 830 million users across 200 countries. The company is well known for creating many popular open source tools and projects like Kafka, Pinot (a column-oriented distributed data store), Databus (for change data capture) and more.

Venice is an open source distributed key-value store that was developed at LinkedIn in late 2015. It’s designed to serve read-heavy workloads and has been optimized for serving derived data (data that is created from running computations on other data). Because Venice is serving derived data, all writes are asynchronous from offline sources and the database doesn’t support strongly consistent online writes.

Venice is used heavily for recommendation systems at LinkedIn, where features like People You May Know (where LinkedIn looks through your social graph to find your former colleagues so you can connect with them) are powered by this distributed database.

Venice works very efficiently for single-get and small batch-get requests (where you fetch the values for a small number of keys), but in late 2018 LinkedIn started ramping up a more challenging use case: large batch-get requests.

These large batch-gets requested values for hundreds (or thousands) of keys per request, resulting in a large fanout where Venice had to touch many partitions in the distributed data store. They also returned much larger response payloads.

For the People You May Know feature, the application sends 10,000+ queries per second to Venice, and each request contains 5,000+ keys and results in a ~5 MB response.

Venice’s latency SLA is ~100 milliseconds at p99 (99% of requests should complete in under 100 milliseconds). However, these large batch-get requests were preventing the Venice platform from meeting that target.

Gaojie Liu is a Senior Staff Engineer at LinkedIn, and he wrote a great blog post about optimizations the LinkedIn team implemented with Venice in order to handle these large batch-gets.

Here’s a Summary

High-Level Architecture of Venice

The components in Venice’s read path are:

  • Thin Client - This is a client library that LinkedIn applications (like People You May Know) can use to perform single/multi-key lookups on Venice.

  • Apache Helix - Apache Helix is a cluster management framework that manages partition placement among the Venice servers while taking care of things like fault tolerance, scalability and more.

  • Venice Servers - These servers are assigned partition replicas by Helix and store them on local disk. To store the replicas, they run RocksDB, a highly performant embedded key-value database.

  • Router - This is a stateless component that parses the incoming request, uses Helix to get the locations of relevant partitions, requests data from the corresponding Venice servers, aggregates the responses, and returns the consolidated result to the thin client.

Scaling Issues

As mentioned before, LinkedIn started running large multi-key batch requests on Venice, where each request contained hundreds or thousands of keys. Each request touched many different partitions, resulting in heavy network usage and CPU-intensive work.

As the database scaled horizontally, the fanout would increase proportionally and more partitions would have to be queried per large batch request.
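
To make the fanout concrete, here’s a small Python sketch (the hash function and partition count are made up for illustration; this isn’t Venice’s actual routing code):

```python
from collections import defaultdict
from hashlib import md5

def partition_for(key: str, num_partitions: int) -> int:
    # A stable hash so every router maps a given key to the same partition
    return int(md5(key.encode()).hexdigest(), 16) % num_partitions

def fan_out(keys: list[str], num_partitions: int) -> dict[int, list[str]]:
    # Group the keys in a batch-get by the partition that owns them
    groups = defaultdict(list)
    for key in keys:
        groups[partition_for(key, num_partitions)].append(key)
    return groups

# A 5,000-key batch-get over 1,000 partitions touches nearly every
# partition, so almost every server does work for a single request.
batch = [f"member:{i}" for i in range(5000)]
print(len(fan_out(batch, 1000)))  # ~1000 partitions touched
```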

To fix the scaling issues, LinkedIn implemented a number of optimizations on Venice. We’ll go through some of the optimizations they enacted. Read the full article for more details.

RocksDB

Venice servers are the machines responsible for storing the key-value data on disk, and they rely on an embedded key-value database to do this. LinkedIn originally used the Java Edition of Oracle Berkeley DB (BDB) for this.

After the scaling issues, engineers tested a number of other options and decided to switch to RocksDB.

With RocksDB, garbage collection pause times were no longer an issue because RocksDB is written in C++ (you can read about the Java object lifecycle vs. C++ object lifecycle here).

Additionally, RocksDB has a more modern implementation of the Log-Structured Merge (LSM) tree, the underlying data structure that determines how data is stored on disk. This also helped improve p99 latency by more than 50%.

Due to the characteristics of the data being stored in Venice (derived data), engineers also realized that they could leverage RocksDB's read-only mode for a subset of use cases where the data was static after being computed. This switch doubled throughput and reduced latency for that data.
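
Venice is a Java system and talks to RocksDB through its Java bindings, but the idea translates directly. Here’s a sketch using the python-rocksdb bindings (the path and keys are made up):

```python
import rocksdb

# Ingest the derived data once (e.g., from an offline compute job)
db = rocksdb.DB("derived_data.db", rocksdb.Options(create_if_missing=True))
db.put(b"member:123", b"serialized-embedding-bytes")
del db  # close the writable handle

# Serve the now-static data from a read-only instance
# (read-only instances don't run compactions or take write locks)
ro_db = rocksdb.DB("derived_data.db", rocksdb.Options(), read_only=True)
print(ro_db.get(b"member:123"))
```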

Fast-Avro

Apache Avro is a data serialization framework that was developed for Apache Hadoop. With Avro, you can define your data’s schema in JSON and then write data to a file with that schema using an Avro package for whatever language you’re using.

For example, here’s what defining a schema and writing records looks like with the standard avro package in Python (the Member schema below is purely illustrative, not from LinkedIn’s systems):
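
```python
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# A hypothetical schema, defined in JSON: a record with two fields
schema = avro.schema.parse("""
{
  "type": "record",
  "name": "Member",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
""")

# Write records that conform to the schema into an Avro container file
writer = DataFileWriter(open("members.avro", "wb"), DatumWriter(), schema)
writer.append({"id": 123, "name": "Alyssa"})
writer.append({"id": 456, "name": "Ben"})
writer.close()
```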

In order to make serialization and deserialization faster, engineers adopted Fast-Avro for the Venice platform. Fast-Avro is an alternative implementation of Avro that relies on runtime code generation for serialization and deserialization, rather than the interpreted approach of the stock Avro implementation. You can read about the differences with Fast-Avro here.

Switching to Fast-Avro resulted in a 90% improvement in p99 deserialization latency on the application end.
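
Note that LinkedIn’s Fast-Avro is a Java library. For a feel of the same idea in Python, the speed-focused fastavro package exposes a similar write/read API (reusing the hypothetical Member schema from above):

```python
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "Member",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

# Write a batch of records, then read them back
records = [{"id": 123, "name": "Alyssa"}, {"id": 456, "name": "Ben"}]
with open("members.avro", "wb") as out:
    writer(out, schema, records)

with open("members.avro", "rb") as f:
    for record in reader(f):
        print(record)
```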

Read Compute

Network usage was another scalability issue that LinkedIn engineers faced.

Going back to People You May Know: that feature typically sends 10,000+ queries per second to Venice, and each request results in a ~5 MB response. That adds up to more than 50 gigabytes per second of network usage.

To reduce this network load, engineers added a read compute feature to Venice, where Venice servers can be instructed to run computations on the values being returned and send back the final computed value instead. That value can then be combined with the computed values from the other Venice servers and returned to the client.

Currently, this feature only supports operations like dot-product, cosine-similarity, projection and a few others. However, it has reduced response sizes by 75% for some of the biggest use cases.
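
Here’s a conceptual sketch of those savings (this is not the Venice client API; the helper names and embedding sizes are made up):

```python
def batch_get(store: dict[str, list[float]], keys: list[str]) -> dict[str, list[float]]:
    # Without read compute: ship each full embedding back over the network
    return {k: store[k] for k in keys}

def batch_compute_dot_product(store: dict[str, list[float]],
                              keys: list[str],
                              query: list[float]) -> dict[str, float]:
    # With read compute: the server evaluates the dot-product and
    # returns a single float per key instead of the whole vector
    return {k: sum(a * b for a, b in zip(store[k], query)) for k in keys}

store = {"member:1": [0.1] * 1024, "member:2": [0.2] * 1024}
query = [0.3] * 1024

full_vectors = batch_get(store, ["member:1", "member:2"])                    # 1,024 floats per key
scores = batch_compute_dot_product(store, ["member:1", "member:2"], query)  # 1 float per key
print(scores)
```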

These are just a few examples of optimizations that LinkedIn engineers implemented. There are quite a few others described in the blog post like adding data streaming capabilities, smarter partition replica selection strategies, switching to HTTP/2 and more.

You can read the full blog post here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.


Chaos Engineering - Quastor Pro

This is part of Quastor Pro, where we break down concepts in backend engineering that are frequently mentioned in our Quastor summaries.

In this breakdown, we talk about Chaos Engineering and why it's useful. We dive into the methodology and the principles behind it.

We discuss implementing Chaos Engineering with AWS Fault Injection Service, Azure Chaos Studio, Gremlin and open source tools like Chaos Monkey, kube-monkey and more.

We also go through a couple big tech companies and talk about how they employ Chaos methodologies like Facebook with Project Storm and LinkedIn with Waterbear.

If you subscribe to Quastor Pro, I'd highly recommend you use your job's Education/Learning & Development budget to pay.

If you don't have an Education / L&D budget (or you're a student), then please reply to this email and let me know. I'll be happy to give you a significant discount on the premium version of Quastor.

Thanks for your support.

Tech Snippets

  • Comparing Lossless Image Formats - Lossless compression is where you compress a file without losing the ability to convert it back to the original file. This is an interesting blog post that compares common lossless image formats like PNG, WebP, AVIF and JPEG XL. The author found that the PNG format was outclassed by more modern lossless formats like WebP and JPEG XL.

  • Building a WebAssembly-powered serverless platform - WebAssembly is a secure, fast and portable way to run code written in many different higher-level programming languages. Although initially meant as a way to run more intensive apps in the browser, WebAssembly is being used for far, far more. This is an interesting blog post on using WebAssembly to build a serverless platform. It’s written in Rust with the actix-web framework and runs on the Wasmtime WebAssembly runtime.

  • Programming languages endorsed for server-side use at Meta - At Facebook, a language is supported if the company has built out all the developer experience infrastructure around code editing, debugging, deployment, interoperability, etc. Languages supported on the server-side are Hack, C++, Rust and Python. This is an interesting read from Meta Engineering on how they arrived at those languages and what they use each language for.

  • Network Troubleshooting Explained (It’s not always DNS) - This is a great blog post that goes through the layers of the TCP/IP model and explains how to debug network issues in each layer. The post gives a list of CLI commands that’ll be useful when you’re trying to figure out what the issue is.

Interview Question

You are given a list of strings called words and a string pattern.

Return a list of the strings inside words that match pattern.

A word matches the pattern if there is a one-to-one mapping (a bijection) between the letters in the word and the characters in the pattern. For example, with words = ["abc", "deq", "mee", "aqq"] and pattern = "abb", the answer is ["mee", "aqq"].

Previous Question

As a reminder, here’s our last question

You are given a positive integer as input.

Print the next smallest and next largest numbers that have the same number of 1 bits in their binary representation.

Solution

We can solve this question using Bit Manipulation.

We’ll start with finding the next largest number with the same number of 1 bits.

Remember, we want the next largest number, not just any larger number with the same number of 1 bits. We’ll want the smallest number that is bigger than our input (that also has the same number of 1 bits).

Therefore, we’ll have to flip 2 bits in our integer input.

We’ll have to flip a 0 bit to a 1 in order to make our number larger. We’ll call this bit b1.

Then, we’ll have to flip a different 1 bit to a 0 in order to keep the number of 1 bits the same. We’ll call this bit b2.

b1 must be a more significant bit than b2 (b1 must be to the left of b2), otherwise our new number will be smaller than the original.

Because we want to make our number bigger, but also the next biggest, b1 will be the right-most non-trailing zero.

We want the zero to be the right-most since we want it to be the smallest bigger number with the same number of 1 bits.

We want the zero to be non-trailing (it should have at least one 1 bit to the right of it) because we’ll have to flip a less significant bit from 1 to 0 (that bit will be b2) in order to have the same number of 1 bits as the input integer.

So, now we know that b1 will be the right-most, non-trailing zero.

We’ll flip the bit at b1 from a 0 to a 1.

Now, we have to

  1. Flip a bit from 1 to 0 (b2 bit)

  2. Select the b2 bit so that our number is the next largest (and not larger)

In order to do this, we’ll first count the number of 0s and 1s that are to the right of our b1 bit.

Let’s say there are a 0s and b 1s to the right of b1.

Then, we’ll add one additional 0 (a = a + 1) and take away one of the 1s (b = b - 1) in order to go back to the original number of 1 bits in the input.

Now, we’ll place all the a 0s to the left of the b 1s when we arrange the bits to the right of b1. This guarantees that we have the next largest integer.

We can reverse the process to find the next smallest integer with the same number of 1 bits.
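
To make the procedure concrete, here’s a minimal Python implementation of both directions (the function names are ours for illustration):

```python
def next_largest(n: int) -> int:
    """Smallest integer greater than n with the same number of 1 bits (n > 0)."""
    c, c0, c1 = n, 0, 0
    while c & 1 == 0:          # count the trailing 0s
        c0 += 1
        c >>= 1
    while c & 1 == 1:          # count the 1s immediately left of the trailing 0s
        c1 += 1
        c >>= 1
    p = c0 + c1                # position of b1: the right-most non-trailing 0
    n |= 1 << p                # flip b1 from 0 to 1
    n &= ~((1 << p) - 1)       # clear every bit to the right of b1
    n |= (1 << (c1 - 1)) - 1   # put the remaining (c1 - 1) 1s at the far right
    return n

def next_smallest(n: int) -> int:
    """Largest integer smaller than n with the same number of 1 bits, or -1 if none."""
    c, c1 = n, 0
    while c & 1 == 1:          # count the trailing 1s
        c1 += 1
        c >>= 1
    if c == 0:                 # n looks like 0...0111, so no smaller number works
        return -1
    c0 = 0
    while c & 1 == 0:          # count the 0s immediately left of the trailing 1s
        c0 += 1
        c >>= 1
    p = c0 + c1                # position of the right-most non-trailing 1
    n &= ~0 << (p + 1)         # clear bit p and everything to its right
    mask = (1 << (c1 + 1)) - 1 # a block of (c1 + 1) 1s...
    n |= mask << (c0 - 1)      # ...placed as far left as possible below p
    return n

print(bin(next_largest(0b1011)))   # 0b1101 (11 -> 13)
print(bin(next_smallest(0b1101)))  # 0b1011 (13 -> 11)
```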