How Airbnb Built a Low Latency, Highly Scalable Key Value Store

Plus, tips on building a culture around automated testing and strategies Spotify uses for large-scale migrations.

Hey Everyone!

Today we’ll be talking about

  • The Architecture of Airbnb's Distributed Key Value Store

    • Airbnb needed to create a highly scalable key value store that they could use to serve data for their ML applications.

    • They built Mussel, a distributed key-value store that combines parts of HBase with Kafka and Apache Helix.

    • We talk about how they built this system and how they dealt with issues around their storage engine.

  • Strategies for Performing a Large Scale Migration (from Spotify)

    • Engineering leaders at Spotify wrote a great blog post on the do's and don'ts of large-scale migrations.

    • You should clearly define your goals and explain what you're doing/why you're doing it to stakeholders

    • Keep stakeholders in the loop and dedicate time to help them migrate

    • Use dashboards to keep track of progress and keep the migration timeline updated with any delays/postponements to increase transparency.

  • Tips on Building a Culture Around Automated Testing (from Booking.com)

    • Don't just rely on the vanity metric of test coverage

    • Identify areas in the codebase that change often and make sure they're well tested

    • Understand the lifespan of your codebase and incorporate that into your testing

    • Test edge cases and failure scenarios to ensure test completeness

  • Tech Snippets

    • Techniques to Improve Reliability of Large Language Models

    • Minimizing the Downsides of Dynamic Programming Languages

    • Building a Distributed System with NodeJS

An Introduction to Apache Parquet

Apache Parquet is an open source, column-oriented data storage format that was developed at Twitter and Cloudera. It's very popular for big data processing and widely used in the Hadoop ecosystem.

It’s extremely efficient compared to just using a CSV to store/process your data.

Databricks ran some benchmarks where they converted a 1 terabyte CSV dataset to Parquet and found:

  • File size was reduced from 1 terabyte to 130 gigabytes - 87% reduction

  • Query run time was reduced from 236 seconds to 7 seconds - 34 times faster

  • Overall expenses (storage + query costs) dropped 99.7%

Parquet is a column-oriented storage format, which makes it highly efficient for data compression and extremely fast for analytical queries.
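
If you want to try this yourself, here's a minimal sketch of converting a CSV to Parquet with pandas and PyArrow. The file and column names are just placeholders, not anything from the Databricks benchmark.

    # Minimal sketch: convert a CSV to Parquet with pandas + PyArrow.
    # File names and column names are placeholders.
    import pandas as pd

    df = pd.read_csv("listings.csv")            # row-oriented text input
    df.to_parquet("listings.parquet",           # columnar, compressed output
                  engine="pyarrow",
                  compression="snappy")

    # Analytical queries only read the columns they need,
    # which is where most of the speedup comes from.
    cols = pd.read_parquet("listings.parquet", columns=["city", "price"])
    print(cols.groupby("city")["price"].mean())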

To learn about how Parquet works, you can read this full technical breakdown.

sponsored

The Architecture of Airbnb's Distributed Key Value Store

At Airbnb, many services rely on derived data for things like personalization and recommendations. Derived data is data generated from other data, either through large-scale processing engines like Spark or from streaming events consumed from message brokers like Kafka.

An example is the user profiler service, which uses real-time and historical derived data of users on Airbnb to provide a more personalized experience to people browsing the app.

In order to serve live traffic with derived data, Airbnb needed a storage system that provides strong guarantees around reliability, availability, scalability, latency and more.

Shouyan Guo is a senior software engineer at Airbnb and he wrote a great blog post on how they built Mussel, a highly scalable, low latency key-value store for serving this derived data. He talks about the tradeoffs they made and how they relied on open source tech like ZooKeeper, Kafka, Spark, Helix and HRegion.

Here’s a summary

Prior to 2015, Airbnb didn’t have a unified key-value store solution for serving derived data. Instead, teams were building out their own custom solutions based on their needs.

These use cases differ slightly on a team-by-team basis, but broadly, the engineers needed the key-value store to meet 4 requirements.

  1. Scale - store petabytes of data

  2. Bulk and Real-time Loading - efficiently load data in batches and in real time

  3. Low Latency - reads should be < 50ms p99 (99% of reads should take less than 50 milliseconds)

  4. Multi-tenant - multiple teams at Airbnb should be able to use the datastore

None of the existing solutions Airbnb was using met all 4 of those criteria. They faced efficiency and reliability issues with bulk loading into HBase, and RocksDB didn't scale because it has no built-in horizontal sharding.

Between 2015 and 2019, Airbnb went through several iterations of different solutions to solve this issue for their teams. You can check out the full blog post to read about their solutions with HFileService and Nebula (built on DynamoDB and S3).

Eventually, they settled on Mussel, a scalable and low latency key-value storage engine.

Architecture of Mussel

Mussel supports reads and writes for both real-time and batch-updated data. It's built with Apache Spark, HRegion, Helix, Kafka and more.

Apache Helix is a cluster management framework for coordinating the partitioned data and services in your distributed system. It provides tools for things like

  • Automated failover

  • Data reconciliation

  • Resource Allocation

  • Monitoring and Management

And more.

In Mussel, the key-value data is partitioned into 1024 logical shards, with multiple logical shards living on a single storage node (machine). Apache Helix manages these shards, the storage nodes and the mappings between the two.
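
The blog post doesn't spell out the exact partitioning function, but the general pattern is to hash a key to one of the 1024 logical shards and then look up which storage nodes host that shard. Here's a rough, hypothetical sketch of that idea; the hash choice and the shard-to-node table are illustrative, not Airbnb's actual code.

    # Rough illustration only: map a key to one of 1024 logical shards,
    # then look up which storage nodes host that shard.
    import hashlib

    NUM_SHARDS = 1024

    def shard_for_key(key: str) -> int:
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % NUM_SHARDS

    # In Mussel this mapping is managed by Apache Helix; here it's just a dict.
    shard_to_nodes = {0: ["node-a", "node-b"], 1: ["node-c", "node-a"]}

    shard = shard_for_key("user:12345")
    replicas = shard_to_nodes.get(shard, [])   # any replica can serve reads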

On each storage node, Mussel runs HRegion, a fully functional key-value store.

HRegion is part of Apache HBase; HBase is a very popular open source, distributed NoSQL database modeled after Google Bigtable. In HBase, your table is split up and stored across multiple nodes in order to be horizontally scalable.

These nodes are HRegion servers, and they're managed by HMaster (the HBase Master). HBase uses a leader-follower setup: the HMaster node is the leader and the HRegion servers are the followers.

Airbnb adopted HRegion from HBase to serve as their storage engine for the Mussel storage nodes. Internally, HRegion works based on Log Structured Merge Trees (LSM Trees), a data structure that’s optimized for handling a high write load.

LSM Trees are extremely popular with NoSQL databases like Apache Cassandra, MongoDB (WiredTiger storage engine), InfluxDB and more.
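
To give a feel for the write path in an LSM tree, here's a toy sketch: writes go into an in-memory memtable and get flushed to immutable sorted segments once the memtable grows large enough. This is purely illustrative and not how HRegion is actually implemented.

    # Toy LSM-tree write path: buffer writes in an in-memory memtable,
    # flush to an immutable sorted segment once it passes a threshold.
    class TinyLSM:
        def __init__(self, flush_threshold=4):
            self.memtable = {}             # in-memory buffer of recent writes
            self.segments = []             # flushed, sorted, immutable runs
            self.flush_threshold = flush_threshold

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.flush_threshold:
                self._flush()

        def get(self, key):
            if key in self.memtable:       # newest data first
                return self.memtable[key]
            for segment in reversed(self.segments):   # newest segment to oldest
                as_dict = dict(segment)
                if key in as_dict:
                    return as_dict[key]
            return None

        def _flush(self):
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}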

Read and Write Requests

Mussel is a read-heavy store, so they replicate logical shards across different storage nodes to scale reads. Any of the Mussel storage nodes that have the shard can serve read requests.

For write requests, Mussel uses Apache Kafka as a write-ahead log. Instead of going directly to the Mussel storage nodes, writes first go to Kafka and are applied to the nodes asynchronously.

Mussel storage nodes will poll the events from Kafka and then apply the changes to their local stores.

Because all the storage nodes are polling writes from the same source (Kafka), they will always have the correct write ordering (Kafka guarantees ordering within a partition). However, the system can only provide eventual consistency due to the replication lag between Kafka and the storage nodes.

Airbnb decided this was an acceptable tradeoff given the use case being derived data.
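
Here's a rough sketch of what that polling pattern could look like with the kafka-python client. The topic name, consumer group and local store are hypothetical placeholders, not Airbnb's actual setup.

    # Rough sketch of a storage node tailing a Kafka topic as a write-ahead log
    # and applying each write to its local store. Topic name, group id, and
    # local_store are hypothetical placeholders.
    import json
    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(
        "mussel-writes",                         # hypothetical WAL topic
        bootstrap_servers=["kafka:9092"],
        group_id="storage-node-42",
        enable_auto_commit=False,
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )

    local_store = {}                             # stand-in for HRegion on this node

    for record in consumer:                      # Kafka preserves order per partition
        event = record.value
        if event.get("deleted"):
            local_store.pop(event["key"], None)  # deletes are applied as removals
        else:
            local_store[event["key"]] = event["value"]
        consumer.commit()                        # ack only after the write is applied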

Batch and Real Time Data

Mussel supports both real-time and batch updates. Real-time updates are done through Kafka as described above.

Batch updates are done with Apache Airflow jobs. Airbnb uses Spark to transform data from the data warehouse into the proper file format and then uploads it to AWS S3.

Then, each Mussel storage node downloads the files from S3 and uses HRegion's bulkLoadHFiles API to load them into the node.
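
As a rough illustration, here's what a skeletal Airflow DAG for that batch path might look like. The DAG id and the three task callables are hypothetical placeholders; the real pipeline ends with HRegion's bulkLoadHFiles API on each storage node.

    # Skeletal Airflow DAG for the batch path: Spark transform -> upload to S3 ->
    # trigger the bulk load on storage nodes. All names are hypothetical.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_spark_transform(**_):
        ...  # transform warehouse tables into load-ready files (placeholder)

    def upload_to_s3(**_):
        ...  # push the generated files to an S3 bucket (placeholder)

    def trigger_bulk_load(**_):
        ...  # ask each storage node to download and bulk load the files (placeholder)

    with DAG("mussel_batch_load",
             start_date=datetime(2023, 1, 1),
             schedule_interval="@daily",
             catchup=False) as dag:
        transform = PythonOperator(task_id="spark_transform",
                                   python_callable=run_spark_transform)
        upload = PythonOperator(task_id="upload_to_s3",
                                python_callable=upload_to_s3)
        load = PythonOperator(task_id="bulk_load",
                              python_callable=trigger_bulk_load)

        transform >> upload >> load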

Issues with Compaction

As mentioned before, Mussel uses HRegion as the key-value store on each storage node and HRegion is built using Log Structured Merge Trees (LSM Trees).

A common issue with LSM Trees is the compaction process. When you delete a key/value pair from an LSM Tree, the pair is not immediately deleted. Instead, it’s marked with a tombstone. Later, a compaction process will search through the LSM Trees and bulk delete all the key/value pairs that are marked.

This compaction process will cause higher CPU and memory usage and can impact read/write performance of the cluster.
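
To make the tombstone idea concrete, here's a toy compaction pass over sorted segments: it keeps only the latest version of each key and drops keys whose latest version is a tombstone. Again, this is illustrative only, not HRegion's implementation.

    # Toy compaction: merge sorted segments (oldest first, newest last), keep
    # only the latest version of each key, and drop tombstoned keys entirely.
    TOMBSTONE = object()

    def compact(segments):
        latest = {}
        for segment in segments:            # newer segments overwrite older ones
            for key, value in segment:
                latest[key] = value
        # Dropping tombstoned keys is the work that burns CPU and memory.
        return sorted((k, v) for k, v in latest.items() if v is not TOMBSTONE)

    old = [("a", 1), ("b", 2), ("c", 3)]
    new = [("b", TOMBSTONE), ("c", 30)]     # "b" was deleted, "c" was updated
    print(compact([old, new]))              # [('a', 1), ('c', 30)]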

To deal with this, Airbnb split the storage nodes in Mussel into online and batch nodes. Both online and batch nodes serve write requests, but only online nodes serve read requests.

Online nodes are configured to delay the compaction process, so they can serve reads with very low latency (not hampered by compaction).

Batch nodes, on the other hand, run compaction to remove any deleted data.

Then, Airbnb will rotate these online and batch nodes on a daily basis, so batch nodes will become online nodes and start serving reads. Online nodes will become batch nodes and can run their compaction. This rotation is managed with Apache Helix.

For more details, read the full blog post here.



Tech Snippets

  • Techniques to Improve Reliability with Language Models from OpenAI - If you've played around with GPT-3 and ChatGPT, you've probably heard about prompting. For some tasks, the language model will succeed or fail based on how the input prompt is phrased. With simple math problems, GPT-3 will often fail if you just ask it the problem directly, but it will succeed if you add the phrase "Let's think step by step". This blog post goes through more techniques that OpenAI uses to increase the reliability of answers.

  • Building Reliable Distributed Systems in NodeJS - Durable execution is a method of building distributed systems where you run code and persist each step taken by the code. This way, if the container running the code dies, then it can automatically continue running in another container with all state intact. Systems for durable execution will include functionality to perform retries and timeouts. Real world examples include Azure Durable Functions, Amazon SWF, Uber Cadence and more. This is a great blog post on building a durable execution system in NodeJS.

  • Minimizing the Downsides of Dynamic Programming Languages - JavaScript and Python are consistently among the top five most popular languages, so it's very likely that you'll be working with a dynamic language. Daniel Orner wrote a great post for the Stack Overflow Engineering blog with tips on how to improve your experience with these languages, including using type hints, being more explicit than you need to be, and limiting the use of frameworks.

Strategies for Performing Large Scale Migrations

Recently, Spotify went through a large migration of their mobile codebase, making changes so that different features can be developed in isolated environments.

Mariana Ardoino and Raul Herbster are engineering leaders at Spotify and they wrote a great blog post on the Spotify Engineering blog on Do’s and Don’ts when you’re doing a large-scale migration. They go into some key challenges that you’ll face and how you should deal with them.

You can read the full blog post here.

Here’s a summary

Challenge 1 - Defining the Scope

The scope of the migration may feel massive, with many use cases that need to be addressed. It might be difficult to know where to begin and you may have stakeholders reaching out to you with questions on what they’re supposed to do.

Do’s 

  • Define your goals - Talk to your users and figure out what you’re doing, why you’re doing it and how you’re going to do it.

  • Start small - Begin with a proof of concept and validate it with stakeholders. Go through the alpha, beta and general availability product life cycles. As you go through these stages, talk to users and use their feedback to add additional functionality.

Don’ts

  • Start a large migration without an estimated roadmap and a clear definition of success

Challenge 2 - Scaling Up

If you’re in a large organization, there might be many teams who are affected by the migration. This can cause progress to slow down and overwhelm stakeholders with the ongoing changes.

Do’s

  • Communicate - Keep the audience in the loop by sharing progress through newsletters and workplace posts

  • Implement Spike Weeks - Partner with specific teams and dedicate a few days (or a week) to work on migrating them to the new solution.

Challenge 3 - Competing Priorities

How do you deal with competing priorities in the organization? Stakeholders might not see the importance of the platform migration and give it a low priority compared to their own projects.

Do’s

  • Motivate - Showcase the positive impact of your migration to stakeholders to encourage them to get the migration work done.

  • Continuously Evaluate - Have regular checkpoints during the quarter to evaluate the speed of the migration and show if you're reaching your quarterly, half-yearly or yearly goals.

  • Take the Pain On-Platform - When possible, make required changes on behalf of the stakeholder teams so they can focus on their own work

Challenge 4 - Being Accountable

How do you avoid missing deadlines and having a migration that takes way longer than predicted?

Do’s

  • Clarify the Definition of Done - Have clear metrics, data and graphs that show your progress in the migration and what it means to be successful.

  • Use Dashboards - Metrics and dashboards help communicate progress and impact to stakeholders. Keeping them engaged helps them prioritize the migration work.

  • Maintain the Timeline - Continuously keep the roadmap up to date. If deadlines are missed, then update that in the timeline and keep stakeholders informed. This increases transparency, enables feedback and helps identify future roadblocks.

For more details, check out the full blog post here.

How did you like this summary?

Your feedback really helps me improve curation for future emails. Thanks!

Login or Subscribe to participate in polls.

Tips on Building a Culture Around Automated Testing

Maxim Schepelin is an Engineering Manager at Booking.com and he wrote a fantastic blog post with questions you should ask yourself when creating a culture of automated testing in your team.

Some teams approach testing with a goal like “In this quarter, we’ll increase test coverage to X%”.

This is suboptimal because the ultimate goal isn't the vanity metric of what percentage of lines are covered by tests.

Instead, the goal is to have a fast feedback loop to validate new changes made to the code, for the whole lifespan of the codebase.

Therefore, things you should do are

  • Understand the expected lifespan of your codebase - How long do you expect the code to stay in production? If it's 5+ years, then you'll probably see many more changes to hardware, tooling, the OS, language versions, etc. Keep these in mind when writing your test cases and incorporate them into your testing where possible.

  • Identify hotspots that change often - Identify the hotspots where code changes frequently and start by writing comprehensive tests to cover those areas. This is why solely looking at code coverage is misleading: it ignores change frequency. If you have high test coverage but don't cover any of the hotspots in the codebase, your tests won't help reduce failures.

  • Test for potential failures - Test coverage only looks at whether a given line of code is exercised by a test case; it doesn't measure test completeness. Try to cover all possible paths of execution with your tests, especially for hotspots in the codebase, and have tests for edge cases (see the short sketch after this list).
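
As a tiny, hypothetical illustration of coverage vs. completeness: the single happy-path test below exercises most of the lines in the function but none of its failure cases, while the parametrized edge-case tests catch the invalid inputs.

    # Hypothetical example: the happy-path test covers the main path of
    # apply_discount, but only the parametrized edge cases hit the failure branch.
    import pytest

    def apply_discount(price, percent):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        return round(price * (1 - percent / 100), 2)

    def test_happy_path():                       # main path only
        assert apply_discount(100.0, 10) == 90.0

    @pytest.mark.parametrize("price,percent", [
        (100.0, -5),                             # negative discount
        (100.0, 150),                            # discount above 100%
    ])
    def test_rejects_invalid_percent(price, percent):
        with pytest.raises(ValueError):
            apply_discount(price, percent)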

The goal of testing is to increase confidence in the codebase and make it easier to iterate. Focusing on things like test completeness and ensuring that hotspots in your codebase are well tested will help give you a fast feedback loop.

For more details, read the full blog post by Maxim here.