How Uber Scaled Cassandra to Tens of Thousands of Nodes
An introduction to Apache Cassandra and issues Uber ran into when scaling. Plus, API versioning practices at LinkedIn, challenging algorithms to try and more.
Today we'll be talking about
How Uber Scaled Cassandra to Tens of Thousands of Nodes
Why use Cassandra
What is a “Wide Column” database
How Cassandra is Decentralized
Issues Uber faced with Unreliable Node Replacement
How Uber fixed issues with Cassandra’s Transactions
Challenging Algorithms and Data Structures to Try
How to Review Code as a Junior Developer
API versioning practices at LinkedIn
Project-based Learning Tutorials
How Uber Scaled Cassandra to Tens of Thousands of Nodes
Cassandra is a NoSQL, distributed database created at Facebook in 2007. The initial project was heavily inspired by Google Bigtable and also took many ideas from Amazon’s Dynamo (Avinash Lakshman was one of the creators of Dynamo and he also co-created Cassandra at Facebook).
Similar to Google Bigtable, it’s a wide column store (I’ll explain what this means - a wide-column store is not the same thing as a column-store) and it’s designed to be highly scalable. Users include Apple, Netflix, Uber, Facebook and many others. The largest Cassandra clusters have tens of thousands of nodes and store petabytes of data.
Why use Cassandra?
Here are some reasons why you’d use Cassandra.
Just to reiterate, Cassandra is completely distributed and can scale to a massive size. It has out-of-the-box support for things like distributing/replicating data across different locations.
Additionally, Cassandra has a decentralized architecture where there is no single point of failure. Having primary/coordinator nodes can often become a bottleneck when you’re scaling your system, so Cassandra solves this by having every node in the cluster be exactly the same. Any node can accept a write/read and the nodes communicate using a Gossip Protocol in a peer-to-peer way.
I’ll discuss this decentralized nature more in a bit.
Write Heavy Workloads
Data workloads can be divided into read-heavy (ex. a social media site like Twitter or Quora, where users consume far more than they post) or write-heavy (ex. a logging system that collects and stores metrics from all your servers).
Cassandra is optimized for write-heavy workloads, so it makes tradeoffs to handle a very high volume of writes per second.
One example of a trade off they make is with the data structure for the underlying storage engine (the part of the database that’s responsible for reading and writing data from/to disk). Cassandra uses Log Structured Merge Trees (LSM Trees). We’ve talked about how they work and why they’re write-optimized in a prior article (for Quastor Pro readers).
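To make the write-optimized idea concrete, here’s a minimal sketch of an LSM-tree storage engine. This is illustrative only, not Cassandra’s actual implementation: the class name, the tiny memtable limit, and the in-memory “SSTables” are all made up for the example.

```python
# Minimal LSM-tree sketch (illustrative, not Cassandra's real engine):
# writes land in an in-memory memtable, which is periodically flushed
# as an immutable sorted run (an "SSTable").
class LSMTree:
    def __init__(self, memtable_limit=4):
        self.memtable = {}       # in-memory buffer: fast, write-only path
        self.sstables = []       # flushed, sorted immutable runs (newest last)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch memory -- this is why LSM trees are write-optimized.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Flushing is a sequential append of sorted data (cheap on disk).
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Reads may check the memtable plus several runs, newest first --
        # the read-amplification tradeoff LSM trees accept.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):
            for k, v in run:
                if k == key:
                    return v
        return None

db = LSMTree()
for i in range(10):
    db.put(f"key{i}", i)
print(db.get("key3"))   # -> 3
```

A real engine would also compact overlapping runs in the background and use bloom filters to skip SSTables that can’t contain the key.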
Cassandra is highly customizable so you can configure it based on your exact workload. You can change how the communications between nodes happens (gossip protocol), how data is read from disk (LSM Tree), the consensus between nodes for writes (consistency level) and much more.
The obvious downside with this is that you need significant understanding and expertise to push Cassandra to the limits of its capabilities. If you put me in charge of your Cassandra cluster, then things would go very, very poorly.
So, now that you have an understanding of why you’d use Cassandra, we’ll delve into some of the characteristics of the database.
Cassandra is a partitioned row store database. Like a relational database, your data is organized into tables with rows and columns. Each row is identified by a primary key and the row can store an arbitrary number of columns with whatever data you’d like.
However, that’s where the similarities end.
With a relational database, the columns for your tables are fixed for each row. Space is allocated for each column of every row, regardless of whether it’s populated (although the specifics differ based on the storage engine).
Cassandra is different.
You can think of Cassandra as using a “sorted hash table” to store the underlying data (the actual on-disk data structure is called an SSTable). As data gets written for each column of a row, it’s stored as a separate entry in the hash table. With this setup, Cassandra provides the ability for a flexible schema, similar to MongoDB or DynamoDB. Each row in a table can have a different set of columns based on that row’s characteristics.
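The “one entry per column of a row” idea is easy to sketch. This toy class (all names are invented for illustration) keys every value by the pair (row key, column name), which is what lets two rows in the same table carry completely different columns:

```python
# Sketch of the "sorted hash table" behind a wide-column store
# (illustrative): each (row_key, column_name) pair is its own entry,
# so there's no fixed schema shared by all rows.
class WideColumnTable:
    def __init__(self):
        self.entries = {}   # (row_key, column_name) -> value

    def put(self, row_key, column, value):
        self.entries[(row_key, column)] = value

    def get_row(self, row_key):
        # Gather this row's columns in sorted order, the way an
        # SSTable would physically lay them out.
        return {col: val for (rk, col), val in sorted(self.entries.items())
                if rk == row_key}

t = WideColumnTable()
t.put("user:1", "name", "Ada")
t.put("user:1", "email", "ada@example.com")
t.put("user:2", "name", "Linus")
t.put("user:2", "last_login", "2024-01-01")   # different columns, same table
```

Note that `user:1` and `user:2` end up with different column sets, which a relational table with a fixed schema wouldn’t allow without NULL-filled columns.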
Important Note - You’ll see Cassandra described as a “Wide-Column” database. When people say that, they’re referring to the ability to use tables, rows and columns but have varying columns for each row in the same table.
This is different from a Columnar database (where data in the same column is stored together on disk). Cassandra is not a column-oriented database. Here’s a great article that delves into this distinction.
For some reason, there’s a ton of websites that describe Cassandra as being column-oriented, even though the first sentence of the README on the Cassandra GitHub repo will tell you that’s not the case. Hopefully someone at Amazon can fix the AWS site where Cassandra is incorrectly listed as column-oriented.
Cassandra wouldn’t be very useful if it weren’t distributed. It’s not the type of database you might spin up to store your blog posts.
At Uber, they have hundreds of Cassandra clusters, ranging from 6 to 450 nodes per cluster. These store petabytes of data and span across multiple geographic regions. They can handle tens of millions of queries per second.
Cassandra is also decentralized, meaning that no central entity is in charge and read/write decisions are instead made only by holders of Cassandra Coin. If you’re interested in spending less time reading software architecture newsletters and more time vacationing in Ibiza, then I’d recommend you add Cassandra Coin to your portfolio.
By decentralized, I mean there is no single point of failure. All the nodes in a Cassandra cluster function exactly the same, so there’s “server symmetry”. There is no separate designation between nodes like Primary/Secondary, NameNode/DataNode, Master/Worker, etc. All the nodes are running the same code.
Instead of having a central node handling orchestration, Cassandra relies on a peer-to-peer model using a Gossip protocol. Each node will periodically send state information to a few other nodes in the cluster. This information gets relayed to other nodes (and spread throughout the network), so that all servers have the same understanding of the cluster’s state (data distribution, which nodes are up/down, etc.).
In order to write data to the Cassandra cluster, you can send your write request to any of the nodes. Then, that node will take on a Coordinator role for the write request and make sure it’s stored and replicated in the cluster.
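The reason any node can coordinate is that replica placement is a pure function of the partition key: hash the key onto a token ring and walk clockwise. This sketch uses invented node names and a simplified ring (real Cassandra uses Murmur3 tokens and virtual nodes), but the mechanics are the same:

```python
import hashlib

# Illustrative token ring: any coordinator hashes the partition key and
# forwards the write to the next RF distinct nodes clockwise on the ring.
def token(key):
    # Simplified stand-in for Cassandra's partitioner (which uses Murmur3).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 1000

class Cluster:
    def __init__(self, nodes, replication_factor=3):
        # Each node owns one token; sorting the tokens forms the ring.
        self.ring = sorted((token(n), n) for n in nodes)
        self.rf = replication_factor

    def replicas_for(self, partition_key):
        t = token(partition_key)
        # First ring position at or after the key's token (wrap to 0).
        idx = next((i for i, (tok, _) in enumerate(self.ring) if tok >= t), 0)
        return [self.ring[(idx + i) % len(self.ring)][1] for i in range(self.rf)]

cluster = Cluster([f"node{i}" for i in range(6)])
# Whichever node receives this write, it computes the same replica set:
print(cluster.replicas_for("user:42"))
```

Since every node derives the same replica set from the same key, no central lookup service is needed for routing.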
To learn more about Cassandra, here are some starting resources.
Cassandra at Uber
Uber has been using Cassandra for over 6 years to power transactional workloads (processing user payments, storing user ratings, managing state for your 2am order of Taco Bell, etc.). Here are some stats to give a sense of the scale
Tens of millions of queries per second
Petabytes of data
Tens of thousands of nodes
Hundreds of unique Cassandra clusters, ranging from 6 to 450 nodes per cluster
They have a dedicated Cassandra team within the company that’s responsible for managing, maintaining and improving the service.
In order to integrate Cassandra at the company, they’ve made numerous changes to the standard setup
Centralized Management - Uber added a stateful management system that handles the cluster orchestration/configuration, rolling restarts (for updates), capacity adjustments and more.
Forked Clients - Forked the Go and Java open-source Cassandra clients and integrated them with Uber’s internal tooling
For more details on how Uber runs Cassandra, read the full article.
Issues Uber Faced while Scaling Cassandra
Here are some of the issues Uber faced while scaling their system to one of the largest Cassandra setups out there. The article gives a good sense of the problems you might face when scaling a distributed system and what kind of debugging approaches could work.
Unreliable Node Replacement
In any distributed system, a common task is to replace faulty nodes. Uber has tens of thousands of nodes with Cassandra, so replacing nodes happens very frequently.
Reasons why you’ll be replacing nodes include
Hardware Failure - A hard drive fails, network connection drops, some dude spills coffee on a server, etc.
Fleet Optimization - Improving server hardware, installing a software update, etc.
Deployment Topology Changes - Shifting data from one region to another.
Disaster Recovery - Some large scale system failure causes a significant number of nodes to go down.
Uber started facing numerous issues with graceful node replacement with Cassandra. The process to decommission a node was getting stuck or a node was failing to get added. They also faced data inconsistency issues with new nodes.
These issues happened with only a small percentage of node replacements, but at Uber’s scale even a small percentage causes headaches. If Uber had to replace 500 nodes a day with a 5% error rate, that’s 25 failed replacements per day, enough manual work to keep 2 full-time engineers busy handling node replacement failures.
How Uber Solved the Node Replacement Issue
When you’re writing data to Cassandra, the database uses a concept called Hinted Handoff to improve reliability.
If replica nodes are unable to accept the write (due to routine maintenance or a failure), the coordinator will store temporary hints on their local filesystem. When the replica comes back online, it’ll transfer the writes over.
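The mechanic is easy to sketch. This toy version (class names and in-memory hint storage are invented for illustration; real Cassandra writes hints to local files) shows a coordinator buffering a write for a down replica and replaying it later:

```python
# Sketch of hinted handoff (illustrative, heavily simplified): writes for
# an unreachable replica are buffered as "hints" on the coordinator and
# replayed when the replica comes back online.
class Replica:
    def __init__(self):
        self.up = True
        self.data = {}

class Coordinator:
    def __init__(self):
        self.hints = {}   # replica -> buffered (key, value) writes

    def write(self, replica, key, value):
        if replica.up:
            replica.data[key] = value
        else:
            # Replica unreachable: store a hint instead of losing the write.
            self.hints.setdefault(replica, []).append((key, value))

    def replay_hints(self, replica):
        # Runs when the replica rejoins. If it *never* rejoins, the hints
        # linger forever -- the garbage-accumulation problem Uber hit.
        for key, value in self.hints.pop(replica, []):
            replica.data[key] = value

r = Replica()
c = Coordinator()
c.write(r, "k1", "v1")
r.up = False
c.write(r, "k2", "v2")   # buffered as a hint, not written to the replica
r.up = True
c.replay_hints(r)
print(r.data)            # both writes now present on the replica
```

The bug described below is exactly the “never rejoins” branch: nothing in this flow deletes hints for a replica that is permanently gone.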
However, Cassandra was not cleaning up hint files for dead nodes. If a replica node goes down and never rejoins the cluster, then the hint files are not purged.
Even worse, when a node is decommissioned, it transfers all its stored hint files to the next node. Over time, this can result in terabytes of garbage hint files.
To compound the problem, the code path for decommissioning a node has a rate limiter (you don’t want the decommission process hogging all your network bandwidth). Transferring these terabytes of hint files over the network was therefore getting rate limited, making some node decommission processes take days.
Uber solved this by
Proactively purging hint files belonging to deleted nodes
Dynamically adjusting the hint transfer rate limiter so that it increases transfer speeds when there’s a big backlog.
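As a rough sketch of the second fix, a backlog-aware rate limiter might scale the transfer rate with the size of the pending hint backlog. Uber’s actual implementation isn’t public, so the function, thresholds, and rates below are all invented for illustration:

```python
# Backlog-aware rate limit sketch (illustrative; Uber's real limiter is
# not public). The allowed transfer rate grows with the hint backlog,
# capped at a maximum, so large backlogs drain in hours instead of days.
def transfer_rate(backlog_bytes, base_rate=10_000_000, max_rate=200_000_000):
    # Double the base rate for every full GiB of backlog, up to the cap.
    gib = backlog_bytes / 2**30
    return min(max_rate, base_rate * max(1, 2 ** int(gib)))

print(transfer_rate(0))             # small backlog: base rate applies
print(transfer_rate(3 * 2**30))     # 3 GiB backlog: rate scales up
print(transfer_rate(100 * 2**30))   # huge backlog: capped at max_rate
```

The key design point is the feedback loop: the limiter protects bandwidth under normal conditions but stops being the bottleneck exactly when the backlog (and thus the decommission delay) is largest.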
Error Rate of Cassandra’s Lightweight Transactions
Lightweight Transactions (LWTs) allow you to perform conditional update operations. This lets you insert/update/delete records based on conditions that evaluate the current state of the data. For example, you could use an LWT to insert a username into the table only if the username hasn’t been taken already.
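The username example boils down to compare-and-set semantics, which this sketch mimics with a single in-memory table (names invented for illustration). In real Cassandra, the CQL is `INSERT ... IF NOT EXISTS`, and the condition check plus write are agreed on by replicas via a Paxos round rather than a local `if`:

```python
# Compare-and-set sketch of a lightweight transaction (illustrative;
# real LWTs coordinate the check-then-write across replicas via Paxos).
# CQL equivalent: INSERT INTO users (username, ...) VALUES (...) IF NOT EXISTS
class UserTable:
    def __init__(self):
        self.rows = {}

    def insert_if_not_exists(self, username, profile):
        # The condition and the write must be applied atomically; a plain
        # read-then-write would let two clients claim the same username.
        if username in self.rows:
            return False, self.rows[username]   # applied=false + existing row
        self.rows[username] = profile
        return True, profile

users = UserTable()
print(users.insert_if_not_exists("ada", {"name": "Ada"}))       # applied
print(users.insert_if_not_exists("ada", {"name": "Imposter"}))  # rejected
```

The consensus round is also why LWTs are much more expensive than normal writes, and why they’re sensitive to cluster-state problems like the range-movement issue described below.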
However, Uber was suffering from a high error rate with LWTs. They would often fail due to range movements, which can occur when multiple node replacements are being triggered at the same time.
After investigating, Uber was able to trace the error to the Gossip protocol between nodes. The protocol was continuously attempting to resolve the IP address of a replaced node, causing its caches to become out of sync and leading to failures in Lightweight Transactions.
The team fixed the bug and also improved error handling inside the Gossip protocol. Since then, they haven’t seen the LWT issue in over 12 months.
For more details on scaling issues Uber faced with Cassandra, read the full article here.