How Mixpanel Fixed Their Load Balancing Problem

Mixpanel used a simple technique called The Power of 2 Choices to solve their workload skew issue.

Hey Everyone!

Today we’ll be talking about

  • How Mixpanel Fixed Their Load Balancing Problem

    • An overview of Mixpanel's in-house database

    • How an unequal workload was giving Mixpanel load balancing problems

    • Solving this using the Power of 2 Choices load balancing technique

  • Tech Snippets

    • Design Patterns and Principles for Large Scale Systems

    • Examples of Well Written Documentation

    • The Best Design Tools and Plugins

    • How Big Tech Companies Think About Adopting Tech


How Mixpanel Fixed Their Load Balancing Problem

Mixpanel is an analytics product that you can embed into your website/app to get detailed data on how your users are behaving (similar to Google Analytics).

In order to best serve their users, Mixpanel needs to support real-time event ingestion while also supporting fast analytical queries over all of a user's history.

When a user on a Mixpanel-tracked website clicks a button or navigates to a new page, that event needs to be ingested and stored in Mixpanel in under a minute (real-time event ingestion).

If a Mixpanel customer wants to see the number of sign-up conversion events over the past 6 months, they should be able to query that data quickly (fast analytical queries).

Mixpanel accomplishes this with their in-house database, Arb. Arb leverages both row-oriented and column-oriented data formats: the row-oriented format works better for real-time event ingestion, while the column-oriented format works well for analytical queries. This is based on the classic Lambda Architecture, where you have a speed layer for real-time views and a batch layer for historical data.
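To make the distinction concrete, here's a toy illustration (my own example, not Arb's actual file format) of the same three events in each layout:

# Row-oriented: each event is stored together, which makes appending
# new events cheap as they arrive.
row_format = [
    {"user": "u1", "event": "click", "ts": 1},
    {"user": "u2", "event": "signup", "ts": 2},
    {"user": "u1", "event": "click", "ts": 3},
]

# Column-oriented: each field is stored together, so an analytical query
# can scan just the columns it needs.
column_format = {
    "user": ["u1", "u2", "u1"],
    "event": ["click", "signup", "click"],
    "ts": [1, 2, 3],
}

signups = column_format["event"].count("signup")  # touches only one column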

If you're interested in learning more about Mixpanel's system architecture, you can read about it here.

In order to convert data from row format to a columnar format, Mixpanel has a service called Compacter.

Vijay Jayaram was the Principal Tech Lead Manager of the Performance team at Mixpanel, and he wrote a great blog post on technical challenges the company faced when scaling the Compacter service and how they overcame them.

Here’s a Summary

End-user devices with a Mixpanel tracker send event data to Mixpanel’s API and these events are pushed onto queues.

This data gets pushed onto Mixpanel’s storage system, where storage nodes will write the events to disk using a row-oriented format.

Then, the Compacter service will convert the data from row format to columnar format, making it faster to query.

Given the nature of the work, the Compacter service is very computationally intensive. It runs in an autoscaling nodepool on Google Kubernetes Engine.

When a storage node has a row file of a certain size/age, it will send a request to a randomly selected compacter node to convert it. The compacter node will then return a handle to the resulting columnar file.

If a compacter node has too many requests, then it’ll load shed and return an error. The storage node will retry after a backoff period.
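As a rough sketch (with hypothetical names, not Mixpanel's actual code), the storage node side of this scheme might look like:

import random
import time

class CompacterBusy(Exception):
    """Raised when a compacter node sheds load."""

def compact(row_file, compacter_nodes, max_retries=5):
    backoff = 0.1
    for _ in range(max_retries):
        # Randomized static load balancing: pick one compacter uniformly at random.
        node = random.choice(compacter_nodes)
        try:
            # On success, the compacter returns a handle to the columnar file.
            return node.convert_to_columnar(row_file)
        except CompacterBusy:
            # The compacter shed our request; retry after a backoff period.
            time.sleep(backoff)
            backoff *= 2  # exponential backoff (an assumption)
    raise CompacterBusy("all retries exhausted")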

A Skew Problem

Mixpanel engineers were having a great deal of trouble scaling the compacter in time to absorb spikes in load. The compacter service failed to autoscale, and this resulted in a spike in errors (as storage node requests were getting shed and the retries were also getting shed).

Engineers would have to manually set the autoscaler's minimum number of nodes to a higher value to deal with the load. This wasted engineering time and also resulted in inefficient provisioning.

When Mixpanel looked at the average utilization of nodes in the compacter service, they expected it to be at 80-90%. This would mean that the compute provisioned in the service was being used efficiently.

However, they found that average CPU utilization was ~40%. They checked the median utilization and the 90th percentile utilization to find that while median utilization was low, the 90th percentile utilization was near 80%.

This meant that half the compacter nodes provisioned were doing little work, while the top 10% of nodes were maxed out.

This also explained why the autoscaling was misbehaving: the autoscaling algorithm was using average utilization to make its scaling decisions, and the low average masked the maxed-out top 10% of nodes.

Cause for Skew

Engineers were confused about why there was a skew since the storage nodes were randomly selecting compacter nodes based on a uniform random distribution (Randomized Static load balancing). Each compacter node was equally likely to be selected for a row-to-column conversion job.

However, the individual jobs had a very uneven distribution of computational load, and this caused a large work skew between the compacter nodes. Mixpanel has a vast range of customers, from startups with thousands of events per day to large companies with billions of events per day.

This meant that job sizes followed a power law distribution, where the largest jobs were orders of magnitude bigger than typical ones. Some compacter nodes were getting significantly more time-consuming jobs than other nodes, and this is what caused the work skew between the nodes.

Unequal job sizes would also present problems for many other load balancing algorithms, like Round Robin.

The Power of 2 Choices

Mixpanel considered several solutions, including inserting a queue between the storage nodes and the compacters or introducing a more complex load balancing service. You can check out the full post to read about these options.

They went with a far simpler solution: a popular randomized load balancing strategy called The Power of 2 Choices.

Instead of the storage nodes randomly picking 1 compacter, they randomly pick 2 compacter nodes. Then, they ask each node for its current load and send the request to the less loaded of the two.

There have been quite a few papers on this strategy, and it's been found to drastically reduce the maximum load compared to having just one choice. It's used quite frequently with load balancers like NGINX. Mixpanel wrote some quick Python simulations to confirm their intuition about how The Power of 2 Choices worked.
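Here's a quick simulation in the same spirit (my own sketch, not Mixpanel's actual code). It assigns heavy-tailed, power-law-sized jobs to 100 nodes under each strategy and compares the median and 90th percentile load:

import random
import statistics

NUM_NODES = 100
NUM_JOBS = 10_000

def simulate(two_choices):
    load = [0.0] * NUM_NODES
    for _ in range(NUM_JOBS):
        # Pareto-distributed job cost: most jobs are tiny, a few are huge.
        job = random.paretovariate(1.5)
        if two_choices:
            # Power of 2 Choices: sample two nodes, pick the less loaded one.
            a, b = random.sample(range(NUM_NODES), 2)
            target = a if load[a] <= load[b] else b
        else:
            # Randomized static: pick one node uniformly at random.
            target = random.randrange(NUM_NODES)
        load[target] += job
    load.sort()
    return statistics.median(load), load[int(NUM_NODES * 0.9)]

for name, two_choices in [("random", False), ("2 choices", True)]:
    median, p90 = simulate(two_choices)
    print(f"{name}: median load = {median:.1f}, p90 load = {p90:.1f}")

Running this, the two-choices strategy should show a much smaller gap between the median and 90th percentile load than uniform random selection.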

Implementing this in their system was extremely easy, and it ended up massively closing the gap between the median utilization and the 90th percentile utilization.

Average utilization increased to 90% and the error rate dropped to nearly 0 in steady state since the compacters rarely had to shed load.

For more details, you can read the full blog post here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.


If you find Quastor useful, you should check out Pointer.io

It's a reading club for software developers that sends out super high quality engineering-related content.

It's read by CTOs, engineering managers, and senior engineers so you should definitely sign up if you're on that path (or if you want to go down that path in the future).

It's completely free!

sponsored (but I'm a big fan of Pointer)

Tech Snippets

  • Why Don’t You Use… - A great list of reasons why a Big Tech company might avoid using technology XYZ. There are obviously good reasons (lacking features, not performant, too expensive), but there can also be bad, underlying reasons (not-invented-here syndrome, corporate politics, etc.). The article goes through many of these reasons and how companies think about adopting new tech.

  • Well Written Documentation - When choosing a library/framework/tool, one important factor is how well written the documentation is. So if you’re building something for developers, you should have an idea of the principles behind well-documented tools. This repo lists tools/frameworks/languages with well written documentation, so you can see some great examples for yourself.

  • The best design tools and plugins for everything - If you’re working on a website for yourself, you’ll quickly realize how difficult it is to design something that looks nice. This is a great GitHub repo with a list of resources for web design: gradient tools, icon creators, color pickers, and websites that’ll teach you design principles.

Interview Question

Write a function to calculate pow(x, n), which is x raised to the power n.

Previous Question

As a reminder, here’s our last question

The variance of a string is defined as the largest difference between the number of occurrences of any 2 characters present in the string.

The string "aaabbc" has a variance of 2 since a appears 3 times and c appears 1.

Note that the two characters may or may not be the same. For example, the string "a" has a variance of 0.

Given a string s consisting of lowercase English letters only, return the largest variance possible among all substrings of s.

Solution

To solve this question, we'll first look at the simpler case where the string only has 2 different characters.

We can calculate the maximum possible variance of those 2 characters in the string using a variant of Kadane's algorithm.

Kadane's algorithm allows you to calculate the maximum subarray sum of an array in linear time. Here's my favorite video on Kadane's algorithm; I'd highly recommend watching it first if you don't know how the algorithm works.
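If you just want the gist, here's a minimal sketch of the classic version:

def maxSubarraySum(nums):
    # Kadane's algorithm: maximum subarray sum in O(n).
    best = current = nums[0]
    for x in nums[1:]:
        # Either extend the running subarray or start fresh at x.
        current = max(x, current + x)
        best = max(best, current)
    return best

print(maxSubarraySum([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # 6, from [4, -1, 2, 1]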

Let's say the 2 characters in our string are c1 and c2 and we want to find the largest possible difference between the number of c1's and c2's in all the substrings in our string.

We can convert all the c1's in our string to 1s and all the c2's in our string to -1s.

This turns the question into a variant of the maximum subarray problem. We have a string of 1s and -1s, and we want to find the substring with the largest difference between the two counts, i.e. the substring with the maximum subarray sum.

The substring with the maximum subarray sum will have the largest number of c1's and the fewest number of c2's (hence the largest difference between c1's and c2's).

However, a key difference is that we must include both c1's and c2's in our substring, so we need at least one c1 and at least one c2. We'll have to adjust Kadane's algorithm for this.

We can do that by adding two boolean flags: has_c2 and first_c2.

has_c2 makes sure that we have at least one c2 in our substring. first_c2 tracks whether the first character in our substring is a c2.

If first_c2 is true and we come across another c2 in our substring, then we can immediately drop the first occurrence of c2 (the first character in our substring), which increases the substring's variance by 1.

Here's a Python 3 sketch of this approach: findVariance finds the maximum possible variance between two given characters across all of a string's substrings.
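def findVariance(s, c1, c2):
    # Largest value of count(c1) - count(c2) over all substrings of s
    # that contain at least one c2 (a variant of Kadane's algorithm).
    max_variance = 0
    variance = 0      # running count(c1) - count(c2) of the current substring
    has_c2 = False    # current substring contains at least one c2
    first_c2 = False  # current substring starts with a c2

    for ch in s:
        if ch == c1:
            variance += 1
        elif ch == c2:
            if first_c2:
                # Drop the leading c2 (+1) and absorb this new one (-1):
                # the running variance is unchanged.
                first_c2 = False
            else:
                variance -= 1
            has_c2 = True
            if variance < 0:
                # Like Kadane's reset: start a fresh substring at this c2.
                variance = -1
                first_c2 = True
        if has_c2:
            max_variance = max(max_variance, variance)

    return max_variance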

Now we can tackle the entire problem.

We have to find the largest variance possible among any two characters in our string.

Therefore, we'll create a list of all the unique characters in our string and then iterate through every pair of them to find the maximum variance.

For each pair, we'll have to run the findVariance function twice: once with the first character as c1 and the second as c2, and once with the roles swapped.

Here's the full Python 3 sketch:
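def largestVariance(s):
    # Try every ordered pair of distinct characters (this covers running
    # findVariance both ways around for each pair) and take the best result.
    chars = sorted(set(s))
    best = 0
    for c1 in chars:
        for c2 in chars:
            if c1 != c2:
                best = max(best, findVariance(s, c1, c2))
    return best

print(largestVariance("aaabbc"))  # 2
print(largestVariance("a"))       # 0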