How Grab Implemented Rate Limiting

We'll talk about Bloom Filters, Roaring Bitmaps and the Sliding Window Rate Limiting Algorithm. Plus, Basecamp's first year of results after shifting away from the cloud.

December 20, 2023

Hey Everyone!

Today we’ll be talking about

How Grab Implemented Rate Limiting
- Grab needs to rate limit the number of notifications a user gets from the app. Too many notifications and the user might get annoyed.
- They sort users into different groups and keep track of them with the Roaring Bitmap data structure (based on compressing bitmaps in a clever way). We’ll also talk about Bloom Filters and why Grab didn’t go that route
- Using the Sliding Window Rate Limiting algorithm and Redis’s sorted set data structure to keep track of how many notifications a user has gotten.
Tech Snippets
- Basecamp’s First Year Results from Shifting Away from the Cloud
- How to quickly get into flow when coding
- How to scale an app to 10 million users on AWS
- Making $30k+ in side income by Writing Online as a Software Engineer
- The Importance of Linchpins, and Why You Should Become One

How Grab Implemented Rate Limiting

Grab is one of the largest companies in Southeast Asia with a market cap of over $12 billion USD. They’re a “super app”, where you can use Grab to book a taxi, order food delivery, send money to friends, get a loan and much more.

Having all these verticals means that different divisions in the company can sometimes compete for user attention.

A product manager in the Grab Food division might want to send users a notification with a 10% off coupon on their next food delivery. Meanwhile, someone in Grab Taxi might try to send users a notification on how they just added a new Limousine option.

Bombarding users with all these notifications from the different divisions is a really great way to piss them off. This could result in users turning off marketing communication for the Grab app entirely.

Therefore, Grab has strict rate limiting in place for the amount of notifications they can send to each user.

Naveen and Abdullah are two engineers at Grab and they wrote a terrific blog post delving into how this works.

Rate Limiting Communication to Users

Grab has different rate limiting levels depending on how often a user uses the app.

If someone just downloaded the app and is only using it for food delivery, then they should receive a small amount of notifications.

On the other hand, if someone is a Grab power-user and they’re constantly using taxi services, food delivery and other products, then they should have much higher rate limits in place.

To deal with this, Grab broke up their user base into different segments based on the amount of interaction the customer has with the app.

To limit the amount of notifications, they built a rate limiter that would check which segment a user was in, and would rate limit the amount of comms that user was sent based on their interaction with the app.

Grab has close to 30 million monthly users and sends hundreds of millions of notifications per day.

We’ll talk about how Grab built this rate limiter to work at scale.

Storing and Retrieving Member Segment Information

The first core problem Grab was trying to solve was to track membership in a set. They have millions of users and want to place these users into different groups based on characteristics (how much the user uses the app). Then, given a user they want to check if that user is in a certain group.

Because they have tens of millions of active users, they were looking for something more space efficient than a set or a bitmap/bit array. Instead, they wanted something that could track set membership without having memory usage grow linearly with the user base size. They also wanted insertions/lookups for the set to be logarithmic time complexity or better.

Two space-efficient ways of keeping track of which items are in a set are

Bloom Filter
Roaring Bitmaps

We’ll delve into both of these and talk about the trade-offs Grab considered.

Bloom Filter

A bloom filter will tell you with certainty if an element is not in a group of elements. However, it can have false positives, where it indicates an element is in the set when it’s really not.

For example, you might create a bloom filter to store usernames (string value). The bloom filter will tell you with certainty if a certain username has not been added to the set.

On the other hand, if the bloom filter says that a username has been added, then you have to check to make sure it’s not a false positive.

Under the hood, Bloom Filters work with a bit array (an array of 0s and 1s).

You start with a bit array full of 0s. When you want to add an item to the Bloom Filter, then you hash the item to convert it to an integer and map that integer to a bit in your array. You set that bit (flip it to 1).

When you want to check if an item is inside your bit-array, you repeat the process of hashing the item and mapping it to a bit in your array. Then, you check if that bit in the array is set (has a value of 1). If it’s not (the bit has a value of 0), then you know with 100% certainty that the item is not inside your bloom filter.

However, if the bit is set, then all you know is that the item could possibly be in your set. Either the item has already been added to the bloom filter, or another item with the same hashed bit value has been added (a collision). If you want to know with certainty if an item has been added to the group, then you’ll have to put in a second layer to check.

Another downside of Bloom Filters is that you cannot remove items from the Bloom Filter. If you add an item, and later on want to remove it from the filter, you can’t just set the bit’s value from 1 to 0. There might be other values in the Bloom Filter that have hashed to that same bit location, so setting the 1 to a 0 would cause them to be removed as well.

Due to the issues with collisions and the inability to remove items from the set, Grab decided against using Bloom Filters.

Roaring Bitmap

Instead of Bloom Filters, a data structure that can be used to represent a set of integers is a roaring bitmap. This is a much newer data structure and was first published in 2016.

Roaring bitmaps will let you store a set of integers but they solve the two problems with bloom filters

You definitively know whether or not an item is in the roaring bitmap
You can remove items from the roaring bitmap

They’re similar to the bitmap data structure but they’re compressed in a very clever way that makes them significantly more space efficient while also being efficient with insertions/deletions.

Basically, when you try to insert an integer into the roaring bitmap, it will first break the number down into smaller parts (the most significant bits and the least significant bits) and then store each smaller part in a container. This container is either an array, a bitmap or a run-length encoding depending on the specific characteristics of the content of the container.

If you’re interested in delving into exactly how roaring bitmaps work, this is the best article I found that explains them in a simple, visual manner.

Grab developed a microservice that abstracts the roaring bitmap’s implementation and provides an API to check set membership and get a list of all the elements in the set.

Tracking number of Comms sent to each user

After identifying the appropriate segment, they apply the specific limits to the user using a distributed rate limiter.

In order to do this, they have to keep track of all the users who’ve recently gotten notifications and count how many notifications have been sent to each of those users.

For handling this, they tested out Redis (using AWS Elasticache) and DynamoDB.

They decided to go with Redis because

Higher throughput at Lower Latency
Cost Effective
Better at handling Spiky Rate Limiting Workloads at Scale

Using Redis, they implemented the Sliding Window Rate Limiting algorithm.

This is a pretty common rate limiting strategy where you divide up time into small increments (depending on your use case this could be a second, a minute, an hour, etc.) and then count the number of requests sent within the window.

If the number of requests goes beyond the rate-limiting threshold, then the next request will get denied. As time progresses, the window “slides” forward. Any past request that is outside the current window is no longer counted towards the limit.

Grab built this with the Sorted Set data structure in Redis. The algorithm takes O(log n) time complexity where n is the number of request logs stored in the sorted set.

For more details, you can read the full article here.

How To Remember Concepts from Quastor Articles

We go through a ton of engineering concepts in the Quastor summaries and tech dives.

In order to help you remember these concepts for your day to day job (or job interviews), I’m creating flash cards that summarize these core concepts from all the Quastor summaries and tech dives.

The flash cards are completely editable and are meant to be integrated in a Spaced-Repetition Program (Anki) so that you can easily remember the concepts forever.

This is an additional perk for Quastor Pro members in addition to the technical deep dive articles.

Quastor Pro is only $12 a month and you should be able to expense it with your job’s learning & development budget. Here’s an email you can send to your manager. Thanks for the support!

Tech Snippets

How to quickly get into flow when coding

This is a fantastic article with a ton of tips on how you can improve your focus and productivity when writing code.

One key tip is to leave work slightly unfinished at the end of each day. Write down the next steps for your task in a clear, elaborate way and leave that on your desk.

This way, you’ll know exactly what to do when you come into work the next day. Getting into flow will be much easier.

Read the full article for all the tips.

read.engineerscodex.com/p/simple-software-engineering-habits

How to Scale an App to 10 Million Users on AWS

If you’re confused about the myriad of different services on AWS, this is a terrific introduction to some of the tooling you’ll need to scale a web app.

The article goes through the flow of deploying an MVP to AWS Amplify, using EC2, Route 53, Elasticache, S3 and more. It also delves into how you can ensure high uptime with things like deploying to different availability zones, automatic failover and more.

newsletter.systemdesign.one/p/aws-scale

Basecamp’s First Year Results from Shifting Away from the Cloud

Basecamp is a project management platform with tens of millions of users.

Last year, they announced how they were moving away from the cloud towards an on-prem setup. They estimated their cost savings and talked about how the shift could save them $7 million dollars over the next 5 years.

Now, it’s been a year since the first announcement and they’ve written a new post talking about the results they’ve seen since the migration.

world.hey.com/dhh/the-big-cloud-exit-faq-20274010

The Importance of Linchpins, and Why You Should Become One

For a community, a Linchpin is the central person who organizes and holds the group together. This might be by organizing meetups, driving initiatives, fostering connections, etc.

Being a Linchpin is a fantastic way to build strong relationships, get more visibility and eventually have more career (or personal) opportunities.

This is a terrific article that delves into how you can become a Linchpin for your own community.

debliu.substack.com/p/the-importance-of-linchpins-and-why

Making $30k+ in side income by Writing Online as a Software Engineer

Jordan Cutler is a Senior Software Engineer and he started a newsletter earlier this year where he gives advice on career advancement for developers.

In this post, he transparently talks about his progress over the year and how the newsletter has grown. He also delves into the economics of his newsletter and how much revenue he’s been able to generate.

It’s a fantastic read if you’re interested in side hustles as a developer.

careercutler.substack.com/p/software-engineer-side-hustle-making