How Booking.com scaled their Customer Review System

Plus, how to improve your ability to focus and a look at the history of the Java ecosystem

Hey Everyone!

Today we’ll be talking about

  • How Booking.com Scaled Their Customer Review System

    • Booking.com sharded their customer reviews across a cluster of servers. In order to scale this system, they needed to add more machines to the cluster.

    • They talk about how they made it easy to scale the system up with Jump Consistent Hash, a consistent hashing algorithm developed at Google.

    • We'll talk about the process they use to scale up the cluster to avoid any consistency/routing issues.

  • How to improve your Focus

    • Andrew Huberman is a professor at the Stanford School of Medicine and he runs a great podcast where he gives actionable tips on how to improve your health. In this episode, he talks about how to improve your ability to focus.

    • Using Binaural Beats to get into flow

    • Working in 90-minute blocks with a 10-20 minute break where you deliberately defocus

  • Tech Snippets

    • A Look at the History of the Java Ecosystem (and why it's still champ)

    • How Pinterest Improved their Android Video Player

    • Coding Interview Roadmap with specific LeetCode questions and solutions

    • Best Practices for Pull Request Creation and Feedback

It’s About Time! Use InfluxDB’s Time Series Database As A Service

InfluxDB is a developer favorite and the #1 time series database according to DB-Engines.

With serverless InfluxDB Cloud, you can build software on InfluxDB without provisioning infrastructure and managing clusters, and easily collect metrics with Telegraf or pre-configured templates.

Try it for free on AWS, Azure, or GCP!

sponsored

How Booking.com Scaled their Customer Review System

Booking.com is one of the largest online travel agencies in the world; they booked 15 million airplane tickets and 600 million hotel room nights through the app/website in 2021.

Whenever you use Booking.com to book a hotel room/flight, you’re prompted to leave a customer review. So far, they have nearly 250 million customer reviews that need to be stored, aggregated and filtered through.

Storing this many reviews on a single machine is not possible due to the size of the data, so Booking.com has partitioned this data across several shards. They also run replicas of each shard for redundancy and have the entire system replicated across several availability zones.

Sharding is done based on a field of the data, called the partition key. For this, Booking.com uses the internal ID of an accommodation. The hotel/property/airline’s internal ID would be used to determine which shard its customer reviews would be stored on.

A basic way of doing this is with the modulo operator.

If the accommodation internal ID is 10125 and you have 90 shards in the cluster, then customer reviews for that accommodation would go to shard 45 (equal to 10125 % 90).
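
In code, that routing decision is a one-liner (a minimal sketch in Python, using the numbers from the example above):

    NUM_SHARDS = 90

    def shard_for(accommodation_id: int) -> int:
        # All customer reviews for an accommodation live on the shard its ID maps to
        return accommodation_id % NUM_SHARDS

    shard_for(10125)  # -> 45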

Challenges with Scaling

The challenge with this sharding scheme comes when you want to add or remove machines from your cluster (change the number of shards).

Booking.com expected a huge boom in users during the summer as it was the first peak season post-pandemic. They forecasted that they would be seeing some of the highest traffic ever and they needed to come up with a scaling strategy.

However, adding new machines to the cluster changes the modulo result for most keys, so the vast majority of the data has to be rearranged across the shards.

Let’s go back to our example with internal ID 10125. With 90 shards in our cluster, that accommodation would get mapped to shard 45. If we add 10 shards to our cluster, then that accommodation will now be mapped to shard 25 (equal to 10125 % 100).

This process is called resharding, and it’s quite complex with our current scheme. You have a lot of data being rearranged and you’ll have to deal with issues around ambiguity during the resharding process. Your routing layer won’t know if the 10125 accommodation was already resharded (moved to the new shard) and is now on shard 25 or if it’s still stuck in the processing queue and its data is still located on shard 45.

The solution to this is a family of algorithms called Consistent Hashing. These algorithms minimize the number of keys that need to be remapped when you add or remove shards from the system.

ByteByteGo did a great video on Consistent Hashing (with some awesome visuals), so I’d highly recommend watching that if you’re unfamiliar with the concept. Their explanation was the clearest out of all the videos/articles I read on the topic.

Using the Jump Hash Sharding Algorithm

For their sharding scheme, Booking now uses the Jump Hash Sharding algorithm, a consistent hashing algorithm that was created at Google. It’s extremely fast, takes minimal memory and is simple to implement (can be expressed in 5 lines of code).
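
The algorithm comes from Lamping and Veale's paper "A Fast, Minimal Memory, Consistent Hash Algorithm". Here's the published algorithm translated from the paper's C++ into Python (this is the reference algorithm, not Booking.com's internal code; the bit-masking just emulates 64-bit unsigned overflow):

    def jump_consistent_hash(key: int, num_buckets: int) -> int:
        # Returns a bucket (shard) number in the range [0, num_buckets)
        b, j = -1, 0
        while j < num_buckets:
            b = j
            # 64-bit linear congruential step
            key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
            j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
        return b

In the modulo sketch above, you'd simply swap accommodation_id % NUM_SHARDS for jump_consistent_hash(accommodation_id, NUM_SHARDS).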

With Jump Hash, Booking.com can rely on a property called monotonicity. This property states that when you add new shards to the cluster, data will only move from old shards to new shards; thus there is no unnecessary rearrangement.

With the previous scheme, we had 90 shards at first (labeled shard 1 to 90) and then added 10 new shards (labeled 91 to 100).

Accommodation ID 10125 was getting remapped from shard 45 to shard 25; it was getting moved from one of the old shards in the cluster to another old shard. This data transfer is pointless and doesn’t benefit end users.

What you want is monotonicity, where data is only transferred from shards 1-90 onto the new shards 91 - 100. This data transfer serves a purpose because it balances the load between the old and new shards so you don’t have hot/cold shards.
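
You can check the monotonicity property directly with the jump_consistent_hash sketch above. Growing from 90 to 100 buckets should move roughly 10% of the keys, and every key that moves should land on one of the 10 new buckets (note the sketch numbers buckets 0-99 rather than 1-100):

    pairs = [(jump_consistent_hash(k, 90), jump_consistent_hash(k, 100))
             for k in range(100_000)]
    moved = [(old, new) for old, new in pairs if old != new]

    assert all(new >= 90 for _, new in moved)  # remapped keys only land on the new shards
    print(len(moved) / len(pairs))             # roughly 0.10 of the keys move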

The Process for Adding New Shards

Booking.com set a clear process for adding new shards to the cluster.

They provision the new hardware and then have coordinator nodes figure out which keys will be remapped to the new shards and load those keys onto them.

The resharding process begins and the old accommodation IDs are transferred over to the new shards, but the remapped keys are not deleted from the old shards during this process.

This allows the routing layer to ignore the resharding and continue directing traffic for remapped accommodation IDs to their old locations.

Once the resharding process is complete, the routing layer is made aware and it will start directing traffic to the new shards (the remapped accommodation ID locations).

Then, the old shards can be updated to delete the keys that were remapped.
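
Put together, the routing layer's view of a scale-up looks roughly like this (a simplified sketch; the shard-count arguments and the resharding_complete flag are illustrative names, not Booking.com's actual interface):

    def route(accommodation_id: int, old_shard_count: int, new_shard_count: int,
              resharding_complete: bool) -> int:
        # Until the copy finishes, keep routing with the old shard count; remapped
        # keys still exist on their old shards, so reads stay correct.
        # After the flip, keys that didn't move resolve to the same shard either
        # way, and remapped keys are served from the new shards.
        shard_count = new_shard_count if resharding_complete else old_shard_count
        return jump_consistent_hash(accommodation_id, shard_count)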

For more details, you can read the full blog post here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.

How PagerDuty minimizes MTTR and improves SLAs with InfluxDB

PagerDuty wanted to minimize the MTTR (Mean Time to Resolution) of any IT issues/incidents that occurred with their product to ensure they met their customer SLAs.

To do this, they leveraged the InfluxDB time series platform to build process automation tooling to handle the time series data that tracks infrastructure performance. They broke their process down into several stages: a monitoring stage, an incident management stage, and a runbook execution stage.

For monitoring, they use Telegraf, an open source server agent written in Go that compiles to a single binary with no external dependencies and minimal memory footprint.

The Telegraf instances send data to InfluxDB, which processes that high-volume time series data and intelligently dispatches triggers and alerts. These go to PagerDuty, which powers the incident management and runbook execution stages.

sponsored

Tech Snippets

  • Why Java is still Champ - Regardless of what ranking system you look at, Java has been one of the most popular programming languages for over 25 years now. The ReadME project wrote a fantastic article going through the history of Java and some of the bumps and turns that happened along the way. Things like Java 9, OpenJDK, other JVM languages (Kotlin, Scala) and more. It's a fantastic overview of the Java ecosystem and how the language has stayed relevant for so long.

  • Coding Interview Roadmap - If you're preparing for coding interviews, this is a great roadmap that you can use to build your study plan. It gives an ordering (a DAG to be more precise) of the topics you need to study, what order to practice them in, and which LeetCode problems to solve. All the problems listed also come with a free video solution on YouTube by NeetCode.

  • Best Practices for Pull Request Creation and Feedback - This is a useful blog post on the DoorDash Engineering blog with tips on how to minimize the number of bad PRs in your development cycle. PRs should be short and have a clear title and description. Having short PRs helps programmers get feedback on their code more quickly, so it reduces the number of rewrites. It also reduces friction in the code review process (you don't want to spend 2 hours reading someone else's PR).

  • How Pinterest improved their Android Video Player - Pinterest engineering wrote a fantastic blog post on how they improved the video playback experience in the Pinterest Android app. They reduced startup latency by warming the network connection with a dummy HTTP HEAD request. They changed configuration around the video buffering durations and aspect ratios and also made changes to their caching system.

Note - I'm experimenting with adding some new content that isn't software engineering specific, but still helpful to developers. I'd love to hear your feedback!

Please reply if you'd like me to continue adding summaries like this, or if you'd prefer if I stick to software engineering articles. Thanks!

How to Improve Your Ability to Focus

Andrew Huberman is a researcher at Stanford University and he has an amazing podcast (Huberman Lab) where he goes through current research on how you can live a happier, healthier and more productive life.

Being a programmer requires an ability to focus for long periods of time (get into a flow), so I found his podcast episode - Focus Toolkit: Tools to Improve Your Focus & Concentration - to be very useful.

Here's a quick summary of the tips he mentioned (based on peer-reviewed scientific literature).

  • Binaural Beats - There are playlists available on YouTube (or free apps if you google) that will play Binaural Beats (where you listen to two tones with slightly different frequencies at the same time). 40 Hz binaural beats have been shown to improve focus, attention and memory retention in a number of peer reviewed studies. Huberman recommends listening to 5-10 minutes of Binaural Beats prior to when you start your task to make it easier to get into a state of flow.

  • 90 Minutes - The ideal duration for focused sessions is 90 minutes or less. Past that, fatigue begins to set in and the amount of focus people are able to dedicate begins to drop off. Therefore, Huberman sets a timer for 90 minutes when he begins a focused task and stops after that.

  • Defocus - After the 90-minute (or shorter) focus session, you should spend 10-20 minutes where you deliberately defocus and give your brain/body a chance to rest. During this time, you should avoid focusing on any single thing (so avoid using your phone) and can work on menial tasks where your mind can wander (take a short walk, do the dishes, do the laundry, etc.). This is the best way to recharge for the next 90-minute focus session.

  •  Visual Field - A great deal of our cognitive focus is directed by our visual system. If you focus your eyes on a pen, you'll naturally start to focus on it and notice details about the pen. Cognitive focus tends to follow overt visual focus. Therefore, you can help ease yourself into a focused state by picking something in your room (part of the wall, an object, etc.) and staring at that object for 30 seconds to a few minutes (blinking is fine, don't try to force your eyes open). This helps you get into a focused state, and you can redirect your focus to your task after the 30 seconds is up.

These are a few of the tips Huberman mentions.

In the podcast, he also talks about using supplements like coffee, EPA, creatine and more.

How did you like this summary?

Your feedback really helps me improve curation for future emails. Thanks!

How Grab Processes Billions of Events in Real Time

Grab is the largest transportation and food delivery company in Southeast Asia with more than 25 million monthly users completing ~2 billion transactions per year.

One of the marketing features in the Grab app is to offer real-time rewards whenever a user takes a certain action (or series of actions).

For example, if a user uses the Grab app to get a ride to work in the morning, the app might immediately reward her with a 50% off ride reward that she can use in the evening for the ride back home.

Jie Zhang and Abdullah Al Mamum are two senior software engineers at Grab and they wrote a great blog post on how they process thousands of events every second to send out hundreds of millions of rewards monthly.

Here’s a summary

Grab runs growth campaigns where they’ll reward a user with discounts and perks if the user completes a certain set of actions. Over a typical month, they’ll send out ~500 million rewards and over 2.5 billion messages to their end-users.

Trident is the engine Grab engineers built to handle this workload. It’s an If This, Then That engine that allows Grab’s growth managers to create new promotional campaigns: if a user does this, then give that user that reward.

Trident process flow

A growth campaign configured in Trident

The Architecture of Trident

Trident architecture

Trident’s architecture was designed with the following goals

  • Independence - Trident must run independently of other backend services, and it must not impact the performance of downstream backend services.

  • Robustness - All events must be processed exactly once. No events can be missed and events should not be processed multiple times.

  • Scalability - Trident must be able to scale up processing power when volume on the Grab platform surges.

Whenever a customer uses one of Grab’s products, the backend service associated with that product publishes an event to a specific Kafka stream.

Trident subscribes to all the events from these multiple Kafka streams and processes them. By utilizing Kafka streams, Trident is decoupled from the upstream backend services.

Kafka guarantees at-least-once message delivery and then Trident makes sure any duplicate events are filtered out. This gives Trident exactly-once event processing, fulfilling the robustness criteria.

After filtering out duplicates, Trident will process each event and check if it results in any messages/rewards that have to be sent to the user. Trident does this by taking the event and running a rule evaluation process where it checks if the event satisfies any of the pre-defined rules set by the growth campaigns.

All processed events are stored in Redis (for 24 hours) and events that trigger an action are persisted in MySQL as well.
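
Here's a minimal sketch of that dedup step, assuming the redis-py client and a unique per-event ID (the key format and the evaluate_campaign_rules function are made up for illustration; the 24-hour retention is from the post):

    import redis

    r = redis.Redis()
    DEDUP_TTL_SECONDS = 24 * 60 * 60  # processed events are kept for 24 hours

    def handle_event(event_id: str, payload: dict) -> None:
        # SET ... NX only succeeds on the first delivery, so duplicates from
        # Kafka's at-least-once delivery get dropped here.
        if not r.set(f"event:{event_id}", 1, nx=True, ex=DEDUP_TTL_SECONDS):
            return  # already processed
        evaluate_campaign_rules(payload)  # hypothetical rule-evaluation step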

If an action is triggered, Trident will then call the backend service associated with that action. These calls are rate-limited (with tighter limits during peak hours) so that Trident doesn’t accidentally DoS any of Grab’s downstream backend services.
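
A token bucket is one common way to implement this kind of client-side limit (a sketch; the actual rates and the peak-hour tightening are whatever Grab has configured, which isn't shown here):

    import time

    class TokenBucket:
        def __init__(self, rate_per_sec: float, burst: int):
            self.rate, self.capacity = rate_per_sec, burst
            self.tokens, self.last = float(burst), time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill based on elapsed time, capped at the bucket's capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # caller should back off before calling the downstream service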

Scalability

The number of events that Trident has to process can vary widely based on the time of day, day of week and time of year. During the peak of 2020, Trident was processing more than 2,000 events per second.

Grab uses quite a few strategies to make sure Trident can scale properly. The strategies are illustrated in this diagram.

Outline of Trident’s scale strategy

It boils down to two things: scaling the server level and scaling the data store level.

Scaling the Server Level

The source of events for Trident is Kafka streams. Upstream backend services that handle delivery/taxi orders publish events to these streams after they serve a user’s request.

Trident can handle increased load (more events coming down the Kafka streams) by

  • Auto-scaling horizontally - Grab can add more server instances to handle Trident’s workload. However, they have to be careful to distribute the load evenly across the server instances by matching Kafka partitions with the server auto-scaling.

  • Reducing Load - The majority of the processing that the Trident servers do is checking whether an event matches the criteria for any of the campaigns and whether any actions are triggered. Grab engineers sped this process up by prefiltering events: they load the active campaigns every few minutes and organize them into an in-memory hashmap with the event type as the key and the list of corresponding campaigns as the value. When processing an event, they can quickly find all the possible matching campaigns by first checking the hash map (see the sketch below).
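
Here's roughly what that prefiltering index could look like (a sketch; Campaign, trigger_event_types and event.type are made-up names standing in for Grab's actual data model):

    from collections import defaultdict

    def build_campaign_index(active_campaigns):
        # Rebuilt every few minutes as campaigns are created or expire
        index = defaultdict(list)
        for campaign in active_campaigns:
            for event_type in campaign.trigger_event_types:  # hypothetical field
                index[event_type].append(campaign)
        return index

    def candidate_campaigns(index, event):
        # Only these campaigns need the full rule evaluation for this event
        return index.get(event.type, [])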

If any actions are triggered, Trident will call downstream backend services to handle them. For example, the GrabRewards service could be called to give a user a free ride.

There are strict rate-limits built in to stop Trident from overwhelming these downstream services during a time of high load.

Scaling the Data Store Level

Trident uses two types of storage: cache storage (Redis) and persistent storage (MySQL and S3).

Scaling cache storage isn’t too complicated since Redis Cluster makes it pretty simple. Engineers can add new shards at any time to spread out the load and data replication prevents any data loss from shard failures.

In terms of persistent storage, Trident has two types of data in terms of access pattern: online data and offline data.

The online data is frequently accessed (so it has to be relatively quick) and medium size (a couple of terabytes). Grab uses MySQL for this data.

The offline data is infrequently accessed but very large in size (hundreds of terabytes generated per day). Grab uses AWS S3 for this.

For the MySQL database, Grab added read replicas that handle some of the load from read queries. This relieved more than 30% of the load from the master instance and keeps MySQL queries performant.

In the future, they plan to vertically partition (split the database up by tables) the single MySQL database into multiple databases based on table usage.

Interview Question

Given an integer array nums and an integer k, return the number of good subarrays of nums.

A good array is an array where the number of different integers in that array is exactly k.

  • For example, [1,2,3,1,2] has 3 different integers: 1, 2, and 3.

A subarray is a contiguous part of an array.
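
If you'd like a hint: one common approach is a sliding window, using the fact that the number of subarrays with exactly k distinct integers equals the number with at most k distinct minus the number with at most k-1 distinct. A sketch in Python:

    def count_good_subarrays(nums: list[int], k: int) -> int:
        def at_most(limit: int) -> int:
            counts, left, total = {}, 0, 0
            for right, value in enumerate(nums):
                counts[value] = counts.get(value, 0) + 1
                while len(counts) > limit:
                    counts[nums[left]] -= 1
                    if counts[nums[left]] == 0:
                        del counts[nums[left]]
                    left += 1
                total += right - left + 1  # subarrays with <= limit distinct ints ending at right
            return total
        return at_most(k) - at_most(k - 1)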