How Grab Processes Billions of Events in Real Time

Clean Code's recommendations on how to write readable functions. Plus, a free textbook on data mining, a collection of design patterns for building apps in the cloud and more.

Hey Everyone,

Today we’ll be talking about

  • Clean Code’s recommendations on how to write readable functions.

    • layers of abstraction in your code

    • descriptive function names

    • number of arguments

    • avoiding side effects in your functions

  • How Grab Processes Billions of Events in Real Time

    • Trident is an IFTTT engine that ingests billions of events on the Grab platform and triggers predefined actions based off those events

    • Grab talks about their design goals with Trident and how they implemented it

    • They talk about how they scale the server level and the data storage level of Trident to cope with variable demand

  • Plus some awesome tech snippets on

    • A style guide for better Git commit messages

    • A programmer’s guide to data mining

    • A collection of design patterns for building reliable, scalable and secure applications in the cloud

    • How much it costs to run Lichess (the most popular free chess site on the internet)

Questions? Please contact me at [email protected].

Quastor is a free Software Engineering newsletter that sends out summaries of technical blog posts, deep dives on interesting tech and FAANG interview questions and solutions.

Clean Code - writing readable Functions

Here’s a summary of Clean Code’s advice on writing readable functions.

This advice is geared towards functions written in an OOP language, although many of the concepts carry over to other programming paradigms.

Principle 1 - Small!

The majority of your functions should be less than 15 lines long, and they should hardly ever be more than 20 lines long.

You should always be able to fit the entire function on your monitor. If you can’t then you probably need to break the function up into separate functions.

If you’re following this principle, it means that your function will not be large enough to hold several nested structures. You won’t be able to fit a loop, and then a nested if/else statement inside that loop, etc.

Instead, you should make those nested structures separate function calls. This also adds documentary value since those functions will have nice descriptive names that explain what they do.

Principle 2 - Do One Thing

Another way of stating Principle 1 is that your function should only do one thing.

A good heuristic to use is to see if you can extract some code from your function and turn that code into a new function with a name that is not just a restatement of its implementation.

If you can do that, then that suggests that your function is doing multiple things and you should break it down further.

Principle 3 - One Level of Abstraction Per Function

Your programs can frequently be divided into layers of abstraction, where each layer represents a different model of the same information and processes, but in a different manner.

An example is if you’re developing a program to scrape a website, parse some data and then put that data in a database.

One layer of abstraction is the HTML string you get after sending an HTTP request to the website’s server.

You take that HTML string and send it to your second layer of abstraction, where an HTML parser converts that string into an object where the HTML page is represented as a nested data structure (read up on BeautifulSoup if you’d like a more concrete example). This is the second layer of abstraction.

The third layer of abstraction could be where you extract the specific data you need from the HTML page object and store that data in an array.

The fourth layer of abstraction would be where you write the data in that array to a database.

The statements in a function should all be at the same level of abstraction.

You should not be manipulating the HTML string object and writing to the database in the same function.

Instead, Clean Code suggests The Stepdown Rule, where your code can be read as a top-down narrative.

Every function should be followed by a function at the next layer of abstraction. This way, as you read the program, you’ll be descending down one layer of abstraction at a time as you go through the list of functions.

Principle 4 - Use Descriptive Names

If you’re following the first three principles, then coming up with a name shouldn’t be too difficult.

The smaller and more focused your functions are, the easier it is to come up with a name.

Don’t be afraid to make your names long. A long, descriptive name is much better than a short, ambiguous name. Your editor’s autocomplete function should make it easy to write code with long function names.

Also, use a naming convention that allows multiple words to be easily read in function names. CamelCase is one example.

Principle 5 - Prefer Fewer Function Arguments

The ideal number of arguments for a function is zero. Having more than three arguments for a function should be avoided.

Having lots of function arguments makes it harder to read and understand the function since it forces you to know extra details (and those details are usually from a different layer of abstraction).

Additionally, having lots of function arguments makes it more difficult for you to write tests. You have to write all the test cases to ensure that all the various combinations of arguments work properly!

Principle 6 - Avoid Side Effects

A Side Effect is where your function modifies some state variable outside of its local environment, in addition to returning a value (the intended effect).

An example might be a function that reads a certain value from a database and then returns that value.

If that same function also modifies state in some way (and that modification lasts after the function terminates), then that modification is a side effect.

Clean Code recommends you avoid side effects.

Your function should either change the state of an object(s), or it should return some information. A single function should not do both!

Principle 7 - Use Exceptions, not Error Codes

Your functions should rarely return error codes. Instead, if something goes wrong, you should throw an exception with a well-written error message.

One issue with error codes is that it leads to deeply nested structures. When you return an error code, the caller must deal with the error code immediately.

On the other hand, a caller can separate the exception handler logic from the other code (a try, catch statement with separate try logic and catch logic).

Quastor is a free Software Engineering newsletter that sends out summaries of technical blog posts, deep dives on interesting tech and FAANG interview questions and solutions.

Tech Snippets

  • Cloud Design Patterns - This is a great list of cloud design patterns that you can employ when building cloud applications. Most of the patterns include code samples that show how to implement the pattern on Azure but the patterns are relevant to any distributed system and can be run on any cloud platform.

  • A Programmer’s Guide to Data Mining - This is an awesome textbook that takes a practical approach to learning data mining. It uses a learn-by-doing approach, where you’ll experiment with Python code to learn the techniques in the book. It has chapters on Recommendation Systems, Classification, Bayes, Clustering, and more.The textbook is free. You can read the entire book on the website or download a PDF of the book here.

  • How to write better Git Commit Messages - This is a good article with some practical tips on how to write better commit messages.Some of the suggestions are

    • Use imperative mood in the subject line. So, you should say something like Add fix for dark mode toggle state.

    • Specify the type of commit. Have a consistent set of words to describe your changes. Is it a bugfix? Update? Refactor? etc.

    • Be direct. Eliminate filler words and phrases in your commits. Don’t use words like though, maybe, I think, kind of.

  • It takes $420k per year to run Lichess - Lichess is the most popular open source chess website on the internet with no ads. More than 5 million games are played on Lichess every day and the site can sustain more than 100,000 simultaneous online users.Thibault Duplessis is the founder of Lichess and he recently tweeted that it costs $420,000 per year in total to keep the website running.You can view a full cost breakdown here.

How Grab Processes Billions of Events in Real Time

Grab is the largest transportation and food delivery company in Southeast Asia with more than 25 million monthly users completing ~2 billion transactions per year.

One of the marketing features in the Grab app is to offer real-time rewards whenever a user takes a certain action (or series of actions).

For example, if a user uses the Grab app to get a ride to work in the morning, the app might immediately reward her with a 50% off ride reward that she can use in the evening for the ride back home.

Jie Zhang and Abdullah Al Mamum are two senior software engineers at Grab and they wrote a great blog post on how they process thousands of events every second to send out hundreds of millions of rewards monthly.

Here’s a summary

Grab runs growth campaigns where they’ll reward a user with discounts and perks if the user completes a certain set of actions. Over a typical month, they’ll send out ~500 million rewards and over 2.5 billion messages to their end-users.

Trident is the engine Grab engineers built to handle this workload. It’s an If This, Then That engine which allows Grab’s growth managers to create new promotional campaigns. If a user does this, then award that user with that.

Trident process flow

The Architecture of Trident

Trident architecture

Trident’s architecture was designed with the following goals

  • Independence - Trident must run independently of other backend services, and it should not bring performance impacts to downstream backend services.

  • Robustness - All events must be processed exactly once. No events can be missed and events should not be processed multiple times.

  • Scalability - Trident must be able to scale up processing power when volume on the Grab platform surges.

Whenever a customer uses one of Grab’s products the backend service associated with that product will publish an event to a specific Kafka stream.

Trident subscribes to all the events from these multiple Kafka streams and processes them. By utilizing Kafka streams, Trident is decoupled from the upstream backend services.

Kafka guarantees at-least-once message delivery and then Trident makes sure any duplicate events are filtered out. This gives Trident exactly-once event processing, fulfilling the robustness criteria.

After filtering out duplicates, Trident will process each event and check if it results in any messages/rewards that have to be sent to the user. Trident does this by taking the event and running a rule evaluation process where it checks if the event satisfies any of the pre-defined rules set by the growth campaigns.

All processed events are stored in Redis (for 24 hours) and events that trigger an action are persisted in MySQL as well.

If an action is triggered, Trident will then call the backend service associated with that action. These calls are rate-limited (with tighter limits during peak hours) so that Trident doesn’t accidently DoS attack any of Grab’s downstream backend services.

Scalability

The number of events that Trident has to process can vary widely based on the time of day, day of week and time of year. During the peak of 2020, Trident was processing more than 2,000 events per second.

Grab uses quite a few strategies to make sure Trident can scale properly. The strategies are illustrated in this diagram.

Outline of Trident’s scale strategy

It boils down to two things: scaling the server level and scaling the data store level.

Scaling the Server Level

The source of events for Trident are Kafka streams. Upstream backend services that are handling delivery/taxi orders will publish events to these streams after they handle a user’s request.

Trident can handle increased load (more events coming down the Kafka streams) by

  • Auto-scaling horizontally - Grab can add more server instances to handle Trident’s workload. However, they have to be careful and make sure that load is being distributed evenly across the server instances by matching kafka partitions with the server auto-scaling.

  • Reducing Load - The majority of the processing that the Trident servers are doing is checking to see if the event matches the criteria for any of the campaigns and whether any actions are triggered.Grab engineers sped this process up by prefiltering events. They load active campaigns every few minutes and organize them into an in-memory hashmap with the event type as the key and the list of corresponding campaigns as the value.When processing an event, they can quickly figure out all the possible matching campaigns by first checking in the hash map.

If any actions are triggered, Trident will call downstream backend services to handle them. For example, the GrabRewards service could be called to give a user a free ride.

There are strict rate-limits built in to stop Trident from overwhelming these downstream services during a time of high load.

Scaling the Data Store Level

Trident uses two types of storage: cache storage (Redis) and persistent storage (MySQL and S3).

Scaling cache storage isn’t too complicated since Redis Cluster makes it pretty simple. Engineers can add new shards at any time to spread out the load and data replication prevents any data loss from shard failures.

In terms of persistent storage, Trident has two types of data in terms of access pattern: online data and offline data.

The online data is frequently accessed (so it has to be relatively quick) and medium size (a couple of terabytes). Grab uses MySQL for this data.

The offline data is infrequently accessed but very large in size (hundreds of terabytes generated per day). Grab uses AWS S3 for this.

For the MySQL database, Grab added read replicas that can handle some of the load from the read queries. This relieved more than 30% of the load from the master instance and allows MySQL queries to be performant.

In the future, they plan to vertically partition (split the database up by tables) the single MySQL database into multiple databases based on table usage.

Interview Question

Given an integer array nums and an integer k, return the number of good subarrays of nums.

A good array is an array where the number of different integers in that array is exactly k.

  • For example, [1,2,3,1,2] has 3 different integers: 1, 2, and 3.

A subarray is a contiguous part of an array.

Previous Solution

As a reminder, here’s our last question

Given an integer, convert it to a Roman Numeral.

Example

Input: 3

Output: “III”

Input: 4

Output: “IV”

Input: 1994

Output: “MCMXCIV”

Solution

This question can actually be solved by a pretty elegant solution.

We first put all the possible combinations of roman numeral symbols and their integer values in an ordered dictionary (dictionaries are ordered in Python 3.6+ by default).

We put them in order from largest to smallest.

Then, we iterate through our dictionary and for each integer value, we divide the number by that integer value.

That tells us how many of those roman numerals we will need.

For example, 4000 will be represented as MMMM.

4000 / 1000 = 4, which means 4 Ms.

Then, we can add that number of roman numeral symbols to our character array (that represents the final roman numeral).

After, we can set our number to the remainder that’s left.

Then, we continue on to the next roman numeral, integer value pair in our dictionary.

Here’s the Python 3 code

Quastor is a free Software Engineering newsletter that sends out summaries of technical blog posts, deep dives on interesting tech and FAANG interview questions and solutions