The Architecture of Grab's Data Lake
We'll talk about data storage formats, merge on read, copy on write, and more. Plus, a detailed guide to software architecture documentation, how Figma overhauled their performance testing framework, and other tech snippets.
Hey Everyone!
Today we’ll be talking about
The Architecture of Grab's Data Lake
Introduction to Data Storage Formats
Design Choices when picking a Data Storage Format
High Throughput vs. Low Throughput Data at Grab
Using Avro and Merge on Read for High Throughput Data
Using Parquet and Copy on Write for Low Throughput Data
Tech Snippets
A Detailed Guide to Software Architecture Documentation
The really important job interview questions engineers should ask (but don’t)
How Figma overhauled their Performance Testing Framework
The Architecture of Grab's Data Lake
Grab is one of the largest technology companies in Southeast Asia, with over 35 million monthly users. They run a "super-app" that offers ride-sharing, food delivery, banking, and communication all within a single app.
As you might guess, operating all these services generates a lot of data. Grab's data analysts need to comb through this data for insights that can help the company improve its operations.
The data is primarily stored in a data lake. Some of the data is high-throughput and gets frequent updates (multiple updates every second/minute). Other data is low-throughput and is rarely updated (updated daily/weekly).
Grab needs to store this data efficiently and also allow data analysts to run ad-hoc queries on it without it being too costly.
One crucial design choice is picking the right storage format for the data on the data lake. Choosing the wrong format can make data storage significantly more expensive and make gaining insights from the data more difficult.
We'll first explore data storage formats, discussing the tradeoffs involved and some commonly used technologies.
Then, we'll talk about what Grab chose and why. You can read the full article by Grab here.
Introduction to Data Storage Formats
Let's say you have a bunch of sensor data from a weather satellite. If you need to store it on S3, how would you do it?
Do you just upload the CSV? What if the file is 50 gigabytes and has lots of repeated values? Would you pick a format that compresses the data?
The format you choose for encoding your data is crucial.
Here are some of the tradeoffs you might consider:
Human Readability - It's pretty easy to read a CSV or JSON file; however, these text formats aren't very space-efficient. Compact binary formats like Protobuf are much smaller but not human-readable.
Row vs. Column-oriented - In a row-oriented format, data in the same row is placed next to each other on disk. In a column-oriented format, data in the same column is placed next to each other on disk. This choice has a ton of effects on read/write performance, compression efficiency, and more (see the sketch after this list). We did a deep dive on Row vs. Column-oriented databases that you can read here.
Schema Evolution - How easy is it to change the data schema over time without breaking existing data? Some formats let you add/remove fields while staying compatible with older files.
Compression - How efficiently can you compress the data? Compression can save storage costs and reduce latencies.
Splittability - How easy is it to divide the file into smaller chunks? Splittability is important if you need to process the data in parallel.
Ecosystem Compatibility - Is the format commonly used? Is it well supported by tools like Spark, Redshift, Snowflake, etc.?
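To make the row vs. column tradeoff concrete, here's a small, purely illustrative Python sketch. Real formats like Avro and Parquet add headers, compression, and metadata on top of this idea, but the layout difference is the core of it.

```python
# Three ride bookings, each with a rider_id, city, and fare.
bookings = [
    {"rider_id": 1, "city": "Singapore", "fare": 12.50},
    {"rider_id": 2, "city": "Jakarta",   "fare": 8.75},
    {"rider_id": 3, "city": "Singapore", "fare": 15.00},
]

# Row-oriented layout (Avro-style): each record's fields sit next to each
# other, so appending a whole new record is cheap.
row_layout = [(b["rider_id"], b["city"], b["fare"]) for b in bookings]

# Column-oriented layout (Parquet-style): all values of one column sit next
# to each other, which compresses well (repeated "Singapore") and lets a
# query scan only the columns it needs (e.g. just "fare").
column_layout = {
    "rider_id": [b["rider_id"] for b in bookings],
    "city":     [b["city"] for b in bookings],
    "fare":     [b["fare"] for b in bookings],
}

print(row_layout)
print(column_layout)
```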
Some formats you'll frequently see are listed below, with a short usage sketch after the list.
JSON - this is a text-based format so it's human-readable. Almost every programming language will have support for JSON. I'm assuming you've already used JSON before but you can read more here if you haven't.
CSV - a text format for storing data in a table-like way. Each line corresponds to a row and every value in the line is separated by a comma. CSV is also human-readable but it's not an efficient way to store data.
Avro - a binary data serialization framework developed within the Hadoop ecosystem. Data on disk is row-oriented and compressed. Avro supports robust schema evolution. Learn more here.
Parquet - developed in 2013 as a joint effort between Twitter and Cloudera. Parquet is column-oriented and provides efficient compression on disk. It supports schema evolution so you can add/remove columns without breaking compatibility. Read more here.
ORC - created in 2013 at Facebook. It's optimized for large streaming reads and efficient data compression. Read more.
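Here's a quick sketch of how you might write the same records as Parquet and Avro in Python. The choice of the pyarrow and fastavro libraries is mine for illustration; Grab's article doesn't prescribe specific client libraries.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from fastavro import writer, parse_schema

records = [
    {"rider_id": 1, "city": "Singapore", "fare": 12.50},
    {"rider_id": 2, "city": "Jakarta", "fare": 8.75},
]

# Parquet: column-oriented, compresses well, supports adding/removing columns.
table = pa.Table.from_pylist(records)
pq.write_table(table, "bookings.parquet", compression="snappy")

# Avro: row-oriented, the schema travels with the file, strong schema evolution.
schema = parse_schema({
    "name": "Booking",
    "type": "record",
    "fields": [
        {"name": "rider_id", "type": "int"},
        {"name": "city", "type": "string"},
        {"name": "fare", "type": "double"},
    ],
})
with open("bookings.avro", "wb") as out:
    writer(out, schema, records)
```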
Data Storage Formats at Grab
Grab looked at the characteristics of their data sources and used them to determine the storage formats.
One characteristic they used was throughput.
High Throughput data is updated/changed frequently (several times a second). An example is the stream of booking events from customers who book a ride share.
Low Throughput data is updated infrequently. An example is transaction events generated from a nightly batch process.
High Throughput Data
For High Throughput data, Grab uses Apache Avro with a strategy called Merge on Read (MOR).
Here are the main operations with Merge on Read (a small sketch follows the list):
Write Operations - When data is written, it's appended to the end of a log file. This is much more efficient than merging it into the existing data, and it reduces write latency.
Read Operations - When you need to read data, the base file is combined with the updates in the log file to provide the latest view. This can make reads more costly as you have to merge the updated data.
Periodic Compaction - To prevent reads from becoming too costly, updates in the log files are periodically merged with the base files. This limits the number of past updates you must merge during a read.
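Here's a toy Python sketch of the Merge on Read idea. This is my own simplification, not Grab's implementation; systems like Apache Hudi do this with base files and delta log files per partition.

```python
class MergeOnReadTable:
    """Toy illustration: a base snapshot plus an append-only update log."""

    def __init__(self):
        self.base = {}   # record_key -> record (last compacted state)
        self.log = []    # append-only list of (record_key, record) updates

    def write(self, key, record):
        # Writes are cheap: just append to the log, no merging at write time.
        self.log.append((key, record))

    def read(self, key):
        # Reads are costlier: start from the base, then replay the log so the
        # caller sees the latest version of the record.
        merged = dict(self.base)
        for k, record in self.log:
            merged[k] = record
        return merged.get(key)

    def compact(self):
        # Periodic compaction folds the log into the base so that future
        # reads have fewer updates to merge.
        for k, record in self.log:
            self.base[k] = record
        self.log.clear()

table = MergeOnReadTable()
table.write("booking-42", {"status": "created"})
table.write("booking-42", {"status": "driver_assigned"})
print(table.read("booking-42"))   # {'status': 'driver_assigned'}
table.compact()                   # log folded into the base snapshot
```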
Low Throughput Data
For low throughput data, Grab uses Parquet with Copy on Write (CoW).
Here are the main operations for Copy on Write (a small sketch follows the list):
Write Operations - Whenever there's a write, you create a new version of the file that includes the latest change. You can also keep the previous version for consistency and rollback purposes. This helps prevent data corruption, inconsistent reads, and more.
Read Operations - You read the latest versioned data file. Reads are faster than Merge on Read since there is no merge process.
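And a matching toy sketch of Copy on Write, again my own simplification. In Parquet-backed tables this happens at the file level, where each write rewrites the affected data files into a new version.

```python
class CopyOnWriteTable:
    """Toy illustration: every write produces a new immutable version."""

    def __init__(self):
        self.versions = []   # full snapshots, oldest to newest

    def write(self, updates):
        # Writes are costlier: copy the latest snapshot, apply the change,
        # and keep the old version around for rollback/consistency.
        latest = dict(self.versions[-1]) if self.versions else {}
        latest.update(updates)
        self.versions.append(latest)

    def read(self, key):
        # Reads are cheap: just look at the newest snapshot, no merging.
        return self.versions[-1].get(key) if self.versions else None

table = CopyOnWriteTable()
table.write({"txn-1": {"amount": 20.0}})
table.write({"txn-2": {"amount": 5.5}})
print(table.read("txn-1"))     # {'amount': 20.0}
print(len(table.versions))     # 2 versions kept; older ones allow rollback
```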