How Zomato built a Petabyte Scale Logging System

We'll talk about building a highly scalable logging system with ClickHouse. Plus, how AWS S3 is showing its age, a guide to pair programming and more.

Hey Everyone!

Today we’ll be talking about

  • How Zomato built a Petabyte Scale Logging System

    • Zomato is the largest food-delivery app in South Asia with 80 million active users

    • They recently switched their logging infrastructure from Elasticsearch to ClickHouse

    • We’ll talk about why they made the switch and how they handle up to 150 million log events per minute

  • Tech Snippets

    • AWS S3 is showing its Age

    • How to get Buy-In with Push and Pull

    • Challenging Software Projects to Try

    • A Guide to Pair Programming

You’ll often hear about the mythical “10x engineer” - the go-to person on the team whenever you need a feature shipped fast. However, 10x engineers aren’t just super-technical; they also have a great sense of what to build.

If you’re working on the wrong feature, then it doesn’t matter how fast you work. The company won’t see a big impact from your work.

Product for Engineers wrote a great article delving into the most impactful engineers and identified six common traits that they share.

Here are a few of the traits.

  1. Always Prototyping and Experimenting - they ship MVPs early and often, iterate quickly based on feedback and aren’t afraid to pivot or kill features that aren’t working.

  2. Are Comfortable Writing - Clear writing skills are a must for documenting features, providing PR feedback, and making big technical decisions with RFCs.

  3. Understand the Broader Context - they understand the organization’s goals and align their decisions/work with the company’s strategy.

For the rest of the traits, check out the Product for Engineers newsletter.

They send out fantastic articles every month to help you develop the skills you need to deliver the most impact (and get promoted faster).

sponsored

How Zomato built a Petabyte Scale Logging System

Zomato is the largest food-delivery app in South Asia, with over 80 million active users across 3,200+ cities. At peak load, the app handles thousands of food orders per minute.

With any app of this scale, observability is crucial for ensuring high availability. One core part of observability is logs.

Zomato’s backend services generate a huge amount of logs: over 50 terabytes of uncompressed log data per day, with a peak rate of 150 million events logged per minute.

Previously, Zomato used Elasticsearch (ELK Stack) to manage these logs but they started running into scaling problems.

As log volume grew exponentially, managing the Elasticsearch clusters became increasingly difficult and costly. They had to over-provision servers to handle variable traffic patterns, which drove up costs while still delivering a poor experience.

To address these challenges, the Zomato team decided to switch to ClickHouse.

We’ll first give a brief introduction to ClickHouse. Then, we’ll delve into some of the strategies the Zomato team used to scale their logging infrastructure.

Introduction to ClickHouse

ClickHouse is a distributed, open-source database that was developed 12 years ago at Yandex (the Google of Russia) to power Metrica (a real-time analytics tool that’s similar to Google Analytics).

For some context, Yandex’s Metrica runs ClickHouse on a cluster of ~375 servers and they store over 20 trillion rows in the database (close to 20 petabytes of total data). So… it was designed with scale in mind.

ClickHouse is column-oriented with a significant emphasis on high write throughput and fast read operations. In order to minimize storage space (and enhance I/O efficiency), the database uses some clever compression techniques. It also supports a SQL-like query language to make it accessible to non-developers (BI analysts, etc.).

You can read about the architectural choices of ClickHouse here.

Zomato’s Logging System Architecture

Here’s the architecture of the setup they used.

Here’s a brief explanation of the diagram

  • Log Collection - Zomato uses Filebeat to collect logs from Docker containers and EC2 instances. These logs are forwarded to Kafka for buffering. Custom-built Golang workers then consume the logs from Kafka and send them to ClickHouse (more on this below; a sketch of such a worker follows this list)

  • Data Storage - ClickHouse ingests and stores the log data. Zomato ran a ClickHouse cluster of 10 m6g.16xlarge AWS EC2 instances (this article is from July 2023, so this exact figure has probably changed).

  • UI App - Zomato built a custom dashboard that developers can use to view and filter the log data. They put a lot of effort into making the dashboard responsive: it has a First Contentful Paint of 0.95 seconds and a Largest Contentful Paint of 1.9 seconds.
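To make the collection pipeline more concrete, here’s a minimal sketch of what the Kafka-consuming half of one of those Golang workers could look like. The article doesn’t say which Kafka client Zomato uses or what their log schema looks like, so the segmentio/kafka-go library, the broker/topic names, and the LogEntry type below are all assumptions for illustration.

```go
package logworker

import (
	"context"
	"encoding/json"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

// LogEntry is a hypothetical shape for a single log line;
// Zomato's actual log schema isn't described in the article.
type LogEntry struct {
	Timestamp time.Time `json:"timestamp"`
	Service   string    `json:"service"`
	Message   string    `json:"message"`
}

// consumeLogs reads log messages from Kafka and forwards them to the
// batching stage over a channel (batching is sketched in the next section).
func consumeLogs(ctx context.Context, entries chan<- LogEntry) {
	// A consumer-group reader lets multiple workers split the topic's partitions.
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka-1:9092", "kafka-2:9092"}, // assumed broker addresses
		GroupID: "clickhouse-log-workers",                 // assumed consumer group
		Topic:   "app-logs",                               // assumed topic name
	})
	defer reader.Close()

	for {
		msg, err := reader.ReadMessage(ctx)
		if err != nil {
			log.Printf("kafka read error: %v", err)
			return
		}
		var entry LogEntry
		if err := json.Unmarshal(msg.Value, &entry); err != nil {
			log.Printf("skipping malformed log line: %v", err)
			continue
		}
		entries <- entry
	}
}
```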

Efficient Insertions with Batching

ClickHouse provides built-in Kafka integrations (such as the Kafka table engine) that let you send Kafka messages directly to your ClickHouse database.

However, the Zomato team decided not to rely on these. Instead, they wanted to batch log messages when inserting them to reduce the overhead on ClickHouse.

Each batch had a maximum size of 20,000 log entries; this kept the delay between generation of a log message and its insertion into ClickHouse to less than 5 seconds.

The Zomato team built Golang workers to handle this task.
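Here’s a minimal sketch of that batching logic, continuing the consumer sketch above. The 20,000-entry and 5-second thresholds come from the article; the clickhouse-go client, the table name, and the columns are assumptions.

```go
package logworker

import (
	"context"
	"log"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

const (
	maxBatchSize  = 20_000          // flush once a batch reaches 20,000 entries...
	flushInterval = 5 * time.Second // ...or after 5 seconds, whichever comes first
)

// batchInsert drains the entries channel (fed by the Kafka consumer above)
// and writes the logs to ClickHouse in batches.
func batchInsert(ctx context.Context, conn driver.Conn, entries <-chan LogEntry) {
	buf := make([]LogEntry, 0, maxBatchSize)
	ticker := time.NewTicker(flushInterval)
	defer ticker.Stop()

	flush := func() {
		if len(buf) == 0 {
			return
		}
		// Assumed table and columns: logs(timestamp, service, message).
		batch, err := conn.PrepareBatch(ctx, "INSERT INTO logs (timestamp, service, message)")
		if err != nil {
			log.Printf("prepare batch failed: %v", err)
			return
		}
		for _, e := range buf {
			if err := batch.Append(e.Timestamp, e.Service, e.Message); err != nil {
				log.Printf("append failed: %v", err)
			}
		}
		if err := batch.Send(); err != nil {
			// A production worker would retry or dead-letter here instead of dropping.
			log.Printf("batch insert failed: %v", err)
		}
		buf = buf[:0]
	}

	for {
		select {
		case e := <-entries:
			buf = append(buf, e)
			if len(buf) >= maxBatchSize {
				flush()
			}
		case <-ticker.C:
			flush()
		case <-ctx.Done():
			flush()
			return
		}
	}
}
```

Flushing on whichever limit is hit first (size or time) is what bounds the worst-case gap between a log being generated and it showing up in ClickHouse to roughly the flush interval.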

Optimizing Searches with Skip Indexes

ClickHouse is extremely efficient when searching data by primary key, but queries involving other conditions can be costly (like any other database).

To minimize full table scans, ClickHouse provides “Skip” indexes that enable the database to skip reading significant chunks of data that are guaranteed to have no matching values.

One example of a skip index is a minmax index, where the database stores the minimum and maximum values of a column for every block of data. When you query on that column, ClickHouse can automatically skip any blocks whose min/max range falls entirely outside the values the query is looking for.

Another type of skip index relies on Bloom filters. A Bloom filter is a probabilistic data structure that can tell you with certainty that an element is not in a set (it may return false positives, but never false negatives).

This helps the database skip data blocks that definitely don’t contain values matching the query. You can read more about ClickHouse’s skip indexes here.
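As an illustration, here’s what a log table using both kinds of skip index might look like, created via the (assumed) clickhouse-go client. This is not Zomato’s actual schema; the column names, index granularity, and Bloom-filter false-positive rate are placeholder choices.

```go
package main

import (
	"context"
	"log"

	"github.com/ClickHouse/clickhouse-go/v2"
)

// Hypothetical log table with a minmax skip index on the timestamp column and a
// Bloom-filter skip index on a trace ID column. This is not Zomato's real schema.
const createLogsTable = `
CREATE TABLE IF NOT EXISTS logs (
    timestamp DateTime,
    service   LowCardinality(String),
    trace_id  String,
    message   String,

    -- minmax: store the min/max timestamp per block of granules, so blocks
    -- entirely outside a time-range filter can be skipped
    INDEX idx_time timestamp TYPE minmax GRANULARITY 4,

    -- bloom_filter: skip blocks that definitely don't contain the trace_id
    -- being searched for (0.01 is the target false-positive rate)
    INDEX idx_trace trace_id TYPE bloom_filter(0.01) GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY (service, timestamp)
`

func main() {
	conn, err := clickhouse.Open(&clickhouse.Options{
		Addr: []string{"clickhouse-1:9000"}, // assumed address
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := conn.Exec(context.Background(), createLogsTable); err != nil {
		log.Fatal(err)
	}
}
```

With these in place, a time-range filter can skip any block whose min/max timestamps don’t overlap the requested window, and a trace_id lookup only reads blocks whose Bloom filter might contain that value.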

Monitoring & Resilience

For monitoring, ClickHouse comes with a lot of embedded instrumentation.

This let the Zomato team collect metrics via Prometheus and visualize them in Grafana.

They monitored health metrics like

  • CPU and memory usage

  • Network usage

  • Data insertion/query times

  • How many insertions were getting rejected

To ensure resiliency, the engineering team implemented load shedding: during periods of stress, they would terminate any queries that consumed excessive resources in order to prioritize lightweight queries. This kept the system available for as many users as possible.
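The article doesn’t detail how this load shedding was implemented. One plausible way to do it, sketched below, is to poll ClickHouse’s system.processes table and kill any query that blows past a time or memory budget; the thresholds and the clickhouse-go usage here are assumptions, not Zomato’s actual mechanism.

```go
package loadshed

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

// shedHeavyQueries kills in-flight queries that exceed a time or memory budget.
// The thresholds are illustrative, not Zomato's actual limits. ClickHouse can
// also enforce per-query limits directly via settings such as max_memory_usage
// and max_execution_time.
func shedHeavyQueries(ctx context.Context, conn driver.Conn, maxElapsed time.Duration, maxMemory int64) error {
	// system.processes lists every currently running query with its resource usage.
	query := fmt.Sprintf(`
		SELECT query_id, elapsed, memory_usage
		FROM system.processes
		WHERE elapsed > %f OR memory_usage > %d`,
		maxElapsed.Seconds(), maxMemory)

	rows, err := conn.Query(ctx, query)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var (
			queryID string
			elapsed float64
			memory  int64
		)
		if err := rows.Scan(&queryID, &elapsed, &memory); err != nil {
			return err
		}
		log.Printf("shedding query %s (elapsed=%.1fs, memory=%d bytes)", queryID, elapsed, memory)
		// KILL QUERY is ClickHouse's built-in statement for terminating a running query.
		kill := fmt.Sprintf("KILL QUERY WHERE query_id = '%s' ASYNC", queryID)
		if err := conn.Exec(ctx, kill); err != nil {
			log.Printf("failed to kill %s: %v", queryID, err)
		}
	}
	return rows.Err()
}
```

In practice you’d run a check like this on a schedule (or lean on ClickHouse’s built-in per-query and per-user limits) so that heavyweight ad-hoc queries can’t crowd out the lightweight dashboard queries most users run.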

Impact

The switch to ClickHouse was crucial in helping Zomato scale their logging infrastructure.

Some achievements include:

  • 99% of queries were answered within 10 seconds

  • Ingestion lag time of less than 5 seconds

  • Over a million dollars per year in cost savings compared to Elasticsearch

For more details, read the full article here.

The fastest way to get promoted is to work on projects that have a big impact on your company. Big impact => better performance review => promotions and bigger bonuses.

But how do you know which work is useful?

The key is combining your abilities as a developer with product skills.

If you have a good sense of product, then you can understand what users want and which features will help the company get more engagement, revenue and profit.

Product for Engineers is a fantastic newsletter that’s dedicated to helping you learn these exact skills.

It’s totally free and they send out curated lessons for developers on areas like

  • How to run successful A/B tests

  • Using Feature Flags to ship faster

  • Startup marketing for engineers

and much more.

sponsored

Tech Snippets