How Airbnb Processes a Million User Events Every Second
An introduction to Apache Flink, the Lambda Architecture and the architecture of Airbnb's User Signals Platform. Plus, how Duolingo cut their AWS bill by 20%, Google's state-of-the-art quantum chip and more.
Hey Everyone!
Today we’ll be talking about
How Airbnb Processes a Million User Events Every Second
How Airbnb built their User Signals Platform to track, process and store billions of user interactions
Introduction to the Lambda Architecture
Overview of Apache Flink
The Architecture of Airbnb’s User Signals Platform
Tech Snippets
Breaking down the Browser Rendering Process
How to Maintain Code Quality in the age of AI
How Duolingo cut their Cloud Spend by 20%
Explaining the Modular Monolith Architecture
Google’s State of the Art Quantum Chip
The Architecture of Airbnb’s User Signals Platform
Airbnb is one of the largest travel platforms in the world, with over 200 million active users. When travelers browse through the app, there are millions of properties and destinations that Airbnb can recommend. Small improvements to the recommendation system can result in a huge increase in bookings (and hundreds of millions of dollars in revenue).
To provide the best recommendations, Airbnb needs to keep track of past user actions like viewing listings, favoriting experiences, starting a booking process, etc. This data needs to be processed, cleaned and stored in a database.
The Airbnb team built the User Signals Platform to handle this. It ingests and processes over 1 million user events per second and stores them in a key-value database. The platform serves over 70,000 queries per second to internal teams at Airbnb that need access to this data.
Last week, the Airbnb engineering team published a terrific blog post delving into how they built this platform and the design choices they made.
User Signal Platform Goals
The Airbnb team had quite a few objectives for the User Signals Platform. Some of the goals were:
Ingest Real-time and Historical User Data - The platform should store real-time user engagement data as it occurs, but it should also allow for batch jobs that write historical user engagement data.
Low Latency - Other services at Airbnb will be relying on the User Signals Platform for real-time user engagement data, so the platform should ingest and process new user events in under 1 second.
Asynchronous Computation - Engineers at Airbnb should be able to run asynchronous computation jobs on the data in the User Signals platform to generate deeper insights.
In this article, we’ll talk about the architecture of the User Signals Platform and also delve into the design patterns and technologies Airbnb used.
Introduction to Lambda Architecture
The core design pattern Airbnb used for their platform is the Lambda Architecture. The Lambda architecture is composed of two layers:
Speed/Streaming Layer (Real-Time): Processes streaming data as it arrives, delivering low-latency, up-to-date results. Airbnb implements this with Apache Flink and achieves latencies less than one second.
Batch Layer (Offline): Periodically processes large volumes of historical data to generate more accurate or corrected views. The batch layer ensures long-term accuracy and handles late-arriving data or retrospective fixes. The batch layer will typically operate on a longer timescale, updating views every few hours.
By combining these two layers, the Lambda architecture provides the best of both worlds. The speed layer ensures fresh, low-latency data for online queries and personalization. The batch layer ensures correctness, allowing retrospective updates and improvements to data quality.
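The interplay between the two layers can be sketched in a few lines of Python. This is a minimal toy model (the event log, views and per-user counts are all illustrative, not Airbnb's actual schema): the speed layer updates a view incrementally as events arrive, the batch layer periodically recomputes from the full history, and queries merge the two.

```python
from collections import defaultdict

# Hypothetical event shape: (user_id, event_type)
event_log = []                    # durable log of all raw events (e.g. Kafka + long-term storage)
speed_view = defaultdict(int)     # speed layer: updated per event, may drift over time
batch_view = {}                   # batch layer: periodically recomputed from the full log

def speed_layer_ingest(event):
    """Update the real-time view the moment an event arrives."""
    event_log.append(event)
    user_id, _ = event
    speed_view[user_id] += 1

def batch_layer_recompute():
    """Recompute the view from the full history, correcting any drift,
    then reset the speed layer to track only post-batch events."""
    global batch_view
    counts = defaultdict(int)
    for user_id, _ in event_log:
        counts[user_id] += 1
    batch_view = dict(counts)
    speed_view.clear()

def query(user_id):
    """Serve the merged view: accurate batch result + fresh speed delta."""
    return batch_view.get(user_id, 0) + speed_view.get(user_id, 0)

for e in [("u1", "view_listing"), ("u1", "favorite"), ("u2", "search")]:
    speed_layer_ingest(e)
batch_layer_recompute()
speed_layer_ingest(("u1", "start_booking"))
print(query("u1"))  # → 3: two events from the batch view, one fresh from the speed layer
```

The key design point is that the speed layer is allowed to be slightly wrong; the batch layer's periodic recomputation is the source of truth that corrects it.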
Introduction to Apache Flink
The core technology Airbnb used for their User Signals Platform is Apache Flink, an open source engine built for processing real-time data with very low latency.
Prior to Flink, data processing systems would rely on “micro-batching” to approximate real-time processing: collect data over a small, fixed window (every few seconds or minutes) and then process that window as a batch job.
On the other hand, Flink takes an event-driven approach. Instead of waiting for a batch window to fill up, Flink processes each event as soon as it arrives. This results in much lower latencies.
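Here's a minimal sketch of the difference (plain Python, not Flink's API; the consumer functions and batch interval are illustrative). Both approaches produce the same results; what changes is when each result becomes available.

```python
import time

def process(event):
    """Stand-in for whatever transformation the pipeline applies."""
    return event.upper()

def micro_batch_consumer(events, batch_interval=0.05):
    """Micro-batching: buffer events and process the buffer on a timer.
    Worst-case latency is roughly one full batch_interval plus processing time."""
    buffer, results = [], []
    deadline = time.monotonic() + batch_interval
    for event in events:
        buffer.append(event)
        if time.monotonic() >= deadline:
            results.extend(process(e) for e in buffer)
            buffer.clear()
            deadline = time.monotonic() + batch_interval
    results.extend(process(e) for e in buffer)  # flush whatever is left
    return results

def event_driven_consumer(events):
    """Event-driven (Flink-style): handle each event the moment it arrives,
    so latency is just the per-event processing time."""
    return [process(e) for e in events]
```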
Another benefit of Flink is that it is stateful. Traditional data processing systems might require an external database to maintain state across events. Flink integrates state management directly into the engine, allowing it to remember, accumulate, and update contextual information as events stream in. This is great if you want to do operations like aggregations or joins across your messages.
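To make "stateful" concrete, here's a toy version of engine-managed keyed state (loosely analogous to Flink's per-key state, but just plain Python; the event shape is made up): the operator remembers a running count per user across events, with no external database involved.

```python
from collections import defaultdict

class KeyedCounter:
    """Minimal stand-in for engine-managed keyed state: the operator
    accumulates context per key as events stream in."""
    def __init__(self):
        self.state = defaultdict(int)   # keyed state: user_id -> running count

    def on_event(self, event):
        user_id = event["user_id"]
        self.state[user_id] += 1        # read-modify-write the local state
        return {"user_id": user_id, "views_so_far": self.state[user_id]}

op = KeyedCounter()
out = [op.on_event(e) for e in [
    {"user_id": "u1", "type": "view_listing"},
    {"user_id": "u2", "type": "view_listing"},
    {"user_id": "u1", "type": "view_listing"},
]]
print(out[-1])  # → {'user_id': 'u1', 'views_so_far': 2}
```

In a real engine this state is also checkpointed for fault tolerance, which is what makes aggregations and joins across a stream practical.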
Other benefits of Apache Flink are:
Fault Tolerance - Flink provides checkpointing mechanisms to ensure that, if a job or node fails, the system can recover to a previously consistent state. This gives exactly-once state consistency: each event is reflected in the application’s state exactly once, even in the face of failures.
Understanding Event-Time - Flink understands the concept of “event time” (the time when an event actually occurred) instead of just processing time (when the event is processed by the system). This makes it much easier to handle out-of-order events or late-arriving data accurately.
Integration with the Ecosystem - Flink is widely used and comes with connectors to all the other data tools you might be using (Kafka, Postgres, S3, etc.)
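The event-time idea above is easiest to see in code. This is a simplified sketch (plain Python, not Flink; the window size, lateness bound and watermark rule are illustrative): events are bucketed into tumbling windows by when they *happened*, and a watermark trailing the highest event time seen decides when a window is complete, so out-of-order arrivals still land in the right window.

```python
from collections import defaultdict

WINDOW = 60      # tumbling window size, in seconds of event time
LATENESS = 10    # how long to keep a window open for out-of-order events

def tumbling_event_time_counts(event_timestamps):
    """Count events per 60s window keyed by event time (when each event
    occurred), not processing time (when we happened to see it)."""
    open_windows = defaultdict(int)
    closed = {}
    watermark = float("-inf")
    for ts in event_timestamps:             # events may arrive out of order
        open_windows[ts // WINDOW] += 1     # bucket by event time
        watermark = max(watermark, ts - LATENESS)
        # A window is complete once the watermark passes its end.
        for w in [w for w in open_windows if (w + 1) * WINDOW <= watermark]:
            closed[w] = open_windows.pop(w)
    closed.update(open_windows)             # flush still-open windows at the end
    return closed

# 30 arrives after 50 but still lands in window 0, where it belongs
print(tumbling_event_time_counts([10, 50, 30, 95, 130]))  # → {0: 3, 1: 1, 2: 1}
```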
Architecture of Airbnb’s User Signals Platform
Here’s the architecture of Airbnb’s User Signals Platform:
As mentioned earlier, it’s based on the Lambda Architecture, so it consists of a real-time ingestion layer and a batch layer.
Here are the steps:
User Events: Guests interacting with Airbnb’s apps generate raw events when they view properties, add an experience to their wishlist, search for “rooms in London”, etc.
Real-Time Transformation (Speed Layer): Events flow into Kafka, where Flink jobs consume and transform them into “User Signals.” Some transformations are just simple mappings from raw events, while others may require joining multiple events based on user ID to create richer signals.
KV Storage and Serving: The transformed User Signals are stored in a Key-Value store with append-only writes. Using append-only writes helps ensure idempotency and makes data operations much simpler.
Batch Processing (Batch Layer): Periodic batch jobs will reprocess the historical data sets and identify any discrepancies or missed events from the speed layer. They’ll backfill the missing/incorrect data to ensure long-term data accuracy and consistency.
Asynchronous Computations: In addition to the immediate user signals that are stored in the KV store, Airbnb has Flink jobs that consume the new user signals to generate more insights. These jobs do things like categorize users into segments or group a single user’s actions into “sessions” to get a better understanding of the user’s intent. These jobs are run asynchronously.
Online Queries and Services: The User Signals Platform (USP) service gives downstream services at Airbnb a way to query the user signals data for their own products and insights.
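The append-only write pattern from step 3 is worth a closer look. Here's a minimal sketch of why it buys idempotency (the store, key scheme and signal shape are all illustrative, not Airbnb's actual design): each event is written under its own immutable key, so replaying the same event, whether from a Flink restart or a batch backfill, rewrites identical data instead of corrupting a mutable row.

```python
kv = {}   # stand-in for the key-value store

def write_signal(user_id, event_id, signal):
    """Append-only write: one immutable key per event. Writing the same
    event twice puts the same value at the same key, so retries and
    backfills are safe (idempotent) by construction."""
    kv[(user_id, event_id)] = signal

def read_signals(user_id):
    """Serve a user's signals by scanning their key prefix."""
    return [v for (uid, _), v in kv.items() if uid == user_id]

write_signal("u1", "e1", {"type": "view_listing", "listing": 42})
write_signal("u1", "e1", {"type": "view_listing", "listing": 42})  # replay: no-op
write_signal("u1", "e2", {"type": "search", "query": "rooms in London"})
print(len(read_signals("u1")))  # → 2, despite the duplicate write
```

Contrast this with an update-in-place design (e.g. incrementing a counter), where a replayed event would double-count and the batch layer would have to reconcile much messier state.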
Results
With this setup, the User Signals Platform processes over 1 million events per second across 100+ Flink jobs, and the USP service serves over 70,000 queries per second to teams and services across Airbnb.