How Airbnb Processes a Million User Events Every Second
An introduction to Apache Flink, the Lambda Architecture and the Architecture of Airbnb's platform. Plus, how Duolingo cut their AWS bill by 20%, Google's State of the Art Quantum chip and more.
Hey Everyone!
Today we’ll be talking about
How Airbnb Processes a Million User Events Every Second
How Airbnb built the User Signals Platform to track, process and store billions of user interactions
Introduction to the Lambda Architecture
Overview of Apache Flink
The Architecture of Airbnb’s User Signals Platform
Tech Snippets
Breaking down the Browser Rendering Process
How to Maintain Code Quality in the age of AI
How Duolingo cut their Cloud Spend by 20%
Explaining the Modular Monolith Architecture
Google’s State of the Art Quantum Chip
One of the hardest decisions you’ll have to make is around what technologies your team adopts. A wrong decision can be extremely costly and take years to reverse. On the other hand, not making a decision can be just as costly (lost revenue, poor developer productivity, etc.)
Product for Engineers wrote a fantastic blog post on their advice for choosing technologies to adopt.
Some of their tips include
Prioritize based on set Criteria - There will always be some shiny new toy that your team can adopt. Instead, prioritize based on problems your team is facing. This can be excessive costs, scaling challenges, or a customer need.
Mimic the Real World when Evaluating - The engineers who will be using the technology should have significant sway in the decision. They should be able to test the technology in production (safely) and build proof of concepts before deciding.
Ensure you consider technical AND business factors - You should talk to all stakeholders and agree on the evaluation criteria up front. Some potential criteria include performance, cost, reliability, support, flexibility and more.
Subscribe to Product for Engineers for the rest of their tips on picking technologies. It’s free!
sponsored
The Architecture of Airbnb’s User Signals Platform
Airbnb is one of the largest travel platforms in the world with over 200 million active users. When travelers browse through the app, there are millions of properties/destinations that Airbnb can recommend. Small improvements in their recommendation system can result in a huge increase in bookings (and hundreds of millions of dollars in revenue).
To provide the best recommendations, Airbnb needs to keep track of past user actions like viewing listings, favoriting experiences, starting a booking process, etc. This data needs to be processed, cleaned and stored in a database.
The Airbnb team built the User Signals Platform to handle this. It ingests and processes over 1 million user events per second and stores them in a key-value database. The platform serves over 70k queries per second to other internal teams at Airbnb that need access to this data.
Last week, the Airbnb engineering team published a terrific blog post delving into how they built this platform and the design choices they made.
User Signals Platform Goals
The Airbnb team had quite a few objectives for the User Signals Platform. Some of the goals were:
Ingest Real-time and Historical User Data - The platform should store real-time user engagement data as it occurs, but it should also allow for batch jobs that write historical user engagement data.
Low Latency - Other services at Airbnb will be relying on the User Signals Platform for real-time user engagement data, so the platform should ingest and process new user events in under 1 second.
Asynchronous Computation - Engineers at Airbnb should be able to run asynchronous computation jobs on the data in the User Signals Platform to generate deeper insights.
In this article, we’ll talk about the architecture of the User Signals Platform and also delve into the design patterns and technologies Airbnb used.
Introduction to Lambda Architecture
The core design pattern Airbnb used for their platform is the Lambda Architecture. The Lambda Architecture is composed of two data processing layers (their output is then exposed to queries through a serving layer):
Speed/Streaming Layer (Real-Time): Processes streaming data as it arrives, delivering low-latency, up-to-date results. Airbnb implements this with Apache Flink and achieves latencies less than one second.
Batch Layer (Offline): Periodically processes large volumes of historical data to generate more accurate or corrected views. The batch layer ensures long-term accuracy and handles late-arriving data or retrospective fixes. The batch layer will typically operate on a longer timescale, updating views every few hours.
By combining these two layers, the Lambda architecture provides the best of both worlds. The speed layer ensures fresh, low-latency data for online queries and personalization. The batch layer ensures correctness, allowing retrospective updates and improvements to data quality.
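To make the pattern concrete, here’s a minimal sketch of how a query might combine the two layers. This is a generic illustration of the Lambda Architecture in plain Python, not Airbnb’s implementation. The function name, user IDs and counts are all made up for the example.

```python
from collections import Counter


def merge_views(batch_view, speed_view):
    """Combine the batch view with the speed view at query time.

    batch_view: counts recomputed from historical data (accurate, but stale)
    speed_view: counts for events that arrived after the last batch run
    """
    merged = Counter(batch_view)
    merged.update(speed_view)  # Counter.update adds counts together
    return dict(merged)


# Hypothetical data: number of listing views per user.
batch_view = {"user_1": 40, "user_2": 12}  # covers everything up to the last batch run
speed_view = {"user_1": 3, "user_3": 1}    # covers events since the last batch run

serving_view = merge_views(batch_view, speed_view)
print(serving_view)
```

When the next batch run finishes, its output replaces the batch view and the speed view is reset for the events that the batch now covers. That’s how mistakes made in the speed layer eventually get corrected.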
Introduction to Apache Flink
The core technology Airbnb used for their User Signals Platform is Apache Flink, an open source engine built for processing real-time data with very low latency.
Prior to Flink, many stream processing systems (Spark Streaming, for example) would rely on “micro-batching” to process data in “real-time”. They would collect data over a small fixed period (every few seconds/minutes) and then process that data as a batch job.
On the other hand, Flink takes an event-driven approach. Instead of waiting for a batch window to fill up, Flink processes each event as soon as it arrives. This results in much lower latencies.
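Here’s a rough way to see the latency difference. The sketch below is a toy simulation in plain Python (it doesn’t use Flink or any other streaming engine). The arrival times, the 5 second batch interval and the 5 ms per-event processing cost are assumptions chosen for illustration.

```python
def micro_batch_latencies(arrival_times_ms, batch_interval_ms):
    """With micro-batching, an event waits until its batch window closes."""
    latencies = []
    for t in arrival_times_ms:
        window_end = (t // batch_interval_ms + 1) * batch_interval_ms
        latencies.append(window_end - t)
    return latencies


def per_event_latencies(arrival_times_ms, processing_cost_ms):
    """With event-at-a-time processing, an event only pays its processing cost."""
    return [processing_cost_ms for _ in arrival_times_ms]


# Hypothetical arrival times, in milliseconds since the stream started.
arrivals_ms = [500, 1200, 4900, 5000, 7300]

micro = micro_batch_latencies(arrivals_ms, batch_interval_ms=5000)
streaming = per_event_latencies(arrivals_ms, processing_cost_ms=5)

print("micro-batch:", micro)
print("per-event:  ", streaming)
```

In the micro-batch case, the waiting time depends on where in the window an event happens to land, so an event that arrives right at the start of a window waits for the full interval.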
Another benefit of Flink is that it is stateful. Traditional data processing systems might require an external database to maintain state across events. Flink integrates state management directly into the engine, allowing it to remember, accumulate, and update contextual information as events stream in. This is great if you want to do operations like aggregations or joins across your messages.
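As a small illustration of what “stateful” means here, the sketch below keeps a running count per user and event type as events stream in. It’s plain Python that mimics the idea of per-key state. It is not Flink’s API, and the class name, field names and events are hypothetical.

```python
from collections import defaultdict


class KeyedCounter:
    """Keeps a running count per key, in the spirit of per-key operator state."""

    def __init__(self):
        # State lives next to the processing logic instead of in an external database.
        self.state = defaultdict(int)

    def process(self, event):
        key = (event["user_id"], event["type"])
        self.state[key] += 1
        return {"user_id": event["user_id"], "type": event["type"], "count": self.state[key]}


# Hypothetical raw events.
events = [
    {"user_id": "u1", "type": "view_listing"},
    {"user_id": "u2", "type": "view_listing"},
    {"user_id": "u1", "type": "view_listing"},
    {"user_id": "u1", "type": "add_to_wishlist"},
]

counter = KeyedCounter()
outputs = [counter.process(e) for e in events]
for out in outputs:
    print(out)
```

In a real engine this state would also be partitioned across machines by key and included in checkpoints, which the sketch leaves out.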
Other benefits of Apache Flink are:
Fault Tolerance - Flink provides checkpointing mechanisms to ensure that, if a job or node fails, the system can recover to a previously consistent state. This guarantees exactly-once processing, so each event is reflected in the application’s state exactly once, even in the face of failures.
Understanding Event-Time - Flink understands the concept of “event time” (the time when an event actually occurred) instead of just processing time (when the event is processed by the system). This makes it much easier to handle out-of-order events or late-arriving data accurately.
Integration with the Ecosystem - Flink is widely used and comes with connectors to many of the other data tools you might be using (Kafka, Postgres, S3, etc.).
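To show why event time matters, here’s a simplified sketch of counting events in fixed 10 second windows based on when they occurred, with a watermark that tolerates events arriving up to 5 seconds out of order. It’s plain Python that imitates the idea. It is not Flink’s API and it glosses over details of Flink’s actual watermark and window-firing rules. The class name, window size, delay and timestamps are assumptions for the example.

```python
from collections import defaultdict


class EventTimeWindowCounter:
    """Counts events per tumbling event-time window, closing windows by watermark."""

    def __init__(self, window_size, max_out_of_orderness):
        self.window_size = window_size
        self.max_delay = max_out_of_orderness
        self.open_windows = defaultdict(int)  # window start -> event count
        self.max_event_time = None
        self.late_events = []

    def watermark(self):
        # "We don't expect to see events older than this any more."
        return self.max_event_time - self.max_delay

    def process(self, event_time):
        """Returns the (window_start, count) pairs closed by this event."""
        window_start = event_time - event_time % self.window_size
        window_end = window_start + self.window_size
        if self.max_event_time is not None and window_end <= self.watermark():
            self.late_events.append(event_time)  # its window was already emitted
            return []
        self.open_windows[window_start] += 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        wm = self.watermark()
        closed = sorted(s for s in self.open_windows if s + self.window_size <= wm)
        return [(s, self.open_windows.pop(s)) for s in closed]


# Hypothetical event times in seconds, in arrival order (8 and 3 arrive out of order).
counter = EventTimeWindowCounter(window_size=10, max_out_of_orderness=5)
emitted = []
for ts in [1, 4, 12, 8, 16, 3, 27]:
    emitted.extend(counter.process(ts))

print("closed windows:", emitted)
print("late events:   ", counter.late_events)
```

The event with timestamp 8 arrives after the event with timestamp 12 but is still counted in the first window, because the watermark hasn’t passed the end of that window yet. The event with timestamp 3 arrives after its window has closed, so it’s set aside as late (in a Lambda setup, this is the kind of event the batch layer can pick up later).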
Architecture of Airbnb’s User Signals Platform
Here’s the architecture of Airbnb’s User Signals Platform
As mentioned earlier, it’s based on the Lambda Architecture, so it consists of a real-time ingestion layer and a batch layer.
Here are the steps:
User Events: Guests interacting with Airbnb’s apps generate raw events when they view properties, add an experience to their wishlist, search for “rooms in London”, etc.
Real-Time Transformation (Speed Layer): Events flow into Kafka, where Flink jobs consume and transform them into “User Signals.” Some transformations are just simple mappings from raw events, while others may require joining multiple events based on user ID to create richer signals.
KV Storage and Serving: The transformed User Signals are stored in a Key-Value store with append-only writes. Using append-only writes helps ensure idempotency and makes data operations much simpler.
Batch Processing (Batch Layer): Periodic batch jobs will reprocess the historical data sets and identify any discrepancies or missed events from the speed layer. They’ll backfill the missing/incorrect data to ensure long-term data accuracy and consistency.
Asynchronous Computations: In addition to the immediate user signals that are stored in the KV store, Airbnb has Flink jobs that consume the new user signals to generate more insights. These jobs do things like categorize users into segments or group a single user’s actions into “sessions” to get a better understanding of the user’s intent. These jobs are run asynchronously.
Online Queries and Services: The User Signals Platform (USP) service provides a way for downstream services at Airbnb to use the user signals data for their own insights.
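Two of the steps above, the append-only KV writes and the session grouping, can be sketched in a few lines of plain Python. This is an illustration under assumptions and not Airbnb’s actual code: the class and function names, the idea of keying each signal by a unique signal ID, the 30 minute session gap and the sample data are all hypothetical.

```python
class AppendOnlySignalStore:
    """Toy key-value store where a signal is written once and never updated."""

    def __init__(self):
        self._rows = {}  # user_id -> {signal_id: signal}

    def append(self, user_id, signal_id, signal):
        # Writing the same signal twice is a no-op, so retries from the speed
        # layer and backfills from the batch layer are safe to repeat.
        self._rows.setdefault(user_id, {}).setdefault(signal_id, signal)

    def query(self, user_id):
        signals = self._rows.get(user_id, {}).values()
        return sorted(signals, key=lambda s: s["ts"])


def sessionize(signals, gap_seconds):
    """Groups time-ordered signals into sessions split by periods of inactivity."""
    sessions = []
    for signal in signals:
        if sessions and signal["ts"] - sessions[-1][-1]["ts"] <= gap_seconds:
            sessions[-1].append(signal)
        else:
            sessions.append([signal])
    return sessions


store = AppendOnlySignalStore()
store.append("u1", "e1", {"type": "view_listing", "ts": 100})     # speed layer
store.append("u1", "e2", {"type": "add_to_wishlist", "ts": 160})  # speed layer
store.append("u1", "e1", {"type": "view_listing", "ts": 100})     # replayed after a retry
store.append("u1", "e3", {"type": "search", "ts": 5000})          # backfilled by a batch job

signals = store.query("u1")
sessions = sessionize(signals, gap_seconds=1800)
print(len(signals), "signals in", len(sessions), "sessions")
```

The replayed write for e1 doesn’t create a duplicate, which is the property that makes reprocessing and backfills simple to reason about.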
Results
With this setup, the User Signals Platform processes over 1 million events per second across 100+ Flink jobs. The USP service serves over 70k queries per second to various teams/services at Airbnb.
You’ll often hear about the mythical “10x engineer” - the go-to person on the team whenever you need a feature shipped fast. However, 10x engineers aren’t just super-technical, they also have a great sense of what to build.
If you’re working on the wrong feature, then it doesn’t matter how fast you work. The company won’t see a big impact from your work.
Product for Engineers wrote a great article delving into the most impactful engineers and identified six common traits that they share.
Here are a few of the traits.
Always Prototyping and Experimenting - they ship MVPs early and often, iterate quickly based on feedback and aren’t afraid to pivot or kill features that aren’t working.
Are Comfortable Writing - Clear writing skills are a must for documenting features, providing PR feedback, and making big technical decisions with RFCs.
Understand the Broader Context - they understand the organization’s goals and align their decisions/work with the company’s strategy.
For the rest of the traits, check out the Product for Engineers newsletter.
They send out fantastic articles every month to help you develop the skills you need to deliver the most impact (and get promoted faster).
sponsored