How Airbnb Rebuilt their Payments System

July 06, 2022

Hey Everyone!

Today we’ll be talking about

How Airbnb Rebuilt their Payments System
- Airbnb shifted from a Ruby on Rails monolith to a service-oriented architecture. This helped with developer velocity but introduced complexity and some performance issues.
- They added a service mesh to their payments system to reduce complexity for engineers at Airbnb who were using the payments services.
- They improved the performance of read queries with denormalized read replicas.
Tech Snippets
- Design Docs at Google
- A Staff Engineer's Guide to Career Development
- How to build a Chess Engine, from scratch
- The Story of Heroku

How Airbnb rebuilt their Payments System

In 2020, Airbnb migrated from a Ruby on Rails monolith to a service-oriented architecture. This shift helped Airbnb increase developer velocity as the engineering team grew to thousands of globally distributed developers and millions of lines of code.

However, this change also brought multiple challenges that the team had to deal with. Data was now scattered across many different services so aggregating information in the presentation layer was complicated, especially for complex domains like payments. Getting all the information on fees, currency fluctuations, taxes, discounts and more meant that there were far too many different services to call.

Airbnb addressed this by adding a service mesh to provide a unified endpoint to the client services in the presentation layer. A service mesh is a layer of proxy servers added to facilitate communication between microservices. You can also add observability, security and reliability features into the service mesh rather than at the application layer (in the microservices).

Ali Can Göksel is a senior software engineer at Airbnb and he wrote a great blog post on how Airbnb re-architected their Payments layer to incorporate Viaduct, a service mesh built on GraphQL.

Here’s a summary

During their migration to a service-oriented architecture, Airbnb broke up their payments layer into multiple services.

This helped provide

- A clear boundary between different payment services. This enabled better domain ownership and faster development iteration.
- Better data separation into the different domains. Data was kept in a normalized shape (where you reduce data redundancy). This resulted in better correctness and consistency.

The downside of this is that the clients in the presentation layer now had to integrate with multiple payment services to fetch all the required data. They had to look into multiple services and read from even more tables than prior to the services migration.

This resulted in 3 main challenges

- The system was hard to understand - Client teams needed a deep understanding of the Payments domain in order to find the right payments services to gather all the data they needed. This reduced developer velocity for those client teams and also meant engineers on the payment side needed to provide continuous guidance and consultation.
- The system was difficult to change - When the payments team had to update their APIs, they had to make sure that all dependent presentation services adopted these changes.
- Poor performance and scalability - The technical quality of the complex read flows was not up to standards. Aggregating payment-related data for large hosts with thousands of yearly bookings was creating a scaling problem.

Unified Entry Point

Airbnb addressed these challenges by adding a Payments Data Read Layer that acted as a service mesh. To do this, they used Viaduct, which is built on GraphQL.

With this, clients can query the layer for the data entity instead of having to identify dozens of services and their APIs.

This greatly reduced the number of APIs that needed to be exposed.

However, just using a single entry point doesn’t resolve all the complexity. Their payments system has 100+ data models, and exposing all of them from a single entry point would still be overly complex for client engineers.

To simplify this, they created higher-level domain entities to further hide internal payment details. They made fewer than 10 high level entities, so it became much easier for client teams to find the data they wanted. Also, Airbnb could now make changes to internal Payment business logic while keeping the entity schema the same so they wouldn’t have to rewrite any code for the consumer of the payment data read layer.
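As a rough illustration of this idea, here is a minimal Python sketch of one high-level entity sitting in front of several internal data sources. Everything in it is hypothetical: the names `PayoutSummary`, `fetch_payout`, `fetch_fees` and `fetch_taxes`, the fields, and the "net = gross - fees - taxes" rule are assumptions made for the example, not Airbnb’s schema. The real layer is exposed through GraphQL rather than plain functions.

```python
from dataclasses import dataclass

# Hypothetical internal data sources. In a real system these would be
# separate payments services, each with its own API and tables.
_PAYOUTS = {"p1": {"host_id": "h1", "gross_cents": 50_000, "currency": "USD"}}
_FEES = {"p1": 1_500}
_TAXES = {"p1": 4_000}


def fetch_payout(payout_id: str) -> dict:
    return _PAYOUTS[payout_id]


def fetch_fees(payout_id: str) -> int:
    return _FEES[payout_id]


def fetch_taxes(payout_id: str) -> int:
    return _TAXES[payout_id]


@dataclass(frozen=True)
class PayoutSummary:
    """Hypothetical high-level entity exposed to client teams."""
    payout_id: str
    host_id: str
    currency: str
    net_cents: int


def get_payout_summary(payout_id: str) -> PayoutSummary:
    """Single entry point: clients never call the internal sources directly."""
    payout = fetch_payout(payout_id)
    # Assumed business rule, for illustration only.
    net = payout["gross_cents"] - fetch_fees(payout_id) - fetch_taxes(payout_id)
    return PayoutSummary(payout_id, payout["host_id"], payout["currency"], net)
```

Because clients only depend on `PayoutSummary`, the internal fetch functions can be split, merged or rewritten without any client changes, as long as the entity keeps the same shape.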

Improving Performance

As stated earlier, one of the challenges in the previous system was poor performance and scalability. The complex read flows of fetching the data from all the different services caused too much latency, especially for large hosts.

The core problem was reading and joining many different tables and services while executing client queries. To solve this, Airbnb added secondary denormalized Elasticsearch indices to serve as read replicas.

This moves the expensive operations from query time to ingestion time. Instead of joining many tables on every query, the joins are done once, when the data is written to the replicas. The trade-off is weaker consistency, since the replicas lag behind the source databases.

They created a system where real-time data could be written to the secondary store via database change data capture mechanisms and historical data could be written through daily database dumps. They were able to reliably achieve less than 10 seconds of replication lag.

Airbnb denormalized the data from the tables into Elasticsearch. This greatly reduced the touchpoints of the query and also made pagination and aggregation much more efficient.
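The sketch below shows the general pattern of joining at ingestion time rather than at query time. It is a simplified assumption of how such a pipeline could look, not Airbnb’s implementation: plain dictionaries stand in for the source tables and for the Elasticsearch index, and the table names, fields and event format are all made up for the example.

```python
# In-memory stand-ins for normalized source tables (hypothetical schema).
bookings = {"b1": {"host_id": "h1", "amount_cents": 20_000}}
fees = {"b1": {"fee_cents": 600}}

# Stand-in for a denormalized search index, keyed by document id.
index = {}


def build_document(booking_id: str) -> dict:
    """Join the source rows once, at ingestion time."""
    booking = bookings[booking_id]
    fee = fees.get(booking_id, {"fee_cents": 0})
    return {
        "booking_id": booking_id,
        "host_id": booking["host_id"],
        "amount_cents": booking["amount_cents"],
        "fee_cents": fee["fee_cents"],
    }


def on_change_event(event: dict) -> None:
    """Apply one change-data-capture style event and re-index the document."""
    table = bookings if event["table"] == "bookings" else fees
    table.setdefault(event["booking_id"], {}).update(event["row"])
    # Only index once the booking row itself exists.
    if event["booking_id"] in bookings:
        index[event["booking_id"]] = build_document(event["booking_id"])


def host_total(host_id: str) -> int:
    """Query-time read: a scan over flat documents, with no joins."""
    return sum(
        doc["amount_cents"] - doc["fee_cents"]
        for doc in index.values()
        if doc["host_id"] == host_id
    )
```

Each change event pays the cost of the join up front, so `host_total` only has to read flat documents. A real index would also filter by `host_id` without scanning every document.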

After combining all of the above improvements, their new payments read flow looked like the following

The presentation service would query one of the domain entities for the data it was looking for. If the request didn’t need strong consistency, then it could go to the Elasticsearch index for that entity. Otherwise, it would go to the entity’s master database. The indexing service would make sure the Elasticsearch replicas are updated with new changes from the master.
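One way to picture this routing is the small Python sketch below. The `EntityStore` class and its `strong_consistency` flag are hypothetical names chosen for the example, and two dictionaries stand in for the master database and the Elasticsearch replica.

```python
from dataclasses import dataclass, field


@dataclass
class EntityStore:
    """Hypothetical pair of stores backing one domain entity."""
    primary: dict = field(default_factory=dict)  # source of truth
    replica: dict = field(default_factory=dict)  # denormalized, may lag

    def write(self, key: str, value: dict) -> None:
        # Writes land on the primary only; the replica catches up later.
        self.primary[key] = value

    def sync(self) -> None:
        # Stand-in for the indexing service copying changes to the replica.
        self.replica.update(self.primary)

    def read(self, key: str, strong_consistency: bool = False):
        # Strongly consistent reads go to the primary; others use the replica.
        source = self.primary if strong_consistency else self.replica
        return source.get(key)
```

Until `sync` runs, a default read can miss a fresh write, which is the replication lag described above. Callers that cannot tolerate that pass `strong_consistency=True` and accept the slower path.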

This shift to denormalization resulted in up to 150x latency improvements and improved reliability to 99.9%.

For more details, you can read the full post here.


Tech Snippets

- How to build a Chess Engine, from scratch - This is an awesome blog post that goes through how you can build a chess engine in JavaScript. It uses the chess.js library for move generation/validation, piece placement/movement and checkmate detection. Read the blog post to learn about tree searching algorithms, alpha-beta pruning, move evaluation functions and more on how to write a program that can play chess well.
- A Staff Engineer’s Guide to Career Development - Andrew Hao is a Staff Software Engineer at Lyft. He wrote a fantastic blog post on how senior engineers can get promoted to the staff level and things you should be doing as a developer to advance your career. He talks about building a network/influence at your company, finding the right problems to tackle and how to lead.
- Design Docs at Google - One key part of Google’s culture is the use of Design Docs for defining software design. The engineers in charge of a project/application will first write a design doc that outlines the context and scope of the new system, goals and non-goals, the architecture, API, data storage, and more. This is a great blog post that goes into detail on how design docs are written and what is covered inside one.
- The Story of Heroku - Heroku is an innovator in the Platform as a Service offering that many developers now rely on. Lee Robinson wrote a really interesting blog post on the early days of Heroku, technical choices they made, how they became a dominant platform and some of their recent missteps.

Interview Question

You are given an array and an integer K where K is less than the size of the array.

Return the smallest K numbers in the array.

Previous Question

As a refresher, here’s the previous question

You are given an array filled with letters and numbers.

Find the longest subarray with an equal number of letters and numbers.

Return the longest subarray.

If there are multiple results, then return the subarray with the lowest starting index.

Solution

We can solve this question in a single pass of the input array.

As we iterate through our input array, we’ll count how many numbers and letters we’ve seen so far.

We’ll keep track of the difference between the two (count of numbers minus count of letters) in a hash table, along with the earliest index at which each difference occurs. The difference will be the key and the earliest index will be the value. Before we start iterating, we’ll record a difference of 0 at index -1, so that subarrays beginning at the first element are also found.

If we ever come across the same difference again (the current difference between numbers and letters is already in our hash table), then the elements after that earlier index, up to and including our current index, contain an equal number of letters and numbers.

Therefore, we can find the length of the subarray between that earlier index and our current index. If this is the longest subarray with equal amounts that we’ve seen so far, then we’ll store it as our longest subarray.

Then we can continue iterating through the array.

Once we’ve finished iterating through the array, we can just return the longest subarray we found.
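Here is one possible Python implementation of the approach described above. The function name `longest_balanced_subarray` is our own, and the code assumes that a "letter" is an alphabetic string and that every other element (an int, or a digit string such as "7") counts as a number.

```python
def longest_balanced_subarray(items):
    """Return the longest subarray with equal counts of letters and numbers."""
    # Maps (numbers seen - letters seen) -> earliest index where it occurred.
    # The seed entry treats "before the array starts" as index -1.
    first_seen = {0: -1}
    diff = 0
    best_start, best_len = 0, 0

    for i, item in enumerate(items):
        # Assumption: letters are alphabetic strings; anything else
        # is treated as a number.
        is_letter = isinstance(item, str) and item.isalpha()
        diff += -1 if is_letter else 1

        if diff in first_seen:
            start = first_seen[diff] + 1
            length = i - first_seen[diff]
            # Strict ">" keeps the lowest starting index on ties.
            if length > best_len:
                best_start, best_len = start, length
        else:
            first_seen[diff] = i

    return items[best_start:best_start + best_len]
```

This runs in O(n) time and uses O(n) extra space for the hash table. If no balanced subarray exists, it returns an empty list.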