The Architecture of Wayfair's Ad Bidding System

Hey Everyone,

Today we’ll be talking about

  • Using Rate Limiters and Load Shedding at Stripe

    • What’s Rate Limiting and what’s Load Shedding

    • The different types of limiters Stripe uses in production

    • Different algorithms for building a rate limiter

  • How Wayfair programmatically bids for millions of ads every day

    • Wayfair’s old system

    • Design goals for the new Ad Bidding Platform

    • The architecture of the new system

  • Plus, some tech snippets on

    • How PNG works - Compromising Speed for Quality

    • Implementing RSA from Scratch

    • Learning Containers from the Bottom Up

    • Why you should read academic CS papers

We also have a solution to our last interview question and a new question from Google.

Quastor is a free Software Engineering newsletter that sends out summaries of technical blog posts, deep dives on interesting tech and FAANG interview questions and solutions.

Become a Founding CTO with Fractal

Startups are one of the best ways to fast-track your career.

But, becoming a CTO and going out on your own can be very risky.

Fractal massively derisks the founding process and they’ll put you past the biggest hurdles a new founder faces.

Once you’ve been accepted to their program, you’ll be given access to Fractal’s library of thoroughly researched business ideas.

They will partner you with a highly capable and complementary cofounder and also fund your company.

After, they’ll provide ongoing support with a team of experts who can help you with everything from recruiting to technology strategy to fundraising.

Fractal’s risk-optimized approach gives first-time founders the best chance at success.

sponsored

Stripe Engineering wrote a fantastic blog post on how they think about rate limiters.

Here’s a summary.

Rate Limiting is a technique used to limit the amount of requests a client can send to your server.

It’s incredibly important to prevent DoS attacks from clients that are (accidentally or maliciously) flooding your server with requests.

A rule of thumb for when you should use a rate limiter is if your users can reduce the frequency of their API requests without affecting the outcome of their requests, then a rate limiter is appropriate.

For example, if you’re running Facebook’s API and you have a user sending 60 requests a minute to query for their list of Facebook friends, you can rate limit them without affecting their outcome. It’s unlikely that they’re adding new Facebook friends every single second.

Rate Limiting is great for day-to-day operations, but you’ll occasionally have incidents where some component of your system is down and you can’t process requests at your normal level.

In these scenarios, Load Shedding is a technique where you drop low-priority requests to make sure that critical requests get through.

Stripe is a payment processing company (you can use their API to collect payments from your users) so a critical request for them is a request to charge a user money.

An example of a non-critical method would be a request to read charge data from the past.

Stripe uses 4 different types of limiters in production (2 rate limiters and 2 load shedders).

Request Rate Limiter

Restricts each user to n requests per second. However, they also built in the ability for a user to briefly burst above the cap to handle legitimate spikes in usage.

Concurrent Requests Limiter

Restricts each user to n API requests in progress at the same time.

This helps stripe manage the load of their CPU-intensive API endpoints.

Fleet Usage Load Shedder

Stripe divides their traffic into two types: critical API methods and non-critical methods.

An example of a critical method would be creating a charge (charging a customer for something), while a non-critical method is listing a charge (looking at past charges).

Stripe always reserves a fraction of their infrastructure for critical requests. If the reservation number is 10%, then any non-critical request over the 90% allocation would be rejected with a 503 status code.

Worker Utilization Load Shedder

Stripe uses a set of workers to independently respond to incoming requests in parallel. If workers start getting backed up with requests, then this load shedder will shed lower priority traffic.

Stripe divides their traffic into 4 categories

  • Critical Methods

  • POSTs

  • GETs

  • Test mode traffic (traffic from developers testing the API and making sure payments are properly processed)

If worker capacity goes below a certain threshold, Stripe will begin shedding less-critical requests, starting from test mode traffic.

Building a rate limiter in practice

There are quite a few algorithms you can use to build a rate limiter. Algorithms include

Token Bucket - Every user gets a bucket with a certain amount of “tokens”. On each request, tokens are removed from the bucket. If the bucket is empty, then the request is rejected.

New tokens are added to the bucket at a certain threshold (every n seconds). The bucket can hold a certain number of tokens, so if the bucket is full of tokens then no new tokens will be added.

Fixed Window - The rate limiter uses a window size of n seconds for a user. Each incoming request from the user will increment the counter for the window. If the counter exceeds a certain threshold, then requests will be discarded.

After the n second window passes, a new window is created.

Sliding Log - The rate limiter track’s every user’s request in a time-stamped log. When a new request comes in, the system calculates the sum of logs to determine the request rate. If the request rate exceeds a certain threshold, then it is denied.

After a certain period of time, previous requests are discarded from the log.

Stripe uses the token bucket algorithm to do their rate limiting.

Quastor is a free Software Engineering newsletter that sends out summaries of technical blog posts, deep dives on interesting tech and FAANG interview questions and solutions.

Tech Snippets

  • How PNG works - Reducible runs an awesome YouTube channel where he explains computer science topics in an interactive and intuitive way. He’s like the 3Blue1Brown for CS.This is an awesome video where he explains how the PNG file format works and the tradeoffs made in it’s design.

  • Learning Containers From The Bottom Up - If you want to get an in-depth understanding of containers and how they work, this is a fantastic resource for that. Ivan Velichko goes through containers, the differences with VMs, images, orchestrators and more in this post.

  • Implementing RSA in Python from Scratch - If you want to understand the mathematics behind RSA, then this is an awesome series of blog posts that goes through how to implement the algorithm in Python.

  • Why you should read CS Papers - This is an interesting post from the Stack Overflow blog that discusses why programmers should be reading CS academic papers. If you want to be at the cutting-edge then papers like the Dynamo paper (Amazon’s Dynamo database), the RDD paper (underpins Apache Spark), the Google File System paper (inspiration for HDFS) and more are required reading.It can be intimidating to start, so Papers We Love is a great community to get started.

How Wayfair Bids for Ads at Scale

Wayfair is an e-commerce company that sells home furniture. You can think of them as the “Amazon for home furniture”.

They’ve partnered with over 23,000 suppliers and have more than 30 million products on their website.

To get customers to Wayfair’s site, they spend a huge amount on advertising on search engines and social media sites. In 2019, Wayfair spent over a billion dollars on advertising for things like Google Search Ads, Facebook/Instagram Ads, YouTube Ads, etc.

When you search google for “desk lamp”, Wayfair will be one of the companies bidding for the ad slot for that term. Wayfair might be willing to pay $1.00 for each click on their ad for desk lamps.

Determining the optimal bid for millions of ad placements is an extremely difficult task and it means the difference of hundreds of millions of dollars for Wayfair’s bottom line. An ad bid that’s too low will not win the ad placement slot while submitting bids that are too high will hurt Wayfair’s profit margins.

To manage this, Wayfair’s Bidding & Optimization Platform team built a central platform for Ads bidding.

Satish Gopalakrishnan is an Associate Director of Engineering at Wayfair and he wrote a great blog post on the design behind this system.

Here’s a summary

In the early days of ad bidding at Wayfair, there was no centralized platform for placing bids. When an engineer wanted to test a new method for placing ad bids, they would have to re-implement several parts of the system.

This meant a duplication of similar types of work, a slower time to market and a lack of observability and coordination.

To solve this, Wayfair decided to build a central platform.

Design Goals

The design goals of Wayfair’s ad bidding platform were

  • Extensibility - There is no single best method for placing ad bids. You might want to use an ML model to determine the bid or you might want to use simple heuristics.The method of bidding can differ based on product category, targeted keyword, etc.Therefore, the platform should be able to easily support different methods of placing bids.

  • Observability - The platform should provide visibility into why decisions were made in a certain manner.

  • Configurability - There should be controls to change parameters like bid aggressiveness and return on ad-investment. It should be easy to configure a high aggressiveness for certain products and a low aggressiveness for others.

  • Experimentation - The platform should allow for easy trial-and-error experimentation to test out new bidding approaches.

  • Scalability - The platform should easily scale as Wayfair adds more products and bidding approaches. It should also make high quality data easily available for the Data Science and Marketing teams.

Platform Architecture

blog_fig_3.png

The architecture for the Ads Bidding platform is displayed above. We’ll go through each of the components.

FORGE

FORGE does the data engineering necessary to create the data that the Ad Bidding platform uses to create bids.

FORGE gets data from various external and internal sources/signals. Then, it applies data transformations to the data ranging from aggregations to identity mappings. FORGE also does more complex operations to detect anomalous data and exclude it.

It’s a sequential pipeline and the transformations are done in a modular manner, so they’re easy to configure/remove.

Bidding Console

On the bottom-left of the diagram is the Bidding Console. This is where Wayfair analysts can adjust parameters in the Ad Bidding Platform for things like bid aggressiveness.

It’s very customizable and can be targeted to specific products/keywords.

Single Bid Calculators

The core building block of the Ad Platform is the Single Bid Calculator (SBC). SBCs provide the core logic that determines the actual bids.

An SBC takes in the specific product (Herman Miller Aeron Chair for ex.) as input. Then, it uses all relevant data/signals for that product to produce a bid for the ad placement. The SBC relies on ML or heuristics (or both) in order to come up with the bid.

blog_fig_4.png

The Ad Platform will use multiple SBCs to come up with multiple bids for that Herman Miller Aeron Chair product.

This gives Wayfair the flexibility to

  • Use some combination of the bids as the final bid

  • Evaluate multiple bidding approaches at the same time and do a counterfactual analysis (e.g. Would bidding approach X have generated better bids for a specific scenario)

  • Have fallback bids in case an SBC fails

Observability and Vendor Mapping

After the final bids are produced, they are sent to an Observability layer where various factors about the bid are recorded.

After, the bids are translated into vendor-specific terms and then sent to the various advertising vendor platforms (Google Ads, Bing Ads, etc.)

For more details, you can read the full blog post here.

Quastor is a free Software Engineering newsletter that sends out summaries of technical blog posts, deep dives on interesting tech and FAANG interview questions and solutions.

Interview Question

Write a program to solve a Sudoku puzzle by filling the empty cells.

A sudoku solution must satisfy all of the following rules:

  1. Each of the digits 1-9 must occur exactly once in each row.

  2. Each of the digits 1-9 must occur exactly once in each column.

  3. Each of the digits 1-9 must occur exactly once in each of the 9 3x3 sub-boxes of the grid.

The '.' character indicates empty cells.

Previous Question

As a refresher, here’s the previous question

Given a positive integer n, build a square matrix that has elements from 1 to n^2 in spiral order.

Solution

We can break this problem down into two parts.

The first part is to build the n x n matrix and get all the numbers from 1 to n^2

The second part is to iterate through the matrix in spiral order and iteratively set each item in the matrix to the corresponding number in our array from 1 to n^2.

In order to iterate through the matrix in spiral order, we create four pointers: top, bottom, left and right.

top points at the top row and bottom points at the last row.

left points at the first column and right points at the last column.

Then, we create a while loop that runs while top <= bottom and left <= right.

On each iteration of the while loop, we iterate through one row or column in one of the four possible directions. Based off how “spiral movement” works, the directions go in the order

  1. left to right

  2. top to bottom

  3. right to left

  4. bottom to top

We use a counter to keep track of which direction we’re going in.

At the end of each iteration of the while loop, we increment the counter (unless the counter is at direction 4, in which case we set it back to direction 1).