How Cloudflare Optimized their Reverse Proxy with Tries

Cloudflare used a nifty data structure called a Trie to cut latency by 75% for a code path that serves 35 million requests per second. Plus, a dive into database fundamentals, a collection of debugging stories and more.

Hey Everyone!

Today we’ll be talking about

  • How Cloudflare cut latency by 75% with Tries

    • Cloudflare’s proxy framework and how they benchmark

    • Their function for removing HTTP headers and how they optimized it by inverting the search

    • Why Cloudflare chose tries as the data structure and how they built their own optimized version

  • Tech Snippets

    • A Dive into Database Fundamentals

    • Notes on how to use LLMs in your product

    • A Collection of Debugging Stories

    • 45 Ways to Break an API server

At any given moment, the best developers will usually have many different job opportunities available to them. If you want to hire and retain these engineers, you’ll need to make sure that your team culture encourages them to stay.

PostHog published a fantastic article in Product for Engineers that delves into some of the things you can do to help foster this kind of culture.

  • Generate enthusiasm - Help engineers take ownership of their own projects by encouraging them to make decisions about what to do next. This can help make them much more enthusiastic about their job.

  • Remove pointless approval processes and get out of the way - Arbitrary reporting obligations, an arduous deployment process, unnecessary meetings and the like are excellent ways to kill morale on your team.

  • Nail your hiring process - The number one factor great developers love is working with other great developers. Make sure you maintain a strict hiring bar to ensure this.

For the rest of the tips, check out Product for Engineers. It’s a fantastic newsletter with 20,000+ subscribers that teaches developers how to build great products.

sponsored

How Cloudflare Optimized their Reverse Proxy with Tries

When you’re visiting a website, the data you send/receive will go through several proxies.

Proxies are just servers placed in-between the user and the website backend. They can be split into two types: forward proxies and reverse proxies.

Forward proxies are set up by clients/users. A forward proxy might do things like mask the user’s IP address, enforce access restrictions (no gambling websites for example), etc.

Reverse proxies are set up by the website to sit in front of their backend servers. They handle tasks like load balancing, filtering out malicious traffic (DDoS attacks), logging and more.

A huge portion of the websites on the internet use Cloudflare as a reverse proxy. They handle caching static content, DDoS protection, load balancing and much more. On average, their infrastructure handles over 60 million HTTP requests per second.

With that scale, optimizing every single service at Cloudflare to cut down on microseconds (1 microsecond is 1 millionth of a second) can lead to huge performance and efficiency gains.

Cloudflare engineering recently published a fantastic blog post on a specific optimization they made in Pingora, a Rust framework for building reverse proxy services. In the post, they talk about which data structures and algorithms they considered, how they did their benchmarking and the performance gains they saw.

We’ll be summarizing the post and adding some extra context.

Cloudflare’s Proxy Framework

Pingora is an open source Rust framework that Cloudflare uses to build reverse proxy services. One of the services built on this framework is pingora-origin.

When a request enters Cloudflare’s infrastructure, various services add HTTP headers to that request. These headers are used by other internal services at Cloudflare.

Before the request leaves Cloudflare’s infrastructure, all the HTTP headers that were added for internal purposes need to be removed.

pingora-origin is the service that handles removing these HTTP headers (amongst other responsibilities). It has to do this for every request that leaves Cloudflare’s infrastructure, so approximately 35 million requests per second.

Cloudflare needs to ensure pingora-origin (and their other services) is as optimized as possible. One of the strategies they use to do this is rigorous benchmarking.

Benchmarking

Cloudflare keeps track of the latency of their services down to the nanosecond.

One of the tools they use to do this is the criterion Rust crate (a Rust package). It provides an API for timing Rust code down to the nanosecond and also gives feedback on how performance improves or regresses over time.

To benchmark pingora-origin’s code for removing internal HTTP headers, Cloudflare uses a large set of simulated HTTP requests with a random number of headers (uniformly distributed between internal and non-internal headers).

They found that the function in pingora-origin for removing the internal HTTP headers took an average of 3.65 microseconds. This code needs to run tens of millions of times every second, so this adds up.
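Cloudflare’s harness uses criterion for statistically rigorous measurements. As a rough illustration of the same idea using only the standard library (the header-stripping function here is a hypothetical stand-in, not Cloudflare’s actual code), a micro-benchmark sketch might look like:

```rust
use std::time::Instant;

// Hypothetical stand-in for the real header-removal logic.
fn clear_internal_headers(headers: &mut Vec<String>) {
    // Drop anything with an assumed internal prefix.
    headers.retain(|h| !h.starts_with("x-internal-"));
}

fn main() {
    let iterations: u32 = 100_000;
    let start = Instant::now();
    for _ in 0..iterations {
        let mut headers = vec![
            "content-type".to_string(),
            "x-internal-trace-id".to_string(),
            "accept".to_string(),
        ];
        clear_internal_headers(&mut headers);
    }
    // Average wall-clock time per call, including the Vec setup.
    let per_call = start.elapsed() / iterations;
    println!("average per call: {per_call:?}");
}
```

Unlike this sketch, criterion handles warm-up, outlier detection and regression tracking for you, which is why it’s the right tool at nanosecond scale.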

Reducing Reads through Inversion

Cloudflare originally implemented the clear_internal_headers function with a loop that iterated through all the possible internal headers and removed each one from the request.

// slow code 
pub fn clear_internal_headers(request_header: &mut RequestHeader) {
    INTERNAL_HEADERS.iter().for_each(|h| {
        request_header.remove_header(h);
    });
}

This approach has a time complexity of O(N x L), where N is the number of potential internal headers and L is the length of the HTTP request’s header list. The code needs to iterate through the entire list of request headers for every potential internal header.

The issue is that each request has an average of 10-30 internal headers that need to be removed. However, Cloudflare has over 100 internal headers that could potentially be in the request. They’re searching through the entire header list in the request over 100 times, once for each potential match.
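A back-of-envelope comparison makes the saving concrete (the counts here are assumed for illustration, based on the rough figures above):

```rust
// Back-of-envelope cost comparison for the two approaches.
fn main() {
    let potential_internal_headers = 100; // internal header names to check
    let request_header_count = 30;        // headers on a typical request

    // Original approach: one scan of the request's headers per internal name.
    let original_work = potential_internal_headers * request_header_count;

    // Inverted approach: one set lookup per header actually on the request.
    let inverted_work = request_header_count;

    println!("original: ~{original_work} comparisons, inverted: ~{inverted_work} lookups");
}
```

Under these assumptions the inverted search does roughly two orders of magnitude less work per request.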

To optimize this, CloudFlare decided to flip the process around and do a reverse search.

Instead, they would search through the request’s headers and identify which internal HTTP headers were present. Then, they only needed to remove the headers that were confirmed as present.

// Inverted Approach
pub fn clear_internal_headers(request_header: &mut RequestHeader) {
    // Identify internal headers that are present
    let to_remove = request_header
        .headers
        .keys()
        .filter_map(|name| INTERNAL_HEADER_SET.get(name))
        .collect::<Vec<_>>();

    // Remove the internal headers that were
    // found in the HTTP request
    to_remove.into_iter().for_each(|k| {
        request_header.remove_header(k);
    });
}

Optimizing the Data Structure

With this new inverted approach, the Cloudflare team needed to choose a data structure that could quickly check the HTTP request’s header names against the set of internal header names and determine which ones matched.

To solve this, the Cloudflare team considered hashmaps, sorted sets (implemented internally with B-trees) and regular expressions.

However, they eventually settled on a data structure built specifically for string searching operations like this: the trie.

Tries at Cloudflare

A trie is a tree data structure that’s used for prefix searches over a group of known strings.

Let’s say you have the strings “and”, “ant”, “dad”, “do” and “dot”.

Here’s what a trie would look like for those strings.

The root node represents an empty string prefix and the first level in the trie represents the first letter of all the strings in the known strings set. The second level represents the first two letters and so on.

Some of the benefits of tries are

  • Fast Search Operations - As you traverse the path of a string in the trie, each step narrows the search space to only the matching entries. If you have a bunch of phone numbers in your trie and you want to find all phone numbers that start with area code 605, a trie gives you all those matches in O(s) time, where s is the length of the prefix (605 in this case).

  • Space Efficient - When you have a very large dictionary, tries can be extremely space efficient. Strings with common prefixes share the same path of nodes. For example, bat and bad would be stored in the same nodes for the first two characters. Compressed tries and radix tries are trie variants that reduce memory usage further.

  • Dynamic Data - Adding or removing a string from a trie can be done in O(s) time, where s is the length of the string. You just have to traverse the path of that string in the trie.
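To make the structure concrete, here is a minimal trie sketch in Rust over the example strings from earlier. This is a toy illustration of the idea, not Cloudflare’s implementation:

```rust
use std::collections::HashMap;

// Minimal trie: each node maps a character to a child node,
// and marks whether a stored string ends here.
#[derive(Default)]
struct TrieNode {
    children: HashMap<char, TrieNode>,
    is_end: bool,
}

impl TrieNode {
    // Insert a string in O(s) time by walking/creating its path.
    fn insert(&mut self, word: &str) {
        let mut node = self;
        for c in word.chars() {
            node = node.children.entry(c).or_default();
        }
        node.is_end = true;
    }

    // Check membership in O(s) time by walking the string's path.
    fn contains(&self, word: &str) -> bool {
        let mut node = self;
        for c in word.chars() {
            match node.children.get(&c) {
                Some(next) => node = next,
                None => return false,
            }
        }
        node.is_end
    }
}

fn main() {
    let mut trie = TrieNode::default();
    for word in ["and", "ant", "dad", "do", "dot"] {
        trie.insert(word);
    }
    assert!(trie.contains("dot"));
    assert!(!trie.contains("da")); // "da" is only a prefix, not a stored word
    println!("trie lookups work");
}
```

Note how “and” and “ant” share the nodes for “a” and “n”, which is where the space savings come from.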

After benchmarking a few trie implementations from crates.io, Cloudflare couldn’t find one optimized for their workload. Most trie libraries are built for use cases like autocomplete, where lookups happen in response to keyboard events, not for processing tens of millions of requests per second.

Cloudflare ended up building their own optimized trie that they call trie-hard. The implementation is open source in the trie-hard repository.

With the new implementation, they were able to reduce the average runtime of clear_internal_headers to 0.93 microseconds, a roughly 75% reduction from the original average of 3.65 microseconds.

The fastest way to get promoted is to work on projects that have a big impact on your company. Big impact => better performance review => promotions and bigger bonuses.

But, how do you know what work is useful?

The key is in combining your abilities as a developer with product skills.

If you have a good sense of product, then you can understand what users want and which features will help the company get more engagement, revenue and profit.

Product for Engineers is a fantastic newsletter that’s dedicated to helping you learn these exact skills.

It’s totally free and they send out curated lessons for developers on areas like

  • How to run successful A/B tests

  • Using Feature Flags to ship faster

  • Startup marketing for engineers

and much more.

sponsored

Tech Snippets