How Cloudflare Optimized their Reverse Proxy with Tries

Cloudflare used a nifty data structure called a Trie to cut latency by 75% for a code path that serves 35 million requests per second. Plus, a dive into database fundamentals, a collection of debugging stories and more.

Hey Everyone!

Today we’ll be talking about

  • How Cloudflare cut latency by 75% with Tries

    • Cloudflare’s proxy framework and how they benchmark

    • Their function for removing HTTP headers and optimizing it by inverting it

    • Why Cloudflare used Tries as the data structure and how they optimized it with their own version

  • Tech Snippets

    • A Dive into Database Fundamentals

    • Notes on how to use LLMs in your product

    • A Collection of Debugging Stories

    • 45 Ways to Break an API server

How CloudFlare Optimized their Reverse Proxy with Tries

When you’re visiting a website, the data you send/receive will go through several proxies.

Proxies are just servers placed in-between the user and the website backend. They can be split into two types: forward proxies and reverse proxies.

Forward proxies are set up by clients/users. A forward proxy might do things like mask the user’s IP address, enforce access restrictions (no gambling websites for example), etc.

Reverse proxies are set up by the website to sit in front of their backend servers. They handle tasks like load balancing, filtering out malicious traffic (DDoS attacks), logging and more.

A huge portion of the websites on the internet use Cloudflare as a reverse proxy. They handle caching static content, DDoS protection, load balancing and much more. On average, their infrastructure handles over 60 million HTTP requests per second.

With that scale, optimizing every single service at Cloudflare to cut down on microseconds (1 microsecond is 1 millionth of a second) can lead to huge performance and efficiency gains.

CloudFlare engineering recently published a fantastic blog post on a specific optimization they made in Pingora, a Rust framework for building reverse proxy services. In the post, they talk about what data structures/algorithms they considered, how they did their benchmarking and the performance gains they saw.

We’ll be summarizing the post and adding some extra context.

CloudFlare’s Proxy Framework

Pingora is an open source Rust framework that CloudFlare uses to build reverse-proxy services. One of the services in this framework is pingora-origin

When a request enters CloudFlare’s infrastructure, the various services at CloudFlare will add HTTP headers to that request. These headers are used by other internal services at CloudFlare.

Before the request leaves CloudFlare’s infrastructure, they need to remove all the HTTP headers they added for internal purposes.

pingora-origin is the service that handles removing these HTTP headers (amongst other responsibilities). It has to do this for every request that leaves CloudFlare infrastructure, so approximately 35 million requests per second.

CloudFlare needs to ensure pingora-origin (and their other services) is as optimized as possible. One of the strategies they use to do this is rigorous benchmarking.

Benchmarking

CloudFlare keeps track of the latency of their services down to the nanosecond.

One of the tools they use to do this is the criterion Rust crate (Rust package).  This provides an API for timing rust code down to the nanosecond and also gives feedback on how performance improves/regresses over time.

To benchmark pingora-origin‘s code for removing internal HTTP headers, CloudFlare uses a large set of simulated HTTP requests with a random number of headers (uniformly distributed between internal and non-internal headers).

They found that the function in pingora-origin for removing the internal HTTP headers took an average of 3.65 microseconds. This code needs to run tens of millions of times every second, so this adds up.

Reducing Reads through Inversion

The original way CloudFlare had implemented the clear_internal_headers function was with a loop that iterated through all the possible internal headers and checked which of those headers were in the request.

// slow code 
pub fn clear_internal_headers(request_header: &mut RequestHeader) {
    INTERNAL_HEADERS.iter().for_each(|h| {
        request_header.remove_header(h);
    });
}

This approach has a time complexity of O(N x L) where N is the length of the HTTP request’s headers list and L is the average length of the internal HTTP headers (that need to be removed). The code needs to iterate through the entire list of request headers for every potential internal header.

The issue is that each request has an average of 10-30 internal headers that need to be removed. However, CloudFlare has over 100 internal headers that could potentially be in the string. They’re searching through the entire header list in the request over 100 times for each of the potential matches.

To optimize this, CloudFlare decided to flip the process around and do a reverse search.

Instead, they would search through the request header and identify all the internal HTTP headers that were present. Then, they could just look at each of the internal HTTP headers that were confirmed as present and remove them.

// Inverted Approach
pub fn clear_internal_headers(request_header: &mut RequestHeader) {
   // Identify internal headers that are present
   let to_remove = request_header
       .headers
       .keys()
       .filter_map(|name| INTERNAL_HEADER_SET.get(name))
       .collect::<Vec<_>>();


   // Remove the internal headers that were
   // found in the HTTP request
   to_remove.into_iter().for_each(|k| {
       request_header.remove_header(k);
   });

Optimizing the Data Structure

With this new inverted approach, the CloudFlare team needed to choose which data structure to use. It would need to quickly check all the internal header names against the HTTP request’s header list and determine which ones are in the list.

To solve this, the CloudFlare team considered hashmaps, sorted sets (implemented internally with B Trees) and regex.

However, they eventually settled on a data structure built specifically for string searching operations like this: the trie.

Tries at CloudFlare

A trie is a tree data structure that’s used for prefix searches over a group of known strings.

Let’s say you have the strings and, ant, dad, do and dot.

Here’s what a trie would look like for those strings.

The root node represents an empty string prefix and the first level in the trie represents the first letter of all the strings in the known strings set. The second level represents the first two letters and so on.

Some of the benefits of trie are

  • Fast Search Operations - As you traverse the path of the string in the trie, you narrow your search space to all the matches with each step. If you have a bunch of phone numbers in your trie and you want to find all phone numbers that start with area code 605, then a Trie is a way to get all those matches in O(s) time where s is the length of the substring (605 in this case).

  • Space Efficient - when you have a very large dictionary, tries can be extremely space efficient. All the substring common prefixes are stored in the same path of nodes. For example, bat and bad would be stored in the same nodes for the first two characters. Compressed Tries and Radix Tries are different implementations of tries that reduce memory usage further.

  • Dynamic Data - adding or removing substrings from a trie can be done in O(s) time where s is the length of the substring. You just have to traverse the path of the new substring in the trie.

After benchmarking a few trie implementations from crates.io, CloudFlare couldn’t find an optimized library. Most tries are used in response to keyboard events, not for processing tens of millions of requests per second.

CloudFlare ended up building their own optimized trie that they call trie-hard. You can view the implementation details here.

With the new implementation, they were able to reduce the average runtime of clear_internal_headers to 0.93 microseconds (from an original average of 3.65 microseconds).

Tech Snippets