How Netflix does Load Shedding

Netflix recently published a blog post about how they use load shedding in their backend. Plus, lessons after one year of Rust in production, the bad state of Java deserialization and more.

September 23, 2024

Hey Everyone!

Today we’ll be talking about

How Netflix uses Service-Level Load Shedding to Improve Reliability
- Benefits of Load Shedding at the Service-level
- How Services at Netflix Categorize Requests
- CPU-Based Load Shedding
- Load Shedding Anti-Patterns
Tech Snippets
- The Sorry State of Java Deserialization
- A Terrible Way to Jump Into Colocating Your Own Stuff
- One Year of Rust in Production

How Netflix uses Service-Level Load Shedding to Improve Reliability

Netflix is the world's biggest streaming service, responsible for close to 15% of global internet traffic in 2023. They’re the largest single source of internet traffic in the world (nearly as much as YouTube and TikTok combined).

One issue Netflix engineers have to deal with is virality. Shows can quickly go viral on social media and drive a huge surge of users to the platform.

In 2020, Netflix added load shedding in their API gateway, Zuul. This ensured that traffic critical to playing movies and TV shows would get prioritized over less critical traffic (logging, analytics, unnecessary “frills”, etc.).

For example, during normal circumstances, the Netflix site will autoplay a movie’s trailer when you hover your mouse over it. However, if the backend is under heavy strain then Netflix’s API gateway will ignore any requests from the frontend for these autoplay movie trailers.

Recently, the Netflix team wanted to incorporate load shedding at the service level in their backend. They use a microservices-oriented architecture and they wanted individual microservices to load shed lower-priority incoming requests when the service is under strain (previously, load shedding was only done in the API gateway).

They published a fantastic blog post about this a few months ago. We’ll be summarizing it and adding some extra context.

Benefits of Service-Level Load Shedding

Adding service-level load shedding can obviously be complex. However, Netflix saw quite a few advantages.

1. Finer-Grained Prioritization: Instead of only having load shedding rules at the API gateway, individual service teams can now add their own logic and apply more specific, contextual prioritization based on their service's needs.

2. Broader Application: This approach can be used for backend-to-backend communication, making it useful in many more contexts beyond traffic management at the edge (in the API gateway). For example, if a certain backend service is misconfigured and sending too many requests, then service-to-service load shedding can prevent that bug from causing an outage.

3. Cost Efficiency: Previously, Netflix maintained separate clusters for non-critical vs. critical requests to ensure critical requests were prioritized. Now, they can combine different request types into a single cluster and shed low-priority requests when necessary. This helped them save big on cloud costs.

How Netflix Categorizes Requests

Netflix categorizes incoming requests into four predefined priority buckets. This setup was inspired by how Linux lets users categorize network traffic (Linux tc-prio).

The four buckets are:

1. CRITICAL: These requests affect core functionality and will never be shed unless the system is in complete failure. For Netflix, examples of critical traffic are authentication requests, user payments, user-initiated requests to play a video, etc.

2. DEGRADED: These requests affect user experience but aren't critical. They will be progressively shed as load increases. An example of this might be requests for higher resolution video quality options. Netflix could downgrade users from 4K to 1080p if there’s a huge surge in load.

3. BEST_EFFORT: These requests don't affect the user experience and will be handled on a best-effort basis. They may be shed progressively during normal operation. An example might be requests related to collecting analytics.

4. BULK: This category is for background work that can be routinely shed without impacting core services. This might be batch processing jobs or other tasks that can be retried later.

Netflix services use various request attributes (HTTP headers or the request body) to assign each request to one of these buckets.
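To make the categorization step concrete, here is a minimal Python sketch of mapping request headers to one of the four buckets. The header names (x-request-priority, x-request-type) and the mapping rules are assumptions made up for illustration; the article does not say which attributes Netflix services actually inspect.

```python
from enum import IntEnum


class Priority(IntEnum):
    """The four buckets from the article; a lower value means more important."""
    CRITICAL = 0
    DEGRADED = 1
    BEST_EFFORT = 2
    BULK = 3


def classify(headers: dict[str, str]) -> Priority:
    """Map request headers to a bucket. All header names here are hypothetical."""
    # An explicit hint from the caller wins if it names a known bucket.
    hint = headers.get("x-request-priority", "").upper()
    if hint in Priority.__members__:
        return Priority[hint]
    # Otherwise fall back to a hypothetical request-type header.
    request_type = headers.get("x-request-type", "")
    if request_type in {"playback", "auth", "payment"}:
        return Priority.CRITICAL
    if request_type == "analytics":
        return Priority.BEST_EFFORT
    if request_type == "batch":
        return Priority.BULK
    # Unknown traffic is treated as user-facing but sheddable.
    return Priority.DEGRADED
```

Defaulting unknown traffic to DEGRADED is a design choice in this sketch, not something the article specifies; a real service would pick its own default.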

CPU-Based Load Shedding

Most services at Netflix autoscale based on CPU utilization, so they also use this metric for their prioritized load shedding framework.

Once CPU utilization hits a predefined threshold, services at Netflix start shedding BULK requests first, then BEST_EFFORT, then DEGRADED and finally CRITICAL.

The percentage of requests (Y axis) that are being shed as CPU utilization (X axis) increases
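The progressive shedding described above can be sketched as a function from CPU utilization to the fraction of requests rejected in each bucket. The thresholds below are made-up numbers and the linear ramp is an assumption; the article only says that shedding starts at a predefined threshold and proceeds from BULK up to CRITICAL.

```python
# Hypothetical CPU thresholds (as a fraction of total CPU): the point where a
# bucket starts being shed and the point where it is fully shed.
SHED_RANGES = {
    "BULK": (0.60, 0.70),
    "BEST_EFFORT": (0.70, 0.80),
    "DEGRADED": (0.80, 0.90),
    "CRITICAL": (0.90, 1.00),
}


def shed_fraction(priority: str, cpu: float) -> float:
    """Fraction of requests in `priority` to reject at CPU utilization `cpu`."""
    start, full = SHED_RANGES[priority]
    if cpu <= start:
        return 0.0
    if cpu >= full:
        return 1.0
    # Linear ramp between the two thresholds (an assumed shape).
    return (cpu - start) / (full - start)


def should_shed(priority: str, cpu: float, roll: float) -> bool:
    """Decide for one request; `roll` is a uniform random number in [0, 1)."""
    return roll < shed_fraction(priority, cpu)
```

In a request handler, a call might look like should_shed("DEGRADED", current_cpu, random.random()), with a rejected request getting a fast error response instead of being processed.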

Netflix ran experiments to test this system. They prevented auto-scaling and dialed up requests per second to 6x the normal volume. The backend successfully shed non-critical traffic first, then critical traffic, while maintaining reasonable latency for successful requests.
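A toy model helps show why a 6x surge without auto-scaling eventually reaches critical traffic too. The sketch below uses strict-priority admission under a fixed capacity, which is a simplification and not Netflix's CPU-based mechanism; the traffic mix and the capacity figure are invented for illustration.

```python
# Buckets in the order they are protected: most important first.
ORDER = ["CRITICAL", "DEGRADED", "BEST_EFFORT", "BULK"]


def admit(incoming: dict[str, int], capacity: int) -> dict[str, int]:
    """Return how many requests per bucket get served with a fixed capacity."""
    served = {}
    remaining = capacity
    for bucket in ORDER:
        # Serve as much of this bucket as the leftover capacity allows.
        served[bucket] = min(incoming.get(bucket, 0), remaining)
        remaining -= served[bucket]
    return served


# Made-up baseline traffic (requests per second) and a fixed capacity,
# standing in for a cluster that is not allowed to auto-scale.
baseline = {"CRITICAL": 40, "DEGRADED": 30, "BEST_EFFORT": 20, "BULK": 10}
capacity = 150
surge = {bucket: rps * 6 for bucket, rps in baseline.items()}
result = admit(surge, capacity)
```

With these numbers, the 6x surge brings 240 critical requests per second against a capacity of 150, so every lower bucket is fully shed and some critical requests are rejected as well. At 2x load, the same model still serves all CRITICAL and DEGRADED traffic and sheds only from the two lowest buckets.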

Load Shedding Anti-Patterns

Netflix identified two main anti-patterns to avoid:

1. No Shedding: If load shedding doesn’t “kick in” quickly enough, latency rises for all requests. Even worse, it might result in a “death spiral” where one instance becomes unhealthy and that leads to more load on other instances. This eventually causes all instances to become unhealthy and fail before auto-scaling can kick in.

2. Shedding Too Aggressively: The opposite problem is load shedding that is too aggressive. Netflix judges this by looking at how many requests per second are handled successfully after load shedding kicks in. If that number decreases once shedding starts, the shedding is too aggressive: the service is rejecting requests that it could have handled.
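The check for the second anti-pattern can be expressed as a comparison of successful throughput before and after shedding begins. This is a minimal sketch: the function name, the way the two measurements are obtained, and the 5% tolerance are all assumptions, not details from Netflix's post.

```python
def shedding_too_aggressive(success_rps_before: float,
                            success_rps_after: float,
                            tolerance: float = 0.05) -> bool:
    """True if successful throughput fell by more than `tolerance` once shedding began."""
    if success_rps_before <= 0:
        return False  # No baseline to compare against.
    # Relative drop in successfully handled requests per second.
    drop = (success_rps_before - success_rps_after) / success_rps_before
    return drop > tolerance
```

The tolerance exists so that ordinary measurement noise does not trigger the check; a healthy shedder should keep successful throughput roughly flat while rejecting the excess.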

Results

Despite being implemented recently, service-level load shedding has already helped prevent outages at Netflix.

On one occasion, an infrastructure outage impacted streaming for many users on the site. After the outage was resolved, Netflix saw a 12x spike in requests per second (due to a backlog of queued requests).

This could have caused a second outage as their systems weren’t scaled to handle this spike. However, the prioritized load shedding system was able to shed non-critical traffic and successfully served 99.4% of critical requests.

Tech Snippets

The Sorry State of Java Deserialization

This is a really interesting read that looks at the performance of different Java deserialization methods using a simplified version of the “1 billion rows challenge” (read 1 billion temperature measurements and then aggregate them with min/max/average).

The author benchmarks Parquet, Protobuf, JDBC and more. As you might’ve guessed from the title, he found the results to be *very* subpar.

www.marginalia.nu/log/a_110_java_io

A Terrible Way to Jump Into Colocating Your Own Stuff

If you want to run your own infrastructure, you don’t have to build your own facilities. One popular strategy is “colocation” where you rent space in a building to put your servers.

This way, you don’t have to worry about handling internet connectivity, cooling, power, physical security, etc.

This post goes through some of the steps you might have to take when you start colocating.

rachelbythebay.com/w/2024/09/22/colo

One Year of Rust in Production

Dmitry is a senior software engineer and a solo-preneur. He’s the founder of JustFax, a side business that he’s built with Rust and he also uses Rust for his day job.

He wrote an interesting blog post delving into his experiences after using Rust for a year in production.

Like many other developers, Dmitry was impressed by the type-safety of Rust and how resilient his application was (he never experienced a Rust process crash).

In terms of the tooling, Dmitry had mixed feelings. He liked how stable the tooling was (unlike JavaScript or Python for example) but faced issues with long compile times and poor documentation for many tools.

yieldcode.blog/post/one-year-of-rust-in-production