How Twitch Processes Millions of Video Streams

How Twitch routes video streams through their private backbone network for transcoding

Hey Everyone!

Today we’ll be talking about

  • How Twitch Processes Millions of Video Streams

    • Using a Private Backbone Network to ensure low latency

    • Switching away from HAProxy for a custom solution

    • Intelligently routing video to servers for processing

  • Tech Snippets

    • How Long Should You Stay at the Same Job From Tech Career Growth

    • Docusaurus 2.0 Launched (a great open source tool if you want to start a tech blog)

    • Use One Big Server and Thinking About Vertical Scaling

    • The Many Different Flavors of Hashing and Hash Functions


How Twitch Processes Millions of Video Streams

Twitch is a live streaming platform where content creators can stream live video to an audience. There are millions of broadcasters on the platform and tens of millions of daily active users. At any given time, millions of people are streaming video through the Twitch platform (on web, mobile, smart TV, etc.).

Twitch’s Video Ingest team is responsible for developing the distributed systems and services that

  • Acquire live streams from Twitch content creators

  • Perform real-time processing (transcoding, compression, etc.)

  • Provide a high throughput control plane to make the video available for world-wide distribution with low latency

Eric Kwong, Kevin Pan, Christopher Lafata and Rohit Puri are software engineers on the Video Ingest team and they wrote a great blog post on their infrastructure/architecture, problems they encountered and solutions they employed.

Here’s a summary

Twitch maintains nearly a hundred servers (Points of Presence or PoPs) in different geographic regions around the world that streamers and viewers can connect to for uploading/downloading video.

These Points of Presence (PoPs) are connected through Twitch’s private Backbone Network, which is dedicated to transmitting their content. Relying on the public Internet would leave streams susceptible to bottlenecks and instability, so instead 98% of all Twitch traffic stays on this private network.

Between the PoPs are origin data centers, which are also geographically distributed. These origin data centers handle tasks around video processing (like transcoding a livestream into different bitrates/formats for all the various devices that viewers may be using).

Video travels from a streamer’s computer to a PoP. From there, it’s sent to an origin data center for processing and then transmitted to all the PoPs that are close to the stream’s viewers. This is all done over Twitch’s Backbone network.

An image of Twitch's infrastructure, showing how streamers (also known as Broadcasters) connect.

Previously, all the PoPs ran HAProxy (a reverse proxy that is commonly used for load balancing) for forwarding the video streams to the origin data centers. However, Twitch faced several issues with this approach as they scaled.

Inefficient Usage of Origin Data Center Resources - Each PoP was configured to send its video streams to a specific origin data center located in the same geographic area. As a result, the origin data centers for a region ran at full load during that region’s busy hours but sat largely idle the rest of the day. Meanwhile, a region in the middle of its own busy hours had no way to take advantage of that idle capacity in another region’s origin data centers.

Difficult to Handle Unexpected Changes - The relatively static nature of the HAProxy configuration also made it difficult to handle unexpected surges of live video traffic. Reacting to system fluctuations like the loss of capacity of an origin data center was also very difficult.

Creating Intelligest

Twitch decided to revamp the software in their PoPs and completely retire HAProxy.

To replace it, they developed Intelligest, a proprietary ingest routing system that could intelligently distribute live video ingest traffic from the PoPs to the origins.

The Intelligest architecture consists of two components: the Intelligest Media Proxy and the Intelligest Routing Service (IRS). The Intelligest Media Proxy is the data plane component; it runs in all the PoPs and forwards video streams to the various origin data centers. The Intelligest Routing Service is the control plane; it tells the Intelligest Media Proxy which origin data center to send each video stream to.

When a broadcaster starts streaming, their computer transmits video to the nearest Twitch Point of Presence (PoP). The Intelligest Media Proxy running on that PoP extracts the relevant metadata from the stream.

It will then query the Intelligest Routing Service (IRS) and ask which origin data center it should route the video stream to. The IRS service has a real-time view of all of Twitch’s infrastructure and it will make a routing decision based on minimizing latency for the viewers and maximizing utilization of compute resources in all the origins.

The IRS service will send its decision back to the Intelligest Media Proxy, which can then route the video stream to the selected origin data center.
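Conceptually, the data plane / control plane split looks something like the sketch below. The class and field names here are hypothetical (the blog post doesn't publish the actual API); the point is that the proxy makes one control-plane query per stream, then handles the video bytes itself.

```python
# Hypothetical sketch of the PoP-side flow: the media proxy extracts
# stream metadata, asks the routing service (control plane) for an
# origin, then forwards the video bytes (data plane).
from dataclasses import dataclass


@dataclass
class StreamMetadata:
    channel_id: str
    pop: str            # which Point of Presence received the stream
    bitrate_kbps: int


class IntelligestRoutingService:
    """Control plane: decides which origin should process a stream."""

    def pick_origin(self, meta: StreamMetadata) -> str:
        # The real IRS uses live capacity and network data (via
        # Capacitor and The Well); here we just hardcode a choice.
        return "origin-us-west"


class IntelligestMediaProxy:
    """Data plane: runs in the PoP and forwards video to an origin."""

    def __init__(self, irs: IntelligestRoutingService):
        self.irs = irs

    def handle_stream(self, meta: StreamMetadata) -> str:
        origin = self.irs.pick_origin(meta)  # one control-plane query
        # ...the proxy would then stream video segments to `origin`...
        return origin


proxy = IntelligestMediaProxy(IntelligestRoutingService())
origin = proxy.handle_stream(StreamMetadata("chan123", "pop-sfo", 6000))
# origin == "origin-us-west"
```

Keeping the routing decision in a separate service means the proxy stays simple and stateless, while the IRS can be updated with a global view of capacity and network health.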

The Intelligest Routing Service relies on two other services implemented in AWS: Capacitor and The Well.

Capacitor monitors the compute resources in every origin and keeps track of any capacity fluctuations (due to maintenance/failures).

The Well monitors the backbone network and provides information about the status of network links so latency issues are minimized.

The IRS service uses a randomized greedy algorithm to compute routing decisions based on compute resources available, backbone network bandwidth and other factors.
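The post doesn't spell out the algorithm itself, but a randomized greedy policy generally works like this: score every candidate origin, then pick randomly among the top few rather than always taking the single best, so that many concurrent routing decisions don't all pile onto one origin. A toy sketch under those assumptions (the scoring weights are made up):

```python
# Toy randomized-greedy origin selection (illustrative only).
# Each origin has spare compute capacity and a network latency from
# the PoP; lower score = better.
import random


def score(origin):
    # Hypothetical weighting; the real IRS considers more factors,
    # like backbone network bandwidth.
    return origin["latency_ms"] + 100 * (1 - origin["free_capacity"])


def pick_origin(origins, top_n=2, rng=random):
    # Greedy: rank by score. Randomized: choose among the best few
    # so concurrent decisions spread load instead of herding.
    ranked = sorted(origins, key=score)
    return rng.choice(ranked[:top_n])


origins = [
    {"name": "us-west", "latency_ms": 20, "free_capacity": 0.1},
    {"name": "us-east", "latency_ms": 60, "free_capacity": 0.9},
    {"name": "eu-central", "latency_ms": 140, "free_capacity": 0.8},
]
choice = pick_origin(origins)
# choice is one of the two best-scoring origins: us-east or us-west
```

Note how the nearly full us-west origin scores worse than the farther but mostly idle us-east one; the randomization then hedges between the two finalists.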

For more details, you can read the full blog post here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.


Tech Snippets

  • How Long Should You Stay at The Same Job? - Rahul Pandey is an ex-Facebook engineer and he runs a great YouTube channel/App with advice for big tech engineers. In this video, he talks about how long you should stay at your job to best benefit your career. Staying for less than a year at multiple jobs will look bad on your resume. However, staying at a company for too long can also result in a negative perception. Rahul goes into detail on this and talks about how you can address tenure issues during an interview.

  • Use One Big Server - We’ve done a lot of articles on “how company XYZ broke up their monolith”, but a microservices architecture is usually way too complicated/unnecessary for the vast majority of teams. Nima Badizadegan was a software engineer at Google (working on Colossus, their internal distributed file system) and he wrote a great blog post on how vertical scaling can actually be a lot cheaper than you might think. He talks about the most powerful options on various hosting providers, how much they cost and goes through some benchmarks.

  • Docusaurus 2.0 Launched - If you want to create a personal blog or a documentation site for a project you’re working on, Docusaurus is a fantastic open source site generator that you can use to spin up a site super quickly. It’s built and maintained by Facebook open source and powers the documentation sites for projects like React Native, Supabase, Algolia and thousands of others. Btw, the lead maintainer of Docusaurus (Sébastien Lorber) writes an awesome newsletter on React - This Week in React. I’d highly recommend checking it out if you’re a frontend dev.

  • The Many Flavors of Hashing - The topic of hashing goes far beyond what you'd read in standard algorithms textbooks like CLRS or Algorithm Design Manual. This is a great blog post that delves into different types of hash algorithms like integrity hashes for checksums, cryptographic hash functions, rolling hash functions and more.

Interview Question

You have a relational database with an Employee Table.

The Employee Table has a column with Employee Ids (primary key) and a column with Employee Salaries.

Write an SQL query to get the nth highest salary from the Employee table where n is a parameter.

Previous Question

As a reminder, here’s our last question

You are given an array of k linked lists.

Each list is sorted in ascending order.

Merge all the linked lists into one sorted linked list and return it.

Solution

We'll start with k linked lists and we'll merge every pair of linked lists (using the combineLists function).

This leaves us with k/2 remaining linked lists.

We can then repeat this merging process on the remaining lists, where we combine every pair to result in k/4 linked lists, k/8 linked lists, and so on.

We continue this process until we reach our final linked list.

If we have an odd number of linked lists to combine, we set aside the last linked list and carry it over to the next round of merging.

The number of nodes we traverse in each round of merging scales with k*N, where N is the average number of nodes per linked list and k is the number of linked lists. In other words, k*N is the total number of nodes across all the lists.

We'll have to repeat the merge process log(k) times, so this means our time complexity will be O(k*N*log(k)).
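Here's a minimal Python sketch of that pairwise merging (the combineLists helper from the description is combine_lists here):

```python
# Merge k sorted linked lists by repeatedly combining pairs.
class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next


def combine_lists(a, b):
    """Merge two sorted linked lists into one sorted list."""
    dummy = tail = ListNode()
    while a and b:
        if a.val <= b.val:
            a, tail.next = a.next, a
        else:
            b, tail.next = b.next, b
        tail = tail.next
    tail.next = a or b  # attach whatever remains
    return dummy.next


def merge_k_lists(lists):
    if not lists:
        return None
    while len(lists) > 1:
        merged = []
        for i in range(0, len(lists) - 1, 2):
            merged.append(combine_lists(lists[i], lists[i + 1]))
        if len(lists) % 2 == 1:
            merged.append(lists[-1])  # odd count: carry the last list over
        lists = merged                # k -> k/2 -> k/4 -> ... -> 1
    return lists[0]


def from_list(xs):
    head = None
    for x in reversed(xs):
        head = ListNode(x, head)
    return head


def to_list(head):
    out = []
    while head:
        out.append(head.val)
        head = head.next
    return out


result = merge_k_lists([from_list([1, 4, 5]),
                        from_list([1, 3, 4]),
                        from_list([2, 6])])
# to_list(result) == [1, 1, 2, 3, 4, 4, 5, 6]
```

Each round traverses all k*N nodes, and the number of lists halves every round, giving the O(k*N*log(k)) time complexity described above.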