
How Facebook Encodes Videos

Hey Everyone!

  • How Khan Academy rebuilt their Python 2 monolith into a services-oriented backend written in Go.

    • Why Khan Academy picked Go and a brief overview of the language

    • How engineers used Side-by-Side testing to ensure the new backend was functioning properly

    • Results from the rewrite and lessons learned

  • How Facebook Encodes Videos

    • Adaptive Bitrate Streaming vs. Progressive Streaming and a brief intro to video encoders

    • Facebook’s process for encoding videos.

    • Facebook’s Benefit-Cost Model for determining the priority of an encoding job in their priority queue

  • Tech Snippets

    • Code Catalog - A collection of well written open source code in a variety of languages

    • How to measure Tech Debt

    • A curated GitHub repo with best practices for Site Reliability Engineering

We also have a solution to our last Google interview question and a new question from Microsoft.

The Tech Career Growth newsletter is a must-read if you're trying to advance your career as a developer.

There are tons of resources for technical skills, but very few on the non-coding topics that matter the most to your career.

Topics like

  • best practices for code review

  • building a relationship with your manager

  • how to master your performance review

Tech Career Growth covers all of this and more!

Grab the insights + frameworks from senior engineers and managers at top tech companies by subscribing.

It's free!

This is a cross-promotion with Tech Career Growth. I'm a big fan of their newsletter!

Hundreds of millions of videos are uploaded to Facebook every day.

In order to deliver these videos with high quality and little buffering, Facebook uses a variety of video codecs to compress and decompress videos. They also use Adaptive Bitrate Streaming (ABR).

We’ll first give a bit of background information on what ABR and video codecs are. Then, we’ll talk about the process at Facebook.

Progressive Streaming vs. Adaptive Bitrate Streaming

Progressive Streaming is where a single video file is streamed over the internet to the client.

The video will automatically expand or contract to fit the screen you are playing it on, but regardless of the device, the video file size will always be the same.

(Image source: HughesNet)

There are numerous issues with progressive streaming.

  • Quality Issue - Your users will have different screen sizes, so the video will be stretched/pixelated if their screen resolution is different from the video’s resolution.

  • Buffering - Users with a poor internet connection download the same file as users with a fast connection, so they experience much more buffering.

Adaptive Bitrate Streaming is where the video provider creates different versions of the video for each of the screen sizes they want to target.

They can encode the video into multiple resolutions (480p, 720p, 1080p) so that users with slow internet connections can stream a smaller video file than users with fast internet connections.

The player client can detect the user’s bandwidth and CPU capacity in real time and switch between streaming the different encodings depending on available resources.
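That switching logic can be sketched in a few lines of Python; the rendition ladder and the bandwidth headroom factor below are invented for illustration, not Facebook's actual values.

```python
# Illustrative ABR rendition ladder: resolution -> approximate bitrate (kbps).
# These numbers are made up for the example.
RENDITIONS = [
    ("1080p", 4500),
    ("720p", 2500),
    ("480p", 1000),
    ("360p", 600),
]

def pick_rendition(measured_bandwidth_kbps, headroom=0.8):
    """Pick the highest-bitrate rendition that fits within the measured
    bandwidth, leaving some headroom to absorb throughput dips."""
    budget = measured_bandwidth_kbps * headroom
    for name, bitrate in RENDITIONS:  # sorted highest quality first
        if bitrate <= budget:
            return name
    return RENDITIONS[-1][0]  # fall back to the smallest rendition

print(pick_rendition(6000))  # fast connection → 1080p
print(pick_rendition(900))   # slow connection → 360p
```

A real player re-runs this decision continuously as it measures download throughput, which is what lets the stream "adapt" mid-playback.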

You can read more about Adaptive Bitrate Streaming here.

Video Codec

A video codec compresses and decompresses digital video files.

Transmitting uncompressed video data over a network is impractical due to the size (tens to hundreds of gigabytes).

Video codecs solve this problem by compressing video data and encoding it in a format that can later be decoded and played back.

Examples of common codecs include H264 (AVC), MPEG-1, VP9, etc.

The various codecs have different trade-offs between compression efficiency, visual quality, and how much computing power is needed.

More advanced codecs like VP9 provide better compression performance over older codecs like H264, but they also consume more computing power.
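To make the trade-off concrete, here's a small sketch that builds (but doesn't run) ffmpeg command lines for an H.264 encode and a VP9 encode of the same source. `libx264`, `libvpx-vp9`, and CRF-based quality targeting are standard ffmpeg options, though the specific CRF values here are just illustrative.

```python
def encode_command(src, dst, codec, crf):
    """Build an ffmpeg command line (as a list of args) for one encode.
    The commands are only constructed here, not executed."""
    cmd = ["ffmpeg", "-i", src, "-c:v", codec, "-crf", str(crf)]
    if codec == "libvpx-vp9":
        # VP9's constant-quality mode requires the bitrate cap set to 0.
        cmd += ["-b:v", "0"]
    return cmd + [dst]

# H.264: cheaper to encode, larger files at comparable quality.
h264 = encode_command("dog.mp4", "dog_h264.mp4", "libx264", 23)
# VP9: better compression, but considerably more CPU time.
vp9 = encode_command("dog.mp4", "dog_vp9.webm", "libvpx-vp9", 31)

print(" ".join(h264))
print(" ".join(vp9))
```

In practice an encoding pipeline would run a command like this once per resolution in the ABR ladder.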

You can read more about video codecs here.

Facebook’s Process for Encoding Videos

So, you upload a video of your dog to Facebook. What happens next?

Once the video is uploaded, the first step is to encode the video into multiple resolutions (360p, 480p, 720p, 1080p, etc.)

Next, Facebook’s video encoding system will try to further improve the viewing experience by using advanced codecs such as H264 and VP9.

The encoding job requests are each assigned a priority value, and then put into a priority queue.

A specialized encoding compute pool then handles the job.

Now, the Facebook web app (or mobile app) and Facebook backend can coordinate to stream the highest-quality video file with the least buffering to people who watch your video.

A key question here is how Facebook should assign priority values to these jobs.

The goal is to maximize everyone’s video experience by quickly applying more compute-intensive codecs to the videos that are watched the most.

Let’s say Cristiano Ronaldo uploaded a video of his dog at the same time that you uploaded your video.

Ronaldo’s video will probably get far more viewers than yours, so Facebook will want to prioritize encoding his video (and give those viewers a better experience).

They’ll also want to use more computationally-expensive codecs (that result in better compression ratios and quality) for Ronaldo.

The Benefit-Cost Model

Facebook’s solution for assigning priorities is the Benefit-Cost model.

It relies on two metrics: Benefit and Cost.

The encoding job’s priority is then calculated by taking Benefit and dividing it by Cost.

Benefit

The benefit metric attempts to quantify how much benefit Facebook users will get from advanced encodings.

It’s calculated as relative compression efficiency * effective predicted watch time.

The effective predicted watch time is an estimate of how much total watch time a video will accumulate in the near future across all of its audience.

Facebook uses a sophisticated ML model to predict the watch time. They talk about how they created the model (and the parameters involved) in the article.

The relative compression efficiency is a measure of how much a user benefits from the codec’s efficiency.

It’s based on a metric called Minutes of Video at High Quality per GB (MVHQ), which measures how many minutes of high-quality video can be streamed per gigabyte of data.

Facebook compares the MVHQ of different encodings to find the relative compression efficiency.

Cost

This is a measure of the amount of logical computing cycles needed to make the encoding family (consisting of all the different resolutions) deliverable.

Some jobs may require more resolutions than others before they’re considered deliverable.

As stated before, Facebook divides Benefit / Cost to get the priority for a video encoding job.
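Putting the pieces together, a minimal sketch of the prioritization might look like this. The efficiency numbers, watch times, and cost figures are invented, and Facebook's real model is far more sophisticated, but the Benefit / Cost ordering is the idea from the article.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class EncodingJob:
    # heapq is a min-heap, so we store the negated priority
    # to pop the highest Benefit/Cost job first.
    neg_priority: float
    video_id: str = field(compare=False)

def priority(relative_compression_efficiency, predicted_watch_time_min,
             compute_cost_cycles):
    """Benefit = relative compression efficiency * predicted watch time;
    priority = Benefit / Cost."""
    benefit = relative_compression_efficiency * predicted_watch_time_min
    return benefit / compute_cost_cycles

queue = []
# A celebrity upload: huge predicted watch time dwarfs the encoding cost.
heapq.heappush(queue, EncodingJob(-priority(1.3, 5_000_000, 800), "ronaldo_dog"))
# A typical upload: modest predicted watch time, same cost.
heapq.heappush(queue, EncodingJob(-priority(1.3, 2_000, 800), "my_dog"))

print(heapq.heappop(queue).video_id)  # → ronaldo_dog
```

Note how a large predicted audience outweighs an expensive codec: that's exactly why the popular video gets the advanced encoding first.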

After encoding, Facebook’s backend will store all the various video files and communicate with the frontend to stream the optimal video file for each user.

For more details, read the full article here.

Tech Snippets

  • Code Catalog - This is a great collection of self-contained snippets of code from various open source projects that show non-trivial patterns and best practices. One example is the BitBoard chess-board representation from Stockfish (one of the strongest chess engines out there).

  • How to Measure Tech Debt - There are many different things that can be "technical debt". Unfixed bugs, lack of test coverage, security issues, poor design/code quality, and more. This is a great post that looks at the different types of tech debt and gives you a framework for how to measure the importance of each type.

  • How They SRE - This is a curated GitHub repository of best practices, tools and techniques for Site Reliability Engineering. Topics include building SRE teams, Monitoring & Observability, Automation, Alerting, Testing and more.

How Khan Academy rewrote their Backend

In 2020, Khan Academy went through a massive rewrite, where they replaced their old Python 2 monolith with a services-oriented backend written in Go.

Kevin Dangoor and Marta Kosarchyn are senior engineers at Khan Academy and they wrote a series of blog posts about the technical choices, execution and results of the rewrite. We’ll be summarizing the series below.

Summary

In late 2019, Khan Academy was looking to upgrade their backend. The site was built on a Python 2 monolith that had worked well for over 10 years.

However, Python 2 was about to reach its official end of life on January 1st, 2020, so Khan Academy engineers decided they had to update.

KA (Khan Academy) had several options:

  • Migrate from Python 2 to Python 3 - This would get KA a 10-15% boost in backend server code performance and Python 3’s language features.

  • Migrate from Python 2 to Kotlin - KA had started using Kotlin for compute-intensive backend tasks since it was more performant than Python 2. Switching from Python to Kotlin could make Khan Academy more responsive and lower server costs.

  • Migrate from Python 2 to Go - Go is a simple and concise language with very quick compilation times, first-class support on Google App Engine and less memory usage than Kotlin (based on KA’s tests).

Of these options, Khan Academy decided to go with the third choice and rewrite their Python 2 monolith in Go.

They ran performance tests and found that Go and Kotlin (on the JVM) perform similarly, with Kotlin being a few percent ahead. However, Go used a lot less memory.

The dramatic performance difference between Go and Python made the effort involved in the switch worth it.

Architecture Diagram

Khan Academy’s Architecture

Brief Overview of Go

Go is a statically typed, compiled programming language that is syntactically similar to C (it’s often described as C for the 21st century). Go was designed at Google by Ken Thompson, Rob Pike and Robert Griesemer.

Ken Thompson and Rob Pike were key employees at Bell Labs, and were instrumental in building the original Unix operating system (and a bunch of other stuff, they developed the UTF-8 encoding for example).

Go includes things like garbage collection, structural typing, and extremely fast compile times.

You can learn about the design of Go from this article by Rob Pike.

Monolith to Services

In addition to switching to Go, Khan Academy decided they would switch to a services-oriented architecture.

Previously, all of Khan Academy’s servers ran the same code and could respond to a request for any part of the website. Separate services were used for storing data and managing caches, but the logic for any request was the same regardless of which server responded.

Despite the additional complexity that comes with a services architecture, Khan Academy decided to go with it because of several big benefits.

  • Faster Deployments - Services can be deployed independently, so deployment and test runs can move more quickly. Engineers will be able to spend less of their time on deployment activities and get changes out more quickly when needed.

  • Limited Impact for Problems - KA engineers can now be more confident that a problem with a deployment will have a limited impact on other parts of the site.

  • Hardware and Configuration - By having separate services, engineers can now choose the right kinds of instances and hosting configuration needed for each service. This helps to optimize both performance and cost.

Despite the change in architecture, Khan Academy plans to continue using Google App Engine for hosting, Google Cloud Datastore for their database, and other Google Cloud products.

The Implementation

In the next part of the series, Kevin Dangoor talks about how Khan Academy rewrote their backend without downtime.

Big rewrites are extremely risky (Joel Spolsky, co-founder and former CEO of Stack Overflow, has a great blog post on this) so KA picked a strategy of incremental rewrites.

The hub of Khan Academy’s new backend is based on GraphQL Federation. Khan Academy is switching from using REST to GraphQL, and GraphQL Federation allows you to combine multiple backend services into one unified graph interface.

This way, you have a single, typed schema for all the data the various backend systems provide that is accessed through the GraphQL gateway. Each backend service provides part of the overall GraphQL schema and the gateway merges all of these separate schemas into one.
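As a toy illustration of the routing idea (not real GraphQL Federation, which is driven by typed schemas and a GraphQL server rather than plain dicts), a gateway merging field resolvers from two invented services could look like:

```python
# Each backend service declares the part of the overall schema it owns.
# Service and field names here are invented for illustration.
content_service = {"Course": lambda cid: {"id": cid, "title": "Intro to Go"}}
progress_service = {"Progress": lambda cid: {"courseId": cid, "percent": 40}}

class Gateway:
    """Toy 'federation' gateway: merges each service's fields into one
    resolver map and routes each query to the service that owns it."""
    def __init__(self, *services):
        self.resolvers = {}
        for svc in services:
            self.resolvers.update(svc)  # merge the schema fragments

    def resolve(self, field, arg):
        return self.resolvers[field](arg)

gw = Gateway(content_service, progress_service)
print(gw.resolve("Course", "c1"))    # served by the content service
print(gw.resolve("Progress", "c1"))  # served by the progress service
```

The client only ever talks to the gateway; which backend service answers a given field is invisible to it, which is what makes incremental migration possible.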

Khan Academy incrementally switched over from the old backend to the new backend.

In order to make sure the new service was working correctly, they would query both the new distributed backend and the old monolith, compare the results, and then return one of them.

KA had a 4-step process for handling the switch-over.

  1. The monolith is in control - At first, the services backend will not have the functionality to answer a specific request, so the GraphQL gateway will route that request to the Python monolith. This request will be noted and KA engineers can write the Go code to handle the request in the new backend.

  2. Side-by-side - Once the new backend can handle the request, KA engineers will switch to side-by-side. In this state, the GraphQL gateway will call both the Python code and the new Go code. It will compare the results, log the instances where there’s a difference, and return the Python result to the user.

  3. Migrated - After lots of testing, the request’s status will be upgraded to migrated. At this point, the gateway will only send traffic to the Go service. In case something goes wrong, the Python code is still there and ready to respond.

  4. Python code removed - Finally, after extensive testing, the Python code is removed.
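The side-by-side state (step 2) might look roughly like this sketch, with invented function names standing in for the two backends:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("side-by-side")

def side_by_side(request, python_backend, go_backend):
    """Call both backends, log any mismatch, and return the trusted
    (Python) result to the user. Backend callables are stand-ins."""
    python_result = python_backend(request)
    try:
        go_result = go_backend(request)
        if go_result != python_result:
            log.warning("mismatch for %r: python=%r go=%r",
                        request, python_result, go_result)
    except Exception:
        log.exception("new Go backend failed for %r", request)
    return python_result  # the monolith stays the source of truth

# Toy backends to exercise the wrapper.
old = lambda req: {"user": req, "streak": 5}
new = lambda req: {"user": req, "streak": 5}
print(side_by_side("alice", old, new))
```

The key property is that a bug in the new Go code can never reach users in this state; it only produces a log entry for engineers to investigate.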

Khan Academy used this process to rewrite their backend with high availability. They were still able to handle this task despite having a massive increase in website traffic.

The bulk of the rewrite was done in 2020, when schools switched to remote learning due to COVID-19 and students, parents and teachers made significantly more use of Khan Academy.

Within a period of 2 weeks, KA saw an increase of 2.5x in usage. You can read about how they handled it in this blog post by Marta Kosarchyn.

Final Results

Early this month, Marta Kosarchyn published a blog post detailing the final results of Khan Academy’s rewrite.

As of August 30th, 2021, the new services backend was handling 95% of all traffic to the site.

They were able to meet the completion estimate they’d made 20 months prior (despite the massive bump from COVID) and achieve their performance goals for the new backend. Here are some of the key lessons learned:

  • Avoid Scope Creep - At every turn, engineers sought to do as direct a port as possible from Python to Go, while still ending up with code that read like Go code. Khan Academy’s codebase had areas that they wished were structured differently, but if engineers tried to tackle those issues at the same time as the Go rewrite, they would never finish.

  • Side by side testing - We’ve already discussed this, but the side-by-side testing approach was critical to the success of the rewrite. It was an efficient way to make sure that the functionality being replaced was equivalent.

  • Borderless Engineering - Engineers would work on various product code areas and new services, stepping beyond the borders of their usual product-area responsibility. This had to be done carefully, making sure that engineers spent sufficient time learning a new service area to be able to contribute thoughtfully. It meant that engineers would sometimes switch teams, or service ownership would transfer between teams.

Interview Question

A robot is located at the top-left corner of an m x n grid.

The robot can only move either down or right at any point in time.

The robot is trying to reach the bottom-right corner of the grid.

How many possible unique paths are there?

Previous Question

As a refresher, here’s the previous question

You are given an integer array called nums. You are also given an integer called target.

Find 3 integers in nums such that the sum is closest to target.

Return the sum of the 3 integers.

Each input will have exactly one solution.

Here's the solution

We can solve this question by first sorting nums from smallest to greatest.

After sorting, we’ll iterate through all the integers in nums.

Our loop counter will be the variable i.

nums[i] will be one of the numbers in our sum.

Now, we have to find the other 2 numbers greater than nums[i] that minimize the distance between sum and target.

We do that with two pointers, L and R.

L points to nums[i + 1] and R points to the last number in nums.

Now, we take the sum of the integers at i, L and R.

If the sum is closer to the target than any of the previous sums, then we’ll store this sum as our closest.

If the sum is greater than our target, then we’ll decrement R. This will make the sum smaller.

If the sum is less than our target, then we’ll increment L. This will make the sum larger.

We’ll end the current iteration when L is greater than or equal to R.

Then, we’ll increment i.

After the for loop ends, we can return the closest sum.

Here’s the Python 3 code.
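One way to implement the two-pointer approach described above:

```python
def three_sum_closest(nums, target):
    """Sort nums, then for each index i use two pointers L and R
    to close in on the sum nearest to target."""
    nums.sort()
    closest = nums[0] + nums[1] + nums[2]
    for i in range(len(nums) - 2):
        L, R = i + 1, len(nums) - 1
        while L < R:
            current = nums[i] + nums[L] + nums[R]
            if abs(current - target) < abs(closest - target):
                closest = current
            if current > target:
                R -= 1          # make the sum smaller
            elif current < target:
                L += 1          # make the sum larger
            else:
                return current  # an exact match is as close as it gets
    return closest

print(three_sum_closest([-1, 2, 1, -4], 1))  # → 2
```

Sorting costs O(n log n) and the two-pointer scan runs in O(n) per index, so the overall time complexity is O(n²) with O(1) extra space.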