How Khan Academy Rewrote their Backend

Khan Academy's switch from a Python 2 monolith to a services-oriented backend written in Go.

December 27, 2021

Hey Everyone,

Hope you’re all enjoying your holidays!

Today we’ll be talking about

How Khan Academy rebuilt their Python 2 monolith into a services-oriented backend written in Go.
- Why Khan Academy picked Go and a brief overview of the language
- How engineers used Side-by-Side testing to ensure the new backend was functioning properly
- Results from the rewrite and lessons learned
Plus, a couple awesome tech snippets on
- How Etsy built their bidding system for Etsy Ads
- How Databricks uses Scala at Scale
- Lessons Learned from Packaging 10,000 C++ Projects at Bloomberg
- How MDN’s autocomplete search works
- An awesome GitHub repo with resources on Web Dev

We also have a solution to our last coding interview question, plus a new question from Microsoft.

Quastor is a free Software Engineering newsletter that sends out summaries of technical blog posts, deep dives on interesting tech and FAANG interview questions and solutions.

How Khan Academy rewrote their Backend

Khan Academy recently went through a massive rewrite, where they replaced their old Python 2 monolith with a services-oriented backend written in Go.

Marta Kosarchyn is the CTO of Khan Academy and Kevin Dangoor was a Principal Software Architect there. They wrote a series of blog posts about the technical choices, execution and results of the rewrite. We’ll be summarizing the series below.

Summary

In late 2019, Khan Academy was looking to upgrade their backend. The site was built on a Python 2 monolith and it worked well for over 10 years.

However, Python 2 was about to reach the official end of life on January 1rst, 2020 so Khan Academy engineers decided they had to update.

KA (Khan Academy) had several options

Migrate from Python 2 to Python 3 - This would get KA a 10-15% boost in backend server code performance and Python 3’s language features.
Migrate from Python 2 to Kotlin - KA started using Kotlin for compute-intensive backend tasks since it was more performant than Python 2. Switching from Python to Kotlin could mean Khan Academy becomes more responsive and server costs go down.
Migrate from Python 2 to Go - Go is a simple and concise language with very quick compilation times, first-class support on Google App Engine and less memory usage than Kotlin (based on KA’s tests).

Of these options, Khan Academy decided to go with the third choice and do a rewrite of their Python 2 monolith with Go.

They ran performance tests and found that Go and Kotlin (on the JVM) perform similarly, with Kotlin being a few percent ahead. However, Go used a lot less memory.

The dramatic performance difference between Go and Python made the effort involved in the switch worth it.

Brief Overview of Go

Go is a statically typed, compiled programming language that is syntactically similar to C (it’s often described as C for the 21rst century). Go was designed at Google by Ken Thompson, Rob Pike and Robert Griesemer.

Ken Thompson and Rob Pike were key employees at Bell Labs, and were instrumental in building the original Unix operating system (and a bunch of other stuff, they developed the UTF-8 encoding for example).

Go includes things like garbage collection, structural typing, and extremely fast compile times.

You can learn about the design of Go from this article by Rob Pike.

Monolith to Services

In addition to switching to Go, Khan Academy decided they would switch to a services-oriented architecture.

Previously, all of Khan Academy’s servers ran the same code and could respond to a request for any part of the website. Separate services were used for storing data and managing caches, but the logic for any request was the same regardless of which server responded.

Despite the additional complexity that comes with a services architecture, Khan Academy decided to go with it because of several big benefits.

Faster Deployments - Services can be deployed independently, so deployment and test runs can move more quickly. Engineers will be able to spend less of their time on deployment activities and get changes out more quickly when needed.
Limited Impact for Problems - KA engineers can now be more confident that a problem with a deployment will have a limited impact on other parts of the site.
Hardware and Configuration - By having separate services, engineers can now choose the right kinds of instances and hosting configuration needed for each service. This helps to optimize both performance and cost.

Despite the change in architecture, Khan Academy plans to continue using Google App Engine for hosting, Google Cloud Datastore for their database, and other Google Cloud products.

The Implementation

In the next part of the series, Kevin Dangoor talks about how Khan Academy rewrote their backend without downtime.

Big rewrites are extremely risky (Joel Spolsky, co-founder and former CEO of Stack Overflow, has a great blog post on this) so KA picked a strategy of incremental rewrites.

The hub of Khan Academy’s new backend is based on GraphQL Federation. Khan Academy is switching from using REST to GraphQL, and GraphQL Federation allows you to combine multiple backend services into one unified graph interface.

This way, you have a single, typed schema for all the data the various backend systems provide that is accessed through the GraphQL gateway. Each backend service provides part of the overall GraphQL schema and the gateway merges all of these separate schemas into one.

Khan Academy incrementally switched over from the old backend to the new backend.

In order to make sure that the new service is working correctly, they would query both the new distributed backend and the old monolith and compare the results and then return one of them.

KA had a 4 step process for handling the switch-over.

The monolith is in control - At first, the services backend will not have the functionality to answer a specific request, so the GraphQL gateway will route that request to the Python monolith. This request will be noted and KA engineers can write the Go code to handle the request in the new backend.
Side-by-side - Once the new backend can handle the request, KA engineers will switch to side-by-side. In this state, the GraphQL gateway will call both the Python code and the new Go code. It will compare the results, log the instances where there’s a difference, and return the Python result to the user.
Migrated - After lots of testing, the request’s status will be upgraded to migrated. At this point, the gateway will only send traffic to the Go service. In case something goes wrong, the Python code is still there and ready to respond.
Python code removed - Finally, after extensive testing, the Python code is removed.

Khan Academy used this process to rewrite their backend with high availability. They were still able to handle this task despite having a massive increase in website traffic.

The bulk of the rewrite was done in 2020, when schools switched to remote due to COVID-19 and students, parents and teachers made significantly more use of Khan Academy.

Within a period of 2 weeks, KA saw an increase of 2.5x in usage. You can read about how they handled it in this blog post by Marta Kosarchyn.

Final Results

Early this month, Marta Kosarchyn published a blog post detailing the final results of Khan Academy’s rewrite.

As of August 30th, 2021, the new services backend was handling 95% of all traffic to the site.

They were able to meet their initial estimate of completion that they made 20 months prior (despite the massive bump with COVID) and achieve performance goals for the new backend.

They relied on several principles to make the process go smoothly.

Avoid Scope Creep - At every turn, engineers sought to do as direct a port as possible from Python to Go, while still ending up with code that read like Go code. Khan Academy’s codebase had areas that they wished were structured differently, but if engineers tried to tackle those issues at the same time as the Go rewrite, they would never finish.
Side by side testing - We’ve already discussed this, but the side-by-side testing approach was critical to the success of the rewrite. It was an efficient way to make sure that the functionality being replaced was equivalent.

Borderless Engineering - Engineers would work on various product code areas and new services, stepping beyond the borders of their usual product area responsibility. This had to be done carefully, making sure that engineers would spend sufficient time to learn a new service area to be able contribute thoughtfully. It meant that engineers would sometimes switch teams or service ownership would transfer between teams.

Quastor is a free Software Engineering newsletter that sends out summaries of technical blog posts, deep dives on interesting tech and FAANG interview questions and solutions.

Tech Snippets

Spellbook of Modern JavaScript Web Dev - This is an awesome GitHub repo that aggregates a ton of resources on web dev. It has resources on JavaScript and Node, Web Assembly, Web APIs, CSS and PostCSS, Web Design and some resources on Server Side programming.The resources are structured quite well, so it’s a good reference if you’re a web developer.
Scala at Scale at Databricks - Databricks is a software company that makes data engineering easier. It was founded by the creators of Apache Spark. Databricks is one of the largest Scala shops around, with millions of lines of Scala being used in production. This post goes into why they picked Scala, the tooling they use, issues they’ve come across and more.
Lessons Learned from Packaging 10,000 C++ Projects - Bret Brown and Daniel Ruoso are two engineers at Bloomberg who work on Developer Experience. They gave an awesome talk at CppCon 2021 about package management in C++ (which can be a big pain point for many developers).Bloomberg uses many different build systems and does not have a project structure monoculture (individual projects are structured differently according to team preference).
How Etsy built their bidding system for Etsy Ads - Etsy Ads is an advertising platform for Etsy sellers who want to advertise their goods across Etsy’s website and mobile app.You place bids in Etsy’s real-time auction for ad spots on Etsy’s web page.Etsy discusses their contextual bidding algorithm in this blog post, which uses machine learning to automatically determine the amount a seller should bid for an ad spot to maximize their earnings.

How MDN's autocomplete search works - MDN Web Docs is the go-to resource for any front-end engineer when they’re looking up HTML/CSS/JavaScript documentation. They added an autocomplete search feature to the website and this blog post dives into the implementation details.By the way, MDN web docs is open source, so you can see their actual implementation of autocomplete on their github repo.

Interview Question

Given two strings s and t, return the number of distinct subsequences of s which equal t.

A subsequence is a new string formed from the original string by deleting some (or none) of the characters without changing the remaining character’s relative positions.

Example

Input - s = “babgbag”, t = “bag”

Output - 5

Here’s the question in LeetCode.

Previous Solution

As a reminder, here’s our last question

You’re given an integer N as input.

Write a function that determines the fewest number of perfect squares that sum up to N.

Examples

Input - 16

Output - 1 (16 is a perfect square)

Input - 21

Output - 3 (1 + 4 + 16 = 21)

Here’s the question in LeetCode

Solution

Your first instinct might be to go with the greedy approach. Find the largest perfect square that is smaller than N and subtract that. Then, recursively do the same on whatever number is left.

The base case is when N = 0.

However, the greedy approach doesn’t work.

A counter-example is 18.

The largest square that is smaller than 18 is 16.

Then, you’re left with 2 so that would be 2 ones.

The greedy approach will give you an answer of 3, from 16 + 1 + 1.

But, 3 is incorrect. The correct answer is 2, from 9 + 9.

So, we’ll have to try every possible combination of squares.

We’ll subtract every perfect square that is smaller than N and call our function recursively on the difference.

Every function call will return the fewest number of perfect squares needed to reach 0 (our base case) and we’ll keep track of the minimum.

Then, we’ll return the minimum + 1 (since we subtracted by one perfect square in our function).

This approach has a ton of repeated computations, so we can use dynamic programming to improve our time complexity.

One way is top down DP with memoization.

We’ll implement a bottom up DP solution by iteratively building up a table that contains the fewest number of perfect squares to sum all the numbers from 1 to N.

Quastor is a free Software Engineering newsletter that sends out summaries of technical blog posts, deep dives on interesting tech and FAANG interview questions and solutions.