How GitHub built their Code Search feature

Plus, a curated list of resources for Data Engineering, a guide to troubleshooting network bugs and more.

Hey Everyone!

Today we’ll be talking about

  • How GitHub Built Their Search Feature

    • Issues with Grep and Elasticsearch

    • Using an Inverted Index data structure

    • How they ingest and index repositories

    • How they run queries

  • Tech Snippets

    • A Curated List of Resources and Tools on Data Engineering (Backend)

    • Your Email Validation Logic is Wrong (Full Stack)

    • Forget Multi-tasking. It’s Context-Switching that Matters (Developer Productivity)

    • Network Troubleshooting Explained - It’s not Always DNS (Backend)

    • A Manager’s Guide to Holding Your Team Accountable (Engineering Leadership)

How Hulu uses InfluxDB to Scale to Over 1 Million Metrics a Second

Hulu is a TV/movie streaming service with nearly 50 million subscribers.

In order to meet their reliability targets, they collect and process over a million metrics per second covering performance/machine stats, application data and more. Collecting this data helps them quickly identify issues and resolve them before an outage.

Hulu’s SRE team wrote a great blog post talking about different challenges they faced with scaling their data pipeline. They were finally able to find a scalable solution by switching to the InfluxDB time series platform.

All machines at Hulu run the Telegraf daemon which listens to and reports all machine/application stats. These metrics are sent to Kafka, from where they’re read by a writer layer and inserted into InfluxDB.

Read the full blog post to learn about Hulu’s architecture, why they chose InfluxDB and the other options they considered.

sponsored

How GitHub Built Their Search Feature

One of GitHub’s coolest features is code search, where you can search across every public repository on GitHub in just a few seconds.

The company has over 200 million repositories with hundreds of terabytes of code and tens of billions of documents, so solving this problem is super challenging.

Timothy Clem is a software engineer at GitHub and he wrote a great blog post on how they do this. He goes into depth on other possible approaches, why they didn’t work and what GitHub eventually came up with.

Here’s a summary (with additional context).

The core of GitHub’s code search feature is Blackbird, a search engine built in Rust that’s specifically targeted for searching through programming languages.

The full text search problem is where you examine all the words in every stored document to find matches for a search query. This problem has been studied extensively, and there are tons of great open source solutions; two popular options are Lucene and Elasticsearch.

However, code search is different from general text search. Code is already designed to be understood by machines, and the search engine should take advantage of that structure when determining relevance.

Additionally, searching code has unique requirements: ignoring certain punctuation (parentheses or periods), not stripping words from queries, no stemming (reducing a word to its root form, e.g. walking and walked both stem from walk), and more.
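
To make that concrete, here’s a toy comparison (in Python) of how a natural-language tokenizer and a code-aware tokenizer treat the same line of code. It’s purely illustrative and isn’t GitHub’s actual tokenizer; the point is that stemming and stop-word removal that help with prose would destroy the exact symbols a developer is searching for.

```python
# Toy comparison of a natural-language tokenizer vs. a code-aware one.
# Illustrative only; this is not GitHub's actual tokenizer.

import re

def natural_language_tokenize(text):
    """Lowercase, strip punctuation, drop stop words, crude stemming."""
    stop_words = {"the", "a", "is", "in", "for"}
    words = re.findall(r"[a-z]+", text.lower())
    return [w.rstrip("s") for w in words if w not in stop_words]  # crude plural "stemming"

def code_tokenize(text):
    """Keep exact identifiers, case, and the punctuation that matters in code."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*|[(){}\[\].,;:]", text)

line = "for (item in self.items): results.append(item)"
print(natural_language_tokenize(line))  # ['item', 'self', 'item', 'result', 'append', 'item']
print(code_tokenize(line))              # ['for', '(', 'item', 'in', ...] -- exact symbols preserved
```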

On top of that, GitHub’s scale is massive. They previously used Elasticsearch, and it took them months to index all of the code on GitHub (8 million repositories at the time). Today, they have over 200 million repositories, so they needed a faster alternative.

The Inverted Index Data Structure

If you have to do a search on your local machine, you’ll probably use grep (a tool to search plain text files for lines that match the query).

The grep utility uses pattern matching algorithms like Boyer-Moore, which preprocesses the search pattern and then scans through the text to find any matches. It uses a couple of heuristics to skip over parts of the text, but the running time still scales linearly with the size of the corpus in the best case (and quadratically in the worst case).

This is great if you’re just working on your local machine. But it doesn’t scale if you’re searching billions of documents.
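
As a rough sketch of what grep-style searching looks like, here’s a naive scan over a set of in-memory documents (Python, for illustration only). The cost grows with the total amount of text being scanned, which is exactly what breaks down across billions of documents.

```python
# A minimal sketch of grep-style searching: scan every document for the query.
# Fine locally, but the cost grows with the total size of the corpus.

def scan_search(documents, query):
    """Return the IDs of documents whose text contains `query`."""
    hits = []
    for doc_id, text in documents.items():
        if query in text:        # substring scan; still has to read the whole text
            hits.append(doc_id)
    return hits

docs = {
    1: "fn parse_query(input: &str) -> Ast { ... }",
    2: "def parse_query(input): return ast.parse(input)",
    3: "SELECT * FROM repositories WHERE stars > 1000",
}
print(scan_search(docs, "parse_query"))  # [1, 2]
```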

Instead, full text search engines (Elasticsearch, Lucene, etc.) utilize the Inverted Index data structure.

This is very similar to what you’ll find in the index section of a textbook. You have a list of all the words in the textbook ordered alphabetically. Next to each word is a list of all the page numbers where that word appears. This makes it extremely fast to look up any word in the textbook.

To create an inverted index, you’ll first search through all the documents in your corpus.

For each document, extract all the tokens (a token is a sequence of characters that represents a certain concept/meaning).

Do some processing on these tokens (combine tokens with similar meanings, drop any common words like the or a).

Then, create some type of index where the keys are all the tokens and the values are all the document IDs where that token appears.
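
Here’s a minimal sketch of those steps in Python. Real engines do far more token processing and compress their posting lists, but the core mapping from token to document IDs looks like this:

```python
# A minimal inverted index over word tokens, following the steps above.

import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def build_inverted_index(documents):
    """Map each token to the set of document IDs it appears in."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Documents containing every token in the query (an AND query)."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "a quick red fox",
}
index = build_inverted_index(docs)
print(search(index, "quick fox"))  # {1, 3}
print(search(index, "brown"))      # {1, 2}
```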

For more details, check out this fantastic blog post by Artem Krylysov where he builds a full text search engine (credits to him for the image above).

At GitHub, they used a special type of inverted index called an ngram index. An ngram is just a sequence of characters of length n. So, if n = 3 (a trigram index), then all the tokens will be 3 characters long.
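
Here’s a toy trigram index in Python to illustrate the idea (this is not Blackbird’s implementation). The query is broken into trigrams, the posting lists are intersected, and the candidates are verified against the actual content, since trigram matches can be false positives:

```python
# A toy trigram index: tokens are overlapping 3-character sequences.

from collections import defaultdict

def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_trigram_index(documents):
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index

def search(index, documents, query):
    # Intersect the posting lists of the query's trigrams, then verify.
    candidates = set.intersection(*(index.get(g, set()) for g in trigrams(query)))
    return sorted(d for d in candidates if query in documents[d])

docs = {
    1: "fn limit_results(results: Vec<Hit>)",
    2: "let limit = config.max_results;",
    3: "print('no match here')",
}
index = build_trigram_index(docs)
print(search(index, docs, "limit"))  # [1, 2]
```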

Building the Inverted Index

GitHub has to build the inverted index data structure and create a pipeline to update it.

The corpus of GitHub repos is too large for a single inverted index, so they sharded the corpus by Git blob object ID. In case you’re unfamiliar, Git stores all your source code/text/image/configuration files as separate blobs where each blob gets a unique object ID.

This object ID is determined by taking the SHA-1 hash of the contents of the file being stored as a blob. Therefore, two identical files will always have the same object ID.

With this scheme, GitHub deduplicates identical files and also distributes blobs evenly across shards. This helps prevent hot shards caused by a single popular repository that gets significantly more search traffic than the others, since every repository ends up split up and indexed across multiple shards.
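
As a sketch of how that works, here’s how a Git blob object ID is computed and how it could be mapped to a shard. The shard count and the modulo routing below are made up for illustration; they aren’t GitHub’s actual scheme.

```python
# Computing a Git blob object ID and using it to pick a shard.
# Git hashes a small header plus the file contents with SHA-1, so two
# identical files always get the same object ID (and land on the same shard).

import hashlib

NUM_SHARDS = 32  # hypothetical shard count

def git_blob_object_id(contents: bytes) -> str:
    header = f"blob {len(contents)}\0".encode()
    return hashlib.sha1(header + contents).hexdigest()

def shard_for_blob(object_id: str) -> int:
    return int(object_id, 16) % NUM_SHARDS  # illustrative routing only

readme = b"# my-project\nA small example repository.\n"
object_id = git_blob_object_id(readme)
print(object_id)                # same contents always give the same ID
print(shard_for_blob(object_id))
```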

Here’s a high level overview of the ingest and indexing system.

When a repository is added/updated, GitHub will publish an event to Kafka with data on the change.

Blackbird’s ingest crawlers will then look at each of those repositories, fetch the blob content, extract tokens and create documents that will be the input to indexing.

These documents are then published to another Kafka topic. Here, the data is partitioned between the different shards based on the blob object ID.

The Blackbird indexer takes these documents and inserts them into the shard’s inverted index, adding entries to the token index as well as the other indices (language, owners, etc.). Once the inverted index has accumulated a certain number of updates, it gets flushed to disk.
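
Here’s a rough sketch of what the indexer side of that pipeline could look like. The Kafka consumer is replaced by plain method calls and the flush threshold is invented, so treat this as the shape of the idea rather than Blackbird’s actual code:

```python
# Sketch of a per-shard indexer: accept documents (which would arrive via a
# Kafka topic partitioned by blob object ID), update an in-memory inverted
# index, and flush to disk after a fixed number of updates.

import json
from collections import defaultdict

FLUSH_THRESHOLD = 1000  # hypothetical; flush after this many updates

class ShardIndexer:
    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.index = defaultdict(set)
        self.pending_updates = 0

    def add_document(self, doc):
        """`doc` carries the blob's ID and the tokens the crawler extracted."""
        for token in doc["tokens"]:
            self.index[token].add(doc["blob_id"])
        self.pending_updates += 1
        if self.pending_updates >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self):
        path = f"shard-{self.shard_id}.index.json"
        with open(path, "w") as f:
            json.dump({tok: sorted(ids) for tok, ids in self.index.items()}, f)
        self.pending_updates = 0

indexer = ShardIndexer(shard_id=7)
indexer.add_document({"blob_id": "a94a8fe5", "tokens": ["fn", "parse", "query"]})
```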

Querying the Inverted Index

Here’s an overview of querying the Blackbird search engine.

The Blackbird Query service takes the user’s query, parses it into an abstract syntax tree, and rewrites it with additional metadata on permissions, scopes and more.

Then, they fan out and send concurrent search requests to each of the shards in the cluster. On each individual shard, they do some additional conversion of the query to look up information in the indices.

They aggregate the results from all the shards, sort them by similarity score and then return the top 100 back to the frontend.
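
Here’s a small sketch of that fan-out and aggregation step. The Shard interface and the scores are invented for illustration; the point is that every shard is queried concurrently and the merged results are sorted by score before keeping the top 100.

```python
# Fan out a query to every shard concurrently, merge, sort by score, keep top 100.

from concurrent.futures import ThreadPoolExecutor

TOP_N = 100

class Shard:
    def __init__(self, hits):
        self._hits = hits  # (doc_id, score) pairs this shard would return

    def search(self, query):
        return self._hits

def fan_out_search(shards, query):
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda s: s.search(query), shards))
    merged = [hit for hits in per_shard for hit in hits]
    merged.sort(key=lambda hit: hit[1], reverse=True)  # highest score first
    return merged[:TOP_N]

shards = [Shard([("doc-1", 0.92), ("doc-4", 0.40)]), Shard([("doc-7", 0.81)])]
print(fan_out_search(shards, "parse_query"))
# [('doc-1', 0.92), ('doc-7', 0.81), ('doc-4', 0.4)]
```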

Results

Their p99 response times from individual shards are on the order of 100 milliseconds. Total response times are a bit longer due to aggregating responses, checking permissions and doing things like syntax highlighting.

The ingest pipeline can publish 120,000 documents per second, so indexing 15 billion+ docs can be done in under 40 hours.
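
That math checks out: 15 billion documents at 120,000 documents per second is 15,000,000,000 ÷ 120,000 = 125,000 seconds, or roughly 35 hours.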

For more details, read the full blog post here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.


Test out InfluxDB Cloud for Free

Working with large sets of time-stamped data has its challenges.

Fortunately, InfluxDB is a time series platform purpose-built to handle the unique workloads of time series data.

Using InfluxDB, developers can ingest billions of data points in real-time with unbounded cardinality, and store, analyze, and act on that data – all in a single database.

No matter what kind of time series data you’re working with – metrics, events, traces, or logs – InfluxDB Cloud provides a performant, elastic, serverless time series platform with the tools and features developers need. Native SQL compatibility makes it easy to get started with InfluxDB and to scale your solutions.

Companies like IBM, Cisco, and Robinhood all rely heavily on InfluxDB to build and manage responsive backend applications, to power predictive intelligence, and to monitor their systems for insights that they would otherwise miss.

See for yourself by quickly spinning up InfluxDB Cloud and testing it out for free.

sponsored

Tech Snippets

People generally tend to avoid confrontation, so this is a useful blog post on holding reports accountable and giving constructive feedback. It gives helpful starter questions on how you should approach an accountability/feedback conversation and how you can properly craft these discussions to develop a high functioning team.

Here’s a fantastic repo with a ton of different tools/projects covering different spectrums of Data Engineering. The repo covers databases, data ingestion, batch/stream processing and more.

If you find Quastor useful, you should check out Pointer.io. It’s a reading club for software developers that sends out super high quality engineering-related content.

It’s read by CTOs, engineering managers and senior developers so you should definitely sign up if you’re on that path (or if you want to go down that path in the future). It’s completely free! (cross promo)

Email addresses are actually a lot more complicated than you might first assume. This is a list of 14 potential edge cases you might run into if you deal with a ton of emails. It contains things you might not know about using quotes in an email address, case-sensitivity, emojis and more.

If you’d like to get more done, you should do your best to avoid unnecessary context-switches. Here’s a terrific article from HubSpot engineering with lots of actionable tips on how you can reduce context switches and stay in flow. If you get a slack notification for example, you should first assess how urgent it is and weigh the cost of being interrupted from flow. Whenever you switch contexts, that should be a conscious choice.

This is a great blog post that goes through the layers of the TCP/IP model and explains how to debug network issues in each layer. The post gives a list of CLI commands that’ll be useful when you’re trying to figure out what the issue is.