How GitHub Built Their Code Search Feature

We'll talk about the data structures and tech underlying GitHub's code search feature. Plus, Sergey Brin's recent talk on Gemini Pro, improving productivity with focus and more.

Arpan KG
March 04, 2024

Hey Everyone!

Today we’ll be talking about

How GitHub built their Code Search Feature
- Issues with Grep and Elasticsearch
- Using an Inverted Index data structure
- How they ingest and index repositories
- How they run queries
How to use 1:1 Meetings with your Manager to Propel your Career
- This is a great video by Rahul Pandey, a former Tech Lead at Facebook and founder of Taro.
- Don’t waste your 1:1 meetings on things that can be discussed in the open. Have “awkward” conversations.
- Take notes during the meetings in a document that’s shared with your manager. This gives your manager a way to track your progression over time.
Tech Snippets
- Sergey Brin (co-founder of Google) does an impromptu Q&A on Gemini Pro
- Your Email Validation Logic is Wrong
- Improving Productivity by avoiding Context Switching
- A Manager’s Guide to Holding your Team Accountable
- Network Troubleshooting Explained (It’s not always DNS)

How GitHub Built Their Search Feature

One of GitHub’s coolest features is the code search, where you can search across every public repository on GitHub in just a few seconds.

The company has over 200 million repositories with hundreds of terabytes of code and tens of billions of documents so solving this problem is super challenging.

Timothy Clem is a software engineer at GitHub and he wrote a great blog post on how they do this. He goes into depth on other possible approaches, why they didn’t work and what GitHub eventually came up with.

Here’s a summary (with additional context)

The core of GitHub’s code search feature is Blackbird, a search engine built in Rust that’s specifically targeted for searching through programming languages.

The full text search problem is where you examine all the words in every stored document to find a search query. This problem has been extremely well studied and there’s tons of great open source solutions with some popular options being Lucene and Elasticsearch.

However, code search is different from general text search. Code is already designed to be understood by machines and the engine should take advantage of that structure and relevance.

Additionally, searching code has unique requirements like ignoring certain punctuation (parenthesis or periods), not stripping words from queries, no stemming (where you reduce a word to its root form. An example is walking and walked both stem from walk) and more.

GitHub’s scale is also massive. They previously used Elasticsearch and it took them months to index all of the code on GitHub (8 million repositories at the time). Today, they have over 200 million repositories, so they needed a faster alternative.

The Inverted Index Data Structure

If you have to do a search on your local machine, you’ll probably use grep (a tool to search plain text files for lines that match the query).

The grep utility uses pattern matching algorithms like Boyer-Moore which does a bit of string preprocessing and then searches through the entire string to find any matches with the query. It uses a couple heuristics to skip over some parts of the string but the time complexity scales linearly with the size of the corpus (in the best case. It scales quadratically in the worst case).

This is great if you’re just working on your local machine. But it doesn’t scale if you’re searching billions of documents.

Instead, full text search engines (Elasticsearch, Lucene, etc.) utilize the Inverted Index data structure.

This is very similar to what you’ll find in the index section of a textbook. You have a list of all the words in the textbook ordered alphabetically. Next to each word is a list of all the page numbers where that word appears. This is an extremely fast way to quickly find a word in a textbook.

To create an inverted index, you’ll first search through all the documents in your corpus.

For each document, extract all the tokens (a sequence of characters that represents a certain concept/meaning).

Do some processing on these tokens (combine tokens with similar meanings, drop any common words like the or a).

Then, create some type of index where the keys are all the tokens and the values are all the document IDs where that token appears.

For more details, check out this fantastic blog post by Artem Krylysov where he builds a full text search engine (credits to him for the image above).

At GitHub, they used a special type of inverted index called an ngram index. An ngram is just a sequence of characters of length n. So, if n = 3 (a trigram index) then all the tokens will have 3 characters.

Building the Inverted Index

GitHub has to build the inverted index data structure and create a pipeline to update it.

The corpus of GitHub repos is too large for a single inverted index, so they sharded the corpus by Git blob object ID. In case you’re unfamiliar, Git stores all your source code/text/image/configuration files as separate blobs where each blob gets a unique object ID.

This object ID is determined by taking the SHA-1 hash of the contents of the file being stored as a blob. Therefore, two identical files will always have the same object ID.

With this scheme, GitHub takes care of any duplicate files and also distributes blobs across different shards. This helps prevent any hot shards from a special repository that gets significantly more search hits than other repos. The repositories are split up and indexed across different shards.

Here’s a high level overview of the ingest and indexing system.

When a repository is added/updated, Github will publish an event to Kafka with data on the change.

Blackbird’s ingest crawlers will then look at each of those repositories, fetch the blob content, extract tokens and create documents that will be the input to indexing.

These documents are then published to another Kafka topic. Here, the data is partitioned between the different shards based on the blob object ID.

The Blackbird indexer will take these documents and insert them into the shard’s inverted index. It will match the proper tokens and other indices (language, owners, etc.). Once the inverted index has reached a certain number of updates, it will be flushed to disk.

Querying the Inverted Index

Here’s an overview of querying the Blackbird search engine.

The Blackbird Query service will take the user’s query and parse it into an abstract syntax tree and rewrite it with additional metadata on permissions, scopes and more.

Then, they can fan out and send concurrent search requests to each of the shards in the cluster. On each individual shard, they do some additional conversion of the query to lookup information in the indices.

They aggregate the results from all the shards, sort them by similarity score and then return the top 100 back to the frontend.

Results

Their p99 response times from individual shards are on the order of 100 milliseconds. Total response times are a bit longer due to aggregating responses, checking permissions and doing things like syntax highlighting.

The ingest pipeline can publish 120,000 documents per second, so indexing 15 billion+ docs can be done in under 40 hours.

For more details, read the full blog post here.

Tech Snippets

Sergey Brin on Gemini 1.5 Pro (March 2nd, 2024 talk)

Sergey Brin is the co-founder of Google. A few days ago, he gave a small talk at an AGI house in San Francisco where he did an impromptu Q&A session with the audience.

In this talk, he delves into quite a few interesting questions like
- his reaction to the Gemini Image Generation controversy
- will AI replace coders
- integrating Google Maps data and other 3D data into Gemini’s training
- investments in hardware and Sam Altman’s (reported) 7 trillion dollar deal

and more

https://www.youtube.com/watch?v=BQ8ykNyZ6LY

A Manager’s Guide to holding your Team Accountable

People generally tend to avoid confrontation, so this is a useful blog post on holding reports accountable and giving constructive feedback. It gives helpful starter questions on how you should approach an accountability/feedback conversation and how you can properly craft these discussions to develop a high functioning team.

http://archive.is/10RnZ

Your E-Mail Validation Logic is Wrong

Email addresses are actually a lot more complicated than you might first assume. This is a list of 14 potential edge cases you might run into if you deal with a ton of emails. It contains things you might not know about using quotes in an email address, case-sensitivity, emojis and more.

http://www.netmeister.org/blog/email.html

Forget Multi-Tasking. It's Context-Switching That Matters.

If you’d like to get more done, you should do your best to avoid unnecessary context-switches.

Here’s a terrific article from HubSpot engineering with lots of actionable tips on how you can reduce context switches and stay in flow.

If you get a slack notification for example, you should first assess how urgent it is and weigh the cost of being interrupted from flow. Whenever you switch contexts, that should be a conscious choice.

product.hubspot.com/blog/forget-multitasking-try-context-switching

Network Troubleshooting Explained (It’s not always DNS)

This is a great blog post that goes through the layers of the TCP/IP model and explains how to debug network issues in each layer. The post gives a list of CLI commands that’ll be useful when you’re trying to figure out what the issue is.

thesoloadmin.com/no-its-not-always-dns/

Premium Content

Subscribe to Quastor Pro for long-form articles on concepts in system design and backend engineering.

Past article content includes

System Design Concepts

Measuring Availability
API Gateways
Database Replication
Load Balancing
API Paradigms
Database Sharding
Caching Strategies
Event Driven Systems
Database Consistency
Chaos Engineering
Distributed Consensus

Tech Dives

Redis
Postgres
Kafka
DynamoDB
gRPC
Apache Spark
HTTP
DNS
B Trees & LSM Trees
OLAP Databases
Database Engines

When you subscribe, you’ll also get Spaced Repetition Flashcards for reviewing all the main concepts discussed in prior Quastor articles.

How to use 1:1 Meetings to Propel Your Career

Rahul Pandey was a Tech Lead at Facebook and he’s the founder of Taro, a community for software engineers to share career advice.

He posted a fantastic video on tips for doing 1:1s effectively.

Here’s a summary

You should be having regular 1:1 meetings with your manager to discuss career growth, issues and provide feedback. These meetings are key to building trust in your relationship.

Pursue Awkward 1:1s

Mark Rabkin, a Vice President at Facebook, wrote a great article about making your 1:1 meetings awkward. You shouldn’t waste this time on topics that could easily be discussed in the open.

Instead, talk about things like

Meta & Feelings - Talk about emotions. Are there any fears you feel (around career, project, upcoming deadlines, etc.)? Discuss them.
Honest Feedback - Ask them how you can be better. Or, maybe there’s feedback they gave you in a performance review that you disagree with. Talk about that.
Seek Advice - Discuss a growth area you’re currently working on and ask for suggestions.

Embrace the concept of having an “awkward” 1:1 meeting with your manager.

Go Beyond Status Updates

It’s easy for weekly 1:1 meetings to devolve into what you did last week and your plans for this week.

Instead, you could think about your highlight and lowlight of the last week and share that with your manager, along with any associated feelings.

Another tactic is to share an observation about your environment/work and give your thoughts about it. You might talk about how everyone on the team shows up late to a certain meeting and that makes you feel like the team doesn’t find it useful and maybe it could be eliminated.

Write Down Takeaways

Many engineers don’t have a system to track the 1:1 discussions over time.

You should consistently be writing down meeting notes in a shared place that you and your manager have access to.

This will show your manager that you’re listening and that you care about what happens in the 1:1 meetings.

It will also give your manager a clear view of your progression over time. This can be very useful for performance reviews.

You can view the full video here.