How GitHub Rebuilt Their Push Processing System

We'll talk about how GitHub decoupled their system for processing code pushes. Plus, resources for CTOs and more.

Hey Everyone!

Today we’ll be talking about

  • How GitHub Rebuilt their Push Processing System

    • GitHub recently rebuilt their system for handling code pushes to make it more decoupled

    • We’ll give a brief overview of decoupled architectures and their pros/cons

    • Afterward, we’ll talk about why GitHub split their push processing system from a single job into a group of smaller, independent jobs.

  • Tech Snippets

    • Pair Programming Antipatterns

    • Resources for CTOs

    • Nine ways to shoot yourself in the foot with Postgres

    • How Cloudflare debugged an issue with dropped packets

How GitHub Rebuilt their Push Processing System

GitHub has over 420 million repos and 100 million registered users. In May, the platform handled over 500 million code pushes from 8.5 million developers.

Whenever someone pushes code to a GitHub repository, this kicks off a chain of tasks.

GitHub has to do things like:

  • Update the repo with the latest commits

  • Dispatch any Push webhooks

  • Trigger relevant GitHub workflows

And much more. In fact, GitHub has 20 different services that run in response to a developer pushing code.

Previously, push requests were handled by a single, enormous job (called RepositoryPushJob). Whenever you pushed code, GitHub’s Ruby on Rails monolith would enqueue RepositoryPushJob and handle all the underlying sub-jobs in a sequential manner.

However, the company faced issues with this approach and decided to switch to a more decoupled architecture with Apache Kafka.

Last week, GitHub published a great blog post delving into the details.

In this article, we’ll first give an overview of decoupled architectures and their pros/cons. Then, we’ll talk about the changes GitHub made.

Overview of Decoupled Architectures

If a user performs a significant action on your app, you might have to run a series of different jobs in response. If you have a video sharing website, for example, a bunch of different things need to happen when someone uploads a video (encoding, generating transcripts, checking for piracy, etc.)

A key question is how coupled you want these jobs to be.

On one side of the spectrum, you can combine these sub-jobs (encoding, generating transcripts, etc.) into a single larger job (ProcessVideo) and then execute them in a sequential manner.

On the other side, you can have different services for each of the jobs and have them execute in parallel. Whenever a user uploads a video, you’ll add an event with the video’s details to an event streaming platform (like Kafka). Then, the different sub-jobs will consume the event and run independently.
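To make the two ends of the spectrum concrete, here’s a minimal Python sketch. It is only an illustration: the Topic class is an in-memory stand-in for an event streaming platform (it is not Kafka’s API), the function names are made up, and the transcription step fails on purpose to show how each design reacts. In a real system the consumers would be separate services running in parallel rather than a loop in one process.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], str]


def encode(event: Event) -> str:
    return f"encoded {event['video_id']}"


def transcribe(event: Event) -> str:
    # Simulated failure, to show how each design reacts to one broken step.
    raise RuntimeError("transcription service unavailable")


def check_piracy(event: Event) -> str:
    return f"scanned {event['video_id']}"


def process_video(event: Event) -> List[str]:
    """Coupled: one job runs every step in order, so an error stops the rest."""
    return [step(event) for step in (encode, transcribe, check_piracy)]


class Topic:
    """In-memory stand-in for an event stream (hypothetical, not a Kafka client)."""

    def __init__(self) -> None:
        self.subscribers: List[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        self.subscribers.append(handler)

    def publish(self, event: Event) -> Dict[str, str]:
        outcomes: Dict[str, str] = {}
        for handler in self.subscribers:
            try:
                outcomes[handler.__name__] = handler(event)
            except Exception as exc:
                # One consumer failing does not stop the others.
                outcomes[handler.__name__] = f"failed: {exc}"
        return outcomes


uploads = Topic()
for consumer in (encode, transcribe, check_piracy):
    uploads.subscribe(consumer)

# Decoupled: encode and check_piracy still finish even though transcribe fails.
outcomes = uploads.publish({"video_id": "v42"})
```

Calling process_video on the same event raises the RuntimeError as soon as transcription fails, so the piracy check never runs. With the topic, only the transcription outcome is marked as failed.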

Some of the pros of a decoupled approach are

  • Scalability - Each of the components can be scaled up/down independently based on their specific load and demand.

  • Fault Isolation - Components are independent so a failure in one component can be contained (hopefully).

  • Easier Development - Each component can be deployed independently. This makes things much easier if you have a large number of developers working together. 

Cons with the decoupled approach include

  • Increased Complexity - Managing coordination between the independent components can be much more complex. You may need additional tooling for observability and monitoring.

  • System Overhead - Communication between components gets slower when it has to cross the network instead of staying in-process. Every network request adds latency and introduces failures (timeouts, for example) that you’ll have to deal with.

  • Data Consistency - You’ll need to think about making sure data is consistent across the components. 

GitHub’s Old Tightly Coupled Architecture

Previously, GitHub used a single massive job called RepositoryPushJob for handling pushes. This job managed all the sub-jobs and triggered them one after another in a sequential series of steps.

However, the GitHub team was facing quite a few issues with this approach

  • Difficulty with Retries - If RepositoryPushJob failed, the whole job had to be retried. However, this caused problems for sub-jobs that were not idempotent (running them more than once has side effects). For example, a retry could send the same push webhook multiple times, which could cause issues for the clients receiving the webhooks.

  • Huge Blast Radius - Because the sub-jobs ran as a sequential series of steps, later sub-jobs had an implicit dependency on earlier ones. A failure in one step could impact every step after it, and the probability that some step fails grows as you add more sub-jobs to RepositoryPushJob.

  • Too Slow - Having a super long sequential process is bad for latency. The sub-jobs at the end of RepositoryPushJob had to wait for the sub-jobs in the beginning. This structure led to unnecessary latency for many user-facing push tasks (over a second in some cases).
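The blast radius point is easy to check with some quick arithmetic. The sketch below assumes each sub-job succeeds independently with the same probability. The 99.9% figure is an illustrative assumption, not a number from GitHub.

```python
def chain_success_probability(step_success: float, steps: int) -> float:
    """Chance that every step of a sequential job succeeds.

    Assumes steps fail independently and share the same success rate.
    """
    return step_success ** steps


# 0.999 is a made-up per-step success rate, used only for illustration.
for steps in (1, 5, 20):
    probability = chain_success_probability(0.999, steps)
    print(f"{steps:>2} steps: {probability:.4%} of runs finish every step")
```

Under these assumptions, a 20-step chain finishes every step about 98% of the time, so roughly 1 run in 50 hits a failure somewhere even though each individual step almost never fails.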

GitHub’s New Architecture

To decrease the coupling in the push system, GitHub decided to break up RepositoryPushJob into smaller, independent jobs.

They looked at each of the sub-jobs in RepositoryPushJob and grouped them based on dependencies, retry-ability, owning service, etc. Each group of sub-jobs was placed into an independent job with a clear owner and appropriate retry configuration.
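Here’s a rough sketch of what “a clear owner and appropriate retry configuration” could look like per job. GitHub’s actual system runs in their Ruby on Rails monolith and the post doesn’t show this code, so JobSpec, run_with_retries, the job names, and the team names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class JobSpec:
    name: str
    owner: str          # team responsible for the job (hypothetical values below)
    max_attempts: int   # 1 means "never retry", e.g. for jobs that are not idempotent
    run: Callable[[dict], None]


def run_with_retries(spec: JobSpec, push: dict) -> int:
    """Run a job up to spec.max_attempts times; return the attempts used."""
    for attempt in range(1, spec.max_attempts + 1):
        try:
            spec.run(push)
            return attempt
        except Exception:
            if attempt == spec.max_attempts:
                raise  # out of attempts: surface the failure to the caller
    raise ValueError("max_attempts must be at least 1")


calls = {"count": 0}


def flaky_sync(push: dict) -> None:
    # Fails twice with a simulated transient error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient error")


def send_webhook(push: dict) -> None:
    # Stand-in for a delivery that must not be repeated blindly.
    raise RuntimeError("delivery failed")


sync_job = JobSpec("sync_pull_requests", "pull-requests-team", 3, flaky_sync)
webhook_job = JobSpec("dispatch_webhooks", "webhooks-team", 1, send_webhook)

push = {"repo": "octo/demo", "ref": "refs/heads/main"}
attempts_used = run_with_retries(sync_job, push)  # safe to retry
```

The idea is that a job that is safe to repeat can retry freely, while a non-idempotent one (like webhook delivery) gets a stricter policy without holding back the rest of the push processing.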

Whenever a developer pushes to a repo, GitHub will add a new event to Kafka. A Kafka consumer service will monitor the Kafka topic and consume the events.

If there’s a new event, the service will enqueue all the independent background jobs onto a job queue for processing. A dedicated pool of worker nodes will then handle the jobs in the queue.
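The flow above (push event, consumer, job queue, worker pool) can be sketched in a few lines of Python. This is a simplified model, not GitHub’s code: a plain list stands in for the Kafka topic, queue.Queue stands in for the job queue, a thread pool stands in for the worker nodes, and the job names and event fields are assumptions.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Tuple

PushEvent = Dict[str, str]
Job = Tuple[str, PushEvent]

# Hypothetical job names, one per independent group of sub-jobs.
JOB_NAMES = ("update_repository", "dispatch_webhooks", "trigger_workflows")


def consume_push_events(topic: Iterable[PushEvent], job_queue: "queue.Queue[Job]") -> int:
    """Read each push event and enqueue one background job per independent task."""
    enqueued = 0
    for event in topic:
        for name in JOB_NAMES:
            job_queue.put((name, event))
            enqueued += 1
    return enqueued


def run_job(job: Job) -> str:
    name, event = job
    return f"{name}:{event['repo']}@{event['after']}"


def drain(job_queue: "queue.Queue[Job]", workers: int = 3) -> List[str]:
    """Hand every queued job to a pool of workers and collect the results."""
    jobs: List[Job] = []
    while not job_queue.empty():
        jobs.append(job_queue.get())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order the jobs were submitted.
        return list(pool.map(run_job, jobs))


push_topic = [{"repo": "octo/demo", "after": "abc123"}]  # stand-in for a Kafka topic
jobs_queue: "queue.Queue[Job]" = queue.Queue()

enqueued = consume_push_events(push_topic, jobs_queue)
results = drain(jobs_queue)
```

Note that the consumer only enqueues work. It never runs the jobs itself, which is what lets each job fail, retry, and scale on its own.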

In order to catch any issues, GitHub built extensive observability to monitor the flow of events through the pipeline.

Results

GitHub has seen great results with the new system. Some of the improvements include

  • Reliability Improvements - The old RepositoryPushJob was able to fully process 99.987% of pushes with no failures. The new pipeline is able to fully process 99.999% of pushes.

  • Lower Latency - The median (P50) pull request sync time dropped by nearly 33%.

  • Smaller Blast Radius - Previously, an issue with a single step in RepositoryPushJob could impact all subsequent sub-jobs. Now, failures are much more isolated when things go wrong.
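The jump from 99.987% to 99.999% looks small, so it helps to flip it around and look at the failure rate instead. The sketch below applies both rates to the 500 million pushes mentioned earlier. That pairing is an assumption made for illustration (the source doesn’t say the percentages were measured on that month’s traffic), and 500 million is a lower bound.

```python
from decimal import Decimal

old_success = Decimal("99.987")  # % of pushes fully processed by RepositoryPushJob
new_success = Decimal("99.999")  # % of pushes fully processed by the new pipeline

old_failure = Decimal("100") - old_success  # 0.013 %
new_failure = Decimal("100") - new_success  # 0.001 %

# How many times fewer pushes hit a failure with the new pipeline.
improvement = old_failure / new_failure

# Assumption: apply both rates to 500 million pushes in a month.
monthly_pushes = Decimal(500_000_000)
old_failed_pushes = monthly_pushes * old_failure / Decimal("100")
new_failed_pushes = monthly_pushes * new_failure / Decimal("100")

print(f"failure rate cut by a factor of {improvement}")
print(f"pushes with a failure per month: {old_failed_pushes:,.0f} -> {new_failed_pushes:,.0f}")
```

Under those assumptions, that’s 13x fewer pushes with a failure, or roughly 65,000 a month down to 5,000.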

Tech Snippets