How GitHub Rebuilt Their Push Processing System
We'll talk about how GitHub decoupled their system for processing code pushes. Plus, resources for CTOs and more.
Hey Everyone!
Today we’ll be talking about
How GitHub Rebuilt their Push Processing System
GitHub recently rebuilt their system for handling code pushes to make it more decoupled
We’ll give a brief overview of decoupled architectures and their pros/cons
After, we’ll talk about why GitHub split their push processing system from a single job to a group of smaller, independent jobs.
Tech Snippets
Pair Programming Antipatterns
Resources for CTOs
Nine ways to shoot yourself in the foot with Postgres
How CloudFlare debugged an issue with dropped packets
How GitHub Rebuilt their Push Processing System
GitHub has over 420 million repos and 100 million registered users. In May, the platform handled over 500 million code pushes from 8.5 million developers.
Whenever someone pushes code to a GitHub repository, this kicks off a chain of tasks.
GitHub has to do things like:
Update the repo with the latest commits
Dispatch any Push webhooks
Trigger relevant GitHub workflows
And much more. In fact, GitHub has 20 different services that run in response to a developer pushing code.
Previously, pushes were handled by a single, enormous job (called RepositoryPushJob). Whenever you pushed code, GitHub’s Ruby on Rails monolith would enqueue RepositoryPushJob and handle all the underlying sub-jobs in a sequential manner.
However, the company faced issues with this approach and decided to switch to a more decoupled architecture with Apache Kafka.
Last week, GitHub published a great blog post delving into the details.
In this article, we’ll first give an overview of decoupled architectures and their pros/cons. Then, we’ll talk about the changes GitHub made.
Overview of Decoupled Architectures
If a user performs a significant action in your app, you might have to run a series of different jobs. On a video sharing website, for example, several things need to happen when someone uploads a video (encoding, generating transcripts, checking for piracy, etc.)
A key question is how coupled you want these jobs to be.
On one side of the spectrum, you can combine these sub-jobs (encoding, generating transcripts, etc.) into a single larger job (ProcessVideo) and then execute them in a sequential manner.
On the other side, you can have different services for each of the jobs and have them execute in parallel. Whenever a user uploads a video, you’ll add an event with the video’s details to an event streaming platform (like Kafka). Then, the different sub-jobs will consume the event and run independently.
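To make the two ends of the spectrum concrete, here is a minimal sketch in Python. It is an in-memory illustration only: EventBus is a hypothetical stand-in for an event streaming platform like Kafka (real consumers would run as separate processes), and the step functions and topic name are made up for the example.

```python
from collections import defaultdict


def process_video(video_id, steps):
    """Coupled: one job runs every step in order; any failure stops the rest."""
    return [step(video_id) for step in steps]


class EventBus:
    """Decoupled: a toy, in-process stand-in for an event streaming platform."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Each subscriber handles the event on its own, so one failure
        # does not prevent the others from running.
        outcomes = {}
        for handler in self._subscribers[topic]:
            try:
                outcomes[handler.__name__] = handler(event)
            except Exception as exc:
                outcomes[handler.__name__] = f"failed: {exc}"
        return outcomes


# Hypothetical sub-jobs; transcribe simulates an outage.
def encode(video_id):
    return f"encoded {video_id}"


def transcribe(video_id):
    raise RuntimeError("transcription service down")


def check_piracy(video_id):
    return f"checked {video_id}"


bus = EventBus()
for handler in (encode, transcribe, check_piracy):
    bus.subscribe("video.uploaded", handler)

decoupled_outcomes = bus.publish("video.uploaded", "v1")
```

In the decoupled version, encode and check_piracy still complete even though transcribe fails. Calling process_video("v1", [encode, transcribe, check_piracy]) instead raises at the transcription step and never reaches the piracy check.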
Some of the pros of a decoupled approach are
Scalability - Each of the components can be scaled up/down independently based on their specific load and demand.
Fault Isolation - Components are independent so a failure in one component can be contained (hopefully).
Easier Development - Each component can be deployed independently. This makes things much easier if you have a large number of developers working together.
Cons with the decoupled approach include
Increased Complexity - Managing coordination between the independent components can be much more complex. You may need additional tooling for observability and monitoring.
System Overhead - Communication between components is slower when it crosses the network instead of staying in-process. Network requests add latency and introduce failure modes (timeouts, dropped connections) that you’ll have to handle.
Data Consistency - You’ll need to think about making sure data is consistent across the components.
GitHub’s Old Tightly Coupled Architecture
Previously, GitHub used a single massive job called RepositoryPushJob for handling pushes. This job managed all the sub-jobs and triggered them one after another in a sequential series of steps.
However, the GitHub team was facing quite a few issues with this approach
Difficulty with Retries - If RepositoryPushJob failed then it would have to be retried. However, this caused issues with some of the sub-jobs that were not idempotent (you couldn’t safely run them multiple times). For example, sending multiple push webhooks could cause issues with clients that were receiving the webhooks.
Huge Blast Radius - The fact that jobs were set in a sequential series of steps meant that later sub-jobs had an implicit dependency on initial sub-jobs. As you increase the number of sub-jobs in RepositoryPushJob, the probability of failure increases.
Too Slow - Having a super long sequential process is bad for latency. The sub-jobs at the end of RepositoryPushJob had to wait for the sub-jobs in the beginning. This structure led to unnecessary latency for many user-facing push tasks (over a second in some cases).
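The retry and blast radius problems can be illustrated with a short Python sketch. This is not GitHub’s code: run_with_retries, the two step functions, and the 99.9% per-step success rate are all assumptions chosen for the example, and the probability calculation assumes sub-jobs fail independently of each other.

```python
def run_with_retries(steps, max_attempts=3):
    """Run every step in order; on any failure, restart the whole job."""
    for attempt in range(1, max_attempts + 1):
        try:
            for step in steps:
                step()
            return attempt  # number of attempts it took to succeed
        except RuntimeError:
            continue
    raise RuntimeError("job failed after all attempts")


sent_webhooks = []
flaky_calls = []


def send_webhook():
    # Not idempotent: every call delivers another webhook.
    sent_webhooks.append("push")


def flaky_step():
    # Hypothetical later sub-job that fails on its first call only.
    flaky_calls.append(1)
    if len(flaky_calls) == 1:
        raise RuntimeError("transient failure")


def chain_success_probability(per_step_success, step_count):
    """Chance that every step in a sequential chain succeeds,
    assuming the steps fail independently."""
    return per_step_success ** step_count


attempts = run_with_retries([send_webhook, flaky_step])
all_ok = chain_success_probability(0.999, 20)
```

Here the job succeeds on its second attempt, but the webhook has been sent twice because the retry restarted from the first step. With 20 sequential steps that each succeed 99.9% of the time, the chance that the whole chain succeeds drops to roughly 98%.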
GitHub’s New Architecture
To decrease the coupling in the push system, GitHub decided to break up RepositoryPushJob into smaller, independent jobs.
They looked at each of the sub-jobs in RepositoryPushJob and grouped them based on dependencies, retry-ability, owning service, etc. Each group of sub-jobs was placed into an independent job with a clear owner and appropriate retry configuration.
Whenever a developer pushes to a repo, GitHub will add a new event to Kafka. A Kafka consumer service will monitor the Kafka topic and consume the events.
If there’s a new event, the service will enqueue all the independent background jobs onto a job queue for processing. A dedicated pool of worker nodes will then handle the jobs in the queue.
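The flow described above (one push event fanned out into independent jobs, each with its own retry configuration) might look roughly like the following Python sketch. Everything here is an assumption for illustration: an in-memory queue.Queue stands in for both Kafka and the job queue, and JobSpec, fan_out, drain, the job names, and the retry limits are hypothetical rather than taken from GitHub’s system.

```python
import queue
from dataclasses import dataclass
from typing import Callable


@dataclass
class JobSpec:
    name: str
    handler: Callable[[dict], None]
    max_attempts: int  # 1 means "never retry", e.g. for non-idempotent work


def fan_out(event, job_specs, job_queue):
    """Consumer side: turn one push event into one queued job per spec."""
    for spec in job_specs:
        job_queue.put((spec, event))


def drain(job_queue):
    """Worker side: run each job on its own, honoring its retry limit."""
    results = {}
    while not job_queue.empty():
        spec, event = job_queue.get()
        for attempt in range(1, spec.max_attempts + 1):
            try:
                spec.handler(event)
                results[spec.name] = f"ok after {attempt} attempt(s)"
                break
            except Exception:
                results[spec.name] = f"failed after {attempt} attempt(s)"
    return results


delivered = []
sync_calls = []


def deliver_webhooks(event):
    delivered.append(event["ref"])


def sync_pull_requests(event):
    # Simulates a transient failure on the first call only.
    sync_calls.append(event["ref"])
    if len(sync_calls) == 1:
        raise RuntimeError("transient failure")


def update_search_index(event):
    # Simulates a job that is down for the whole run.
    raise RuntimeError("index unavailable")


specs = [
    JobSpec("webhooks", deliver_webhooks, max_attempts=1),
    JobSpec("pr_sync", sync_pull_requests, max_attempts=3),
    JobSpec("search_index", update_search_index, max_attempts=2),
]
jobs = queue.Queue()
fan_out({"repo": "octo/demo", "ref": "refs/heads/main"}, specs, jobs)
results = drain(jobs)
```

In this run the webhook job executes exactly once, the pull request sync job recovers on its second attempt, and the failing index job exhausts its two attempts without affecting the other two. Retrying one job never re-runs another, which is the property the single sequential job lacked.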
In order to catch any issues, GitHub built extensive observability to monitor the flow of events through the pipeline.
Results
GitHub has seen great results with the new system. Some of the improvements include
Reliability Improvements - The old RepositoryPushJob was able to fully process 99.987% of pushes with no failures. The new pipeline is able to fully process 99.999% of pushes.
Lower Latency - GitHub saw a notable decrease in the pull request sync time with a drop of nearly 33% (in the P50 time).
Smaller Blast Radius - Previously, an issue with a single step in RepositoryPushJob could impact all subsequent sub-jobs. Now, failures are much more isolated and there’s a smaller blast radius for when things go wrong.
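To get a feel for what the reliability numbers mean at GitHub’s scale, the short calculation below applies both success rates to the 500 million monthly pushes mentioned earlier. Pairing those figures is an assumption made for illustration; the rates were not necessarily measured over that exact volume, and failed_pushes is a hypothetical helper.

```python
from fractions import Fraction


def failed_pushes(total_pushes, success_rate_percent):
    """Pushes not fully processed, given a success rate as a percent string."""
    failure_rate = (100 - Fraction(success_rate_percent)) / 100
    return int(total_pushes * failure_rate)


MONTHLY_PUSHES = 500_000_000  # "over 500 million" pushes in May

old_failures = failed_pushes(MONTHLY_PUSHES, "99.987")
new_failures = failed_pushes(MONTHLY_PUSHES, "99.999")
improvement = old_failures // new_failures
```

Under these assumptions, the old job would leave about 65,000 pushes per month not fully processed, compared to about 5,000 for the new pipeline, which is 13 times fewer.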