How GitHub Improves Reliability and Availability

We'll talk about open source tools like Scientist and Flipper and go through a real-world case study at GitHub. Plus, how to set engineering org values and more.

Hey Everyone!

Today we’ll be talking about

  • The Tooling GitHub uses for Improving Reliability and Availability

    • GitHub has an extremely complex codebase and they use many different tools to help them understand, debug and monitor their application

    • They use open source Ruby libraries like Scientist and Flipper for refactoring code and incrementally rolling out changes

    • For observability, they use Datadog and Splunk to keep track of metrics and logs

  • Tech Snippets

    • SQLite is not a toy database

    • The trade-off we make with tests

    • How to set engineering org values

The Tooling GitHub uses to Improve Availability and Reliability

GitHub is the world’s largest developer platform with over 200 million code repositories on the site. Every month, tens of millions of developers use GitHub for storing their code.

At this size, GitHub’s tech stack and codebase are extremely complex. Small changes can have a big ripple effect, and finding the root cause of issues can be a challenge.

Nick Hengeveld has been a developer at GitHub since 2011 and he wrote a fantastic blog post delving into the techniques and tools GitHub uses to improve availability.

We’ll talk about observability and release engineering at GitHub.

Observability

Understanding what’s happening in the backend is crucial when you’re at the scale of GitHub. Two of the techniques GitHub uses are analyzing metrics and analyzing events.

Metrics

Metrics are quantitative measurements of backend behavior that help engineers understand how the system is performing over time.

GitHub uses Datadog for storing metrics. They track data points like traffic levels, response times, error rates, cache hit rates and more.

Collecting this data allows them to visualize patterns and identify areas that need attention before they escalate into an outage.
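
As a rough illustration of the data points listed above, the sketch below derives traffic, error rate, cache hit rate and average response time from a handful of request samples. The `RequestSample` type, its fields and the sample values are hypothetical, not GitHub's or Datadog's schema. In practice a service would emit these measurements to the metrics backend rather than compute them in-process.

```python
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One observed request (hypothetical shape, for illustration only)."""
    status: int          # HTTP status code
    duration_ms: float   # time spent serving the request
    cache_hit: bool      # whether the response came from cache


def summarize(samples):
    """Reduce raw samples to the aggregate data points a dashboard would plot."""
    total = len(samples)
    if total == 0:
        return {"traffic": 0, "error_rate": 0.0, "cache_hit_rate": 0.0, "avg_ms": 0.0}
    errors = sum(1 for s in samples if s.status >= 500)
    hits = sum(1 for s in samples if s.cache_hit)
    return {
        "traffic": total,
        "error_rate": errors / total,
        "cache_hit_rate": hits / total,
        "avg_ms": sum(s.duration_ms for s in samples) / total,
    }


# Made-up samples for illustration.
samples = [
    RequestSample(200, 12.0, True),
    RequestSample(200, 48.0, False),
    RequestSample(500, 300.0, False),
    RequestSample(302, 8.0, True),
]
summary = summarize(samples)
print(summary)
```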

Events

Events provide qualitative information about specific occurrences within the system.

GitHub uses Splunk for storing and analyzing events. Examples of events include: deployments, repository creations/deletions, errors and more.

By analyzing these events, GitHub can better understand the root cause when problems arise. If there’s a sudden spike in CPU usage, they can look at error logs to see if there’s a bug causing the issue.
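
To make that workflow concrete, here is a minimal sketch of pulling up the events that happened around the time of a spike. The event records, timestamps and the `events_near` helper are all hypothetical; a real log store like Splunk would do this with a search query rather than Python.

```python
from datetime import datetime, timedelta

# Hypothetical event records; a real log store would hold far more fields.
events = [
    {"time": datetime(2024, 1, 1, 12, 0), "kind": "deployment", "detail": "deploy of web service"},
    {"time": datetime(2024, 1, 1, 12, 4), "kind": "error", "detail": "timeout in redirect handler"},
    {"time": datetime(2024, 1, 1, 12, 5), "kind": "error", "detail": "timeout in redirect handler"},
    {"time": datetime(2024, 1, 1, 15, 30), "kind": "repository_created", "detail": "new repo"},
]


def events_near(events, moment, window=timedelta(minutes=10)):
    """Return events within `window` of `moment`, oldest first."""
    nearby = [e for e in events if abs(e["time"] - moment) <= window]
    return sorted(nearby, key=lambda e: e["time"])


# Suppose a metrics dashboard showed CPU usage spiking at 12:06.
spike = datetime(2024, 1, 1, 12, 6)
for event in events_near(events, spike):
    print(event["time"].isoformat(), event["kind"], event["detail"])
```

Seeing a deployment followed by a burst of errors just before the spike would point the investigation at that deploy.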

Release Engineering

To ensure smooth deployments, GitHub also follows best practices around testing changes and rolling them out gradually.

Testing

For testing proposed changes, GitHub uses Scientist, an open-source Ruby library they developed for refactoring critical paths. Scientist allows them to run experiments that compare the results and performance of new code against the existing implementation. This lets them check that a change isn’t causing regressions before they switch over to it.

For example, when optimizing a database query, they might use Scientist to:

  1. Run both the old and new query for the same request

  2. Compare the results to ensure they're identical

  3. Measure the performance difference between the two
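
The three steps above can be sketched without any library. This is a simplified Python illustration of the experiment pattern, not Scientist's actual Ruby API; `run_experiment`, `old_query`, `new_query` and the `USERS` data are hypothetical names made up for the example. As with Scientist, the caller always receives the result of the old code path, so a faulty candidate cannot change behavior.

```python
import time


def run_experiment(name, control, candidate):
    """Run both code paths, record a comparison, and return the control's result."""
    start = time.perf_counter()
    control_result = control()
    control_s = time.perf_counter() - start

    observation = {"name": name, "control_s": control_s}
    try:
        start = time.perf_counter()
        candidate_result = candidate()
        observation["candidate_s"] = time.perf_counter() - start
        observation["matched"] = candidate_result == control_result
    except Exception as exc:  # a failing candidate must never affect callers
        observation["matched"] = False
        observation["error"] = repr(exc)

    return control_result, observation


# Hypothetical stand-ins for an existing query and its optimized replacement.
USERS = [{"id": 1, "active": True}, {"id": 2, "active": False}, {"id": 3, "active": True}]


def old_query():
    ids = []
    for user in USERS:
        if user["active"]:
            ids.append(user["id"])
    return ids


def new_query():
    return [user["id"] for user in USERS if user["active"]]


result, observation = run_experiment("active-user-ids", old_query, new_query)
print(result, observation["matched"])
```

The real library does more than this sketch, such as randomizing the order in which the two paths run and publishing observations to a metrics backend.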

Gradual Rollouts

For rolling out changes, GitHub uses Flipper, an open-source Ruby library that implements feature flags. This allows them to:

  1. Release new features incrementally

  2. Limit initial exposure of the new feature to early access users

  3. Gradually increase the percentage of users as they gain confidence

  4. Quickly roll back if unexpected issues arise

This approach significantly reduces the risk associated with deploying new features or changes to a system of GitHub's scale.
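
A minimal sketch of how a percentage-based rollout can work is shown below. It is written in Python for illustration and is not Flipper's API; the `FeatureFlag` class, the flag name and the actor IDs are hypothetical. The key idea is that hashing the actor ID gives each user a stable bucket, so raising the percentage only ever adds users and never flips existing ones off.

```python
import hashlib


class FeatureFlag:
    """A tiny in-memory feature flag (illustrative only, not Flipper's API)."""

    def __init__(self, name):
        self.name = name
        self.early_access = set()  # actors who always get the feature
        self.percentage = 0        # share of remaining actors, 0 to 100

    def _bucket(self, actor_id):
        # Hash flag name + actor so each actor lands in a stable bucket 0..99.
        digest = hashlib.sha256(f"{self.name}:{actor_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def enabled_for(self, actor_id):
        if actor_id in self.early_access:
            return True
        return self._bucket(actor_id) < self.percentage


flag = FeatureFlag("new_dashboard")
flag.early_access.add("staff-1")   # step 2: early access users first
flag.percentage = 10               # step 3: then 10% of everyone else
# ... watch dashboards, raise the percentage as confidence grows ...
flag.percentage = 0                # step 4: roll back instantly if needed
```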

Real World Example

GitHub shared a recent example of how they used all this tooling to improve availability and reduce latency.

They looked at the logs in Splunk and queried the total request volume, average request latency and max latency for every endpoint in their backend.
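
The aggregation described here can be sketched in a few lines. The log entries, endpoint names and latency values below are made up for illustration, and in Splunk this would be expressed as a search query rather than Python.

```python
from collections import defaultdict

# Hypothetical request log entries: (endpoint, latency in milliseconds).
request_log = [
    ("/redirect", 40), ("/redirect", 950), ("/redirect", 60),
    ("/search", 120), ("/search", 180),
    ("/profile", 35),
]


def endpoint_stats(log):
    """Compute request volume, average latency, and max latency per endpoint."""
    latencies = defaultdict(list)
    for endpoint, latency_ms in log:
        latencies[endpoint].append(latency_ms)
    return {
        endpoint: {
            "count": len(values),
            "avg_ms": sum(values) / len(values),
            "max_ms": max(values),
        }
        for endpoint, values in latencies.items()
    }


stats = endpoint_stats(request_log)
# List the busiest endpoints first, as a starting point for investigation.
for endpoint, s in sorted(stats.items(), key=lambda kv: kv[1]["count"], reverse=True):
    print(f"{endpoint}: {s['count']} requests, avg {s['avg_ms']:.0f} ms, max {s['max_ms']} ms")
```

An endpoint that is both high-volume and has a max latency far above its average is a natural candidate for a closer look.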

They found that one of the busiest endpoints was a service responsible for a simple redirect. This endpoint was regularly degrading to the timeout threshold.

GitHub delved into this using their observability tooling to see the individual service’s requests and sort them by how long each request took to process.

After examining the slowest requests, they found that the high latency was because the service was performing an access check that wasn’t required to send the redirect response.

GitHub used Flipper to add a feature flag that would skip the unnecessary access check. They configured the feature flag to first skip the check in a small number of requests. As they gained confidence, they gradually ramped up the percentage (while continuing to monitor Datadog and Splunk).

After confirming the performance improved for P75 and P99 latency, GitHub enabled the feature for all requests so that the unnecessary access check would be skipped.
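
P75 and P99 are latency percentiles: the values that 75% and 99% of requests come in at or under. The sketch below computes them with the nearest-rank method, which is one common definition (monitoring tools may interpolate differently). The latency numbers are made up for illustration and are not GitHub's data.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up latencies (ms) for illustration, before and after skipping the check.
before = [30, 35, 40, 45, 50, 400, 800, 950, 980, 1000]
after = [28, 30, 32, 35, 36, 38, 40, 42, 45, 60]

for label, data in (("before", before), ("after", after)):
    print(label, "P75:", percentile(data, 75), "P99:", percentile(data, 99))
```

Looking at high percentiles instead of the average matters here because a slow tail of requests can hit the timeout threshold while the average still looks healthy.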

Tech Snippets