How GitHub improves Reliability and Availability

We'll talk about open source tools like Scientist and Flipper and go through a real-world case study at GitHub. Plus, how to set engineering org values and more.

Arpan KG
September 11, 2024

Hey Everyone!

Today we’ll be talking about

The Tooling GitHub uses for Improving Reliability and Availability
- GitHub has an extremely complex codebase and they use many different tools to help them understand, debug and monitor their application
- They use open source Ruby libraries like Scientist and Flipper for refactoring code and incrementally rolling out changes
- For observability, they use Datadog for metrics and Splunk for logs and events

Tech Snippets
- SQLite is not a toy database
- The trade-off we make with tests
- How to set engineering org values

The Tooling GitHub uses to Improve Availability and Reliability

GitHub is the world’s largest developer platform with over 200 million code repositories on the site. Every month, tens of millions of developers use GitHub for storing their code.

At this size, GitHub’s tech stack and codebase are extremely complex. Small changes can have a big ripple effect, and finding the root cause of an issue can be a challenge.

Nick Hengeveld has been a developer at GitHub since 2011 and he wrote a fantastic blog post delving into the techniques and tools GitHub uses to improve availability.

We’ll talk about observability and release engineering at GitHub, then walk through a real-world example.

Observability

Understanding what’s happening in the backend is crucial when you’re at the scale of GitHub. Two of the techniques GitHub uses are analyzing metrics and analyzing events.

Metrics

Metrics are quantitative measurements of backend behavior that help engineers understand how the system is performing over time.

GitHub uses Datadog for storing metrics. They track data points like traffic levels, response times, error rates, cache hit rates and more.

Collecting these data points lets them visualize patterns and spot problem areas before they escalate into an outage.

Events

Events provide qualitative information about specific occurrences within the system.

GitHub uses Splunk for storing and analyzing events. Examples of events include: deployments, repository creations/deletions, errors and more.

By analyzing these events, GitHub can better understand the root cause when problems arise. If there’s a sudden spike in CPU usage, they can look at error logs to see if there’s a bug causing the issue.

Release Engineering

To keep deployments smooth, GitHub also follows best practices around testing and gradual rollouts.

Testing

For testing proposed changes, GitHub uses Scientist, an open-source Ruby library they developed for refactoring critical paths. Scientist allows them to run experiments that compare the results and performance of new code against the existing implementation. This lets them catch regressions before the new code path takes over.

For example, when optimizing a database query, they might use Scientist to:

- Run both the old and new query for the same request
- Compare the results to ensure they're identical
- Measure the performance difference between the two
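
To make the three steps above concrete, here is a minimal Python sketch of the experiment pattern. It is not the Scientist API (Scientist is a Ruby library); `run_experiment`, `old_query` and `new_query` are hypothetical names made up for illustration. The key idea is that the caller always gets the result of the old code, while the new code only contributes a comparison report.

```python
import random
import time


def run_experiment(name, control, candidate):
    """Run both code paths, compare them, and return the control's result."""
    branches = [("control", control), ("candidate", candidate)]
    # Vary the order so neither path always benefits from a warm cache.
    random.shuffle(branches)

    observations = {}
    for label, fn in branches:
        start = time.perf_counter()
        try:
            value, error = fn(), None
        except Exception as exc:
            # Capture the error so a failing candidate never reaches the caller.
            value, error = None, exc
        observations[label] = {
            "value": value,
            "error": error,
            "seconds": time.perf_counter() - start,
        }

    old, new = observations["control"], observations["candidate"]
    if old["error"] is not None:
        # The existing behavior is preserved, including its failures.
        raise old["error"]

    report = {
        "name": name,
        "matched": new["error"] is None and new["value"] == old["value"],
        "control_seconds": old["seconds"],
        "candidate_seconds": new["seconds"],
    }
    return old["value"], report


# Hypothetical stand-ins for an old and a new query.
def old_query():
    return sorted([3, 1, 2])


def new_query():
    return [1, 2, 3]


result, report = run_experiment("repo-listing", old_query, new_query)
print(result, report["matched"])
```

In a real system the report would be published to a metrics or logging backend instead of being returned, so mismatches can be reviewed before switching over to the new code.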

Gradual Rollouts

For rolling out changes, GitHub uses Flipper, an open-source Ruby library that implements feature flags. This allows them to:

- Release new features incrementally
- Limit initial exposure of the new feature to early access users
- Gradually increase the percentage of users as they gain confidence
- Quickly roll back if unexpected issues arise

This approach significantly reduces the risk associated with deploying new features or changes to a system of GitHub's scale.
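
Here is a rough Python sketch of how a percentage-based feature flag can work. This is not Flipper's API (Flipper is a Ruby library); the `FeatureFlag` class, its method names and the `skip_access_check` flag name are all hypothetical. The sketch assumes each user is hashed into a stable bucket from 0 to 99, so a user who is enabled at 10% stays enabled at 50%.

```python
import hashlib


class FeatureFlag:
    """A toy feature flag with early-access actors and a percentage rollout."""

    def __init__(self, name):
        self.name = name
        self.enabled_actors = set()
        self.percentage = 0

    def enable_actor(self, actor_id):
        # Early access: turn the feature on for one specific user.
        self.enabled_actors.add(actor_id)

    def enable_percentage(self, percentage):
        if not 0 <= percentage <= 100:
            raise ValueError("percentage must be between 0 and 100")
        self.percentage = percentage

    def disable(self):
        # Roll back: turn the feature off for everyone.
        self.enabled_actors.clear()
        self.percentage = 0

    def is_enabled(self, actor_id):
        if actor_id in self.enabled_actors:
            return True
        # Hash the flag name and actor into a stable bucket in [0, 100).
        digest = hashlib.sha256(f"{self.name}:{actor_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.percentage


flag = FeatureFlag("skip_access_check")
flag.enable_actor("staff-1")   # early access user
flag.enable_percentage(10)     # then roughly 10% of users
print(flag.is_enabled("staff-1"))
flag.disable()                 # quick rollback
print(flag.is_enabled("staff-1"))
```

Hashing on the flag name as well as the user means different flags roll out to different slices of users, instead of the same unlucky 10% seeing every new feature first.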

Real World Example

GitHub shared a recent example of how they used all this tooling to improve availability and reduce latency.

They queried the logs in Splunk for the total request volume, average request latency and max latency of every endpoint in their backend.
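
As a rough illustration of that kind of query, here is a Python sketch that computes the same three numbers per endpoint from a list of request log records. The record fields (`endpoint`, `latency_ms`) and the sample data are hypothetical; GitHub ran their search in Splunk, not in Python.

```python
from collections import defaultdict


def summarize_by_endpoint(records):
    """Return request count, average latency and max latency per endpoint."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["endpoint"]].append(record["latency_ms"])

    summary = []
    for endpoint, latencies in grouped.items():
        summary.append({
            "endpoint": endpoint,
            "requests": len(latencies),
            "avg_ms": sum(latencies) / len(latencies),
            "max_ms": max(latencies),
        })
    # Busiest endpoints first.
    summary.sort(key=lambda row: row["requests"], reverse=True)
    return summary


# Hypothetical sample records, made up for this example.
sample = [
    {"endpoint": "/redirect", "latency_ms": 40},
    {"endpoint": "/redirect", "latency_ms": 60},
    {"endpoint": "/redirect", "latency_ms": 950},
    {"endpoint": "/profile", "latency_ms": 120},
]

rows = summarize_by_endpoint(sample)
for row in rows:
    print(row)
```

Looking at the max alongside the average matters here: an endpoint can have a reasonable average while a handful of requests sit right at the timeout.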

They found that one of the busiest endpoints was a service responsible for a simple redirect. Its latency was regularly degrading to the point where requests hit the timeout threshold.

GitHub delved into this using their observability tooling to see the individual service’s requests and sort them by how long each request took to process.

After examining the slowest requests, they found that the high latency was because the service was performing an access check that wasn’t required to send the redirect response.

GitHub used Flipper to add a feature flag that would skip the unnecessary access check. They configured the feature flag to first skip the check in a small number of requests. As they gained confidence, they gradually ramped up the rollout (while continuing to monitor Datadog and Splunk).

After confirming the performance improved for P75 and P99 latency, GitHub enabled the feature for all requests so that the unnecessary access check would be skipped.
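
If you haven't worked with percentile latencies before: P75 is the latency that 75% of requests come in under, and P99 is the latency that 99% come in under. Here is a small Python sketch using the nearest-rank method. The `percentile` helper and the before/after numbers are hypothetical, and monitoring tools may use a slightly different interpolation.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value covering pct% of the data."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, made up for this example.
before = [40, 50, 60, 70, 900, 950, 1000, 1000]
after = [20, 25, 30, 35, 40, 45, 50, 60]

for label, latencies in (("before", before), ("after", after)):
    print(label, percentile(latencies, 75), percentile(latencies, 99))
```

Tracking the high percentiles instead of the average is what surfaces slow outliers like the requests that were running into the timeout.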

Tech Snippets

SQLite is not a toy database

This is an interesting read that delves into the capabilities of SQLite. You might just think of SQLite as a simple, embedded database.

However, it also has features like
- full-text search capabilities
- graph database functionality
- statistical functions for data analysis

and much more. It might be worth exploring these advanced features before you reach for a more complex system.
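
As a taste of the graph-style queries, here is a small Python sketch using the built-in `sqlite3` module. It assumes a recursive common table expression as the way to traverse a graph (the linked article may take a different approach), and the `edges` table and its data are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a', 'b'), ('b', 'c'), ('c', 'd'), ('x', 'y');
""")

# Find every node reachable from 'a' by following edges.
# UNION (rather than UNION ALL) drops duplicates, so cycles terminate.
rows = conn.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 'a'
        UNION
        SELECT edges.dst FROM edges JOIN reach ON edges.src = reach.node
    )
    SELECT node FROM reach ORDER BY node
""").fetchall()

nodes = [row[0] for row in rows]
print(nodes)
conn.close()
```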

antonz.org/sqlite-is-not-a-toy-database

The trade-off we make with tests

It’s important to remember the trade-offs with testing and why 100% test coverage might hurt developer productivity.

Tests are an insurance policy against bugs, but (like insurance) you might be over-paying. This is a great article with a framework for how you should evaluate the value of your tests.

To evaluate your testing strategy:

1. Measure the cost of writing tests (time spent by developers and dedicated testers)
2. Estimate the cost of bugs (including engineering time and business impact)
3. Ensure the cost of writing tests is lower than the cost of bugs
4. Consider opportunity costs - sometimes feature development may be more critical

ntietz.com/blog/too-much-of-a-good-thing-the-cost-of-excess-testing

How to set Engineering Org Values

Will Larson is the CTO of Carta and was previously the CTO of Calm. He wrote an in-depth article delving into the practical aspects of rolling out and sustaining engineering values.

Simply stating your values isn’t enough. They need to be actively integrated into daily processes in order to be useful.

Some of his tips are:
- Wait at least 6 months in a new role before establishing values
- Follow good process rollout practices: involve stakeholders, test, and iterate
- Integrate values into hiring, onboarding, promotions, and team recognition

lethain.com/setting-engineering-org-values