How GitHub improves Reliability and Availability

We'll talk about open source tools like Scientist and Flipper and go through a real world case study at GitHub. Plus, how to set engineering org values and more.

Hey Everyone!

Today we’ll be talking about

  • The Tooling GitHub uses for Improving Reliability and Availability

    • GitHub has an extremely complex codebase and they use many different tools to help them understand, debug and monitor their application

    • They use open source Ruby libraries like Scientist and Flipper for refactoring code and incrementally rolling out changes

    • For observability, they use Datadog and Splunk for keeping track of metrics and logs

  • Tech Snippets

    • SQLite is not a toy database

    • The trade-off we make with tests

    • How to set engineering org values

The Tooling GitHub uses to Improve Availability and Reliability

GitHub is the world’s largest developer platform with over 200 million code repositories on the site. Every month, tens of millions of developers use GitHub for storing their code.

At this size, GitHub’s tech stack and codebase are extremely complex. Small changes can have a big ripple effect, and finding the root cause of an issue can be a challenge.

Nick Hengeveld has been a developer at GitHub since 2011 and he wrote a fantastic blog post delving into the techniques and tools GitHub uses to improve availability.

We’ll talk about observability and release engineering at GitHub.

Observability

Understanding what’s happening in the backend is crucial when you’re at the scale of GitHub. Two of the techniques GitHub uses are analyzing metrics and events.

Metrics

Metrics are quantitative measurements of backend behavior that help engineers understand how the system is performing over time.

GitHub uses Datadog for storing metrics. They track data points like traffic levels, response times, error rates, cache hit rates and more.

Collecting these metrics lets them visualize patterns and identify areas that need attention before they escalate into an outage.
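To make the idea concrete, here is a minimal Python sketch of reducing raw request samples to metrics like the ones listed above. The `RequestSample` type, the `summarize` function and the sample data are all hypothetical and only there for illustration; this is not how GitHub or their metrics vendor computes anything.

```python
from dataclasses import dataclass


@dataclass
class RequestSample:
    status: int         # HTTP status code returned
    duration_ms: float  # time spent serving the request
    cache_hit: bool     # whether the response was served from cache


def summarize(samples):
    """Reduce one time window of request samples to a few summary metrics."""
    total = len(samples)
    if total == 0:
        return {"traffic": 0, "error_rate": 0.0,
                "cache_hit_rate": 0.0, "avg_response_ms": 0.0}
    errors = sum(1 for s in samples if s.status >= 500)
    hits = sum(1 for s in samples if s.cache_hit)
    return {
        "traffic": total,
        "error_rate": errors / total,
        "cache_hit_rate": hits / total,
        "avg_response_ms": sum(s.duration_ms for s in samples) / total,
    }


# Hypothetical samples for a single time window
window = [
    RequestSample(200, 40.0, True),
    RequestSample(200, 60.0, False),
    RequestSample(500, 300.0, False),
    RequestSample(200, 20.0, True),
]
metrics = summarize(window)
```

Computing a summary like this per time window and plotting it over time is what makes a slow drift (say, a falling cache hit rate) visible before it turns into an incident.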

Events

Events provide qualitative information about specific occurrences within the system.

GitHub uses Splunk for storing and analyzing events. Examples of events include: deployments, repository creations/deletions, errors and more.

By analyzing these events, GitHub can better understand the root cause when problems arise. If there’s a sudden spike in CPU usage, they can look at error logs to see if there’s a bug causing the issue.
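As a rough illustration of that kind of investigation, the Python sketch below filters a list of events down to the ones that happened shortly before a spike. The event fields, the `events_before` helper and the data are made up for this example; a log platform would do this with its own query language.

```python
from datetime import datetime, timedelta

# Hypothetical event stream: each event records when it happened and what it was
events = [
    {"at": datetime(2024, 1, 1, 12, 0), "kind": "deploy", "detail": "web build A"},
    {"at": datetime(2024, 1, 1, 12, 20), "kind": "repo_created", "detail": "example/repo"},
    {"at": datetime(2024, 1, 1, 12, 27), "kind": "error", "detail": "timeout in redirect handler"},
    {"at": datetime(2024, 1, 1, 13, 5), "kind": "deploy", "detail": "web build B"},
]


def events_before(events, moment, lookback, kinds):
    """Return events of the given kinds that occurred within `lookback` before `moment`."""
    start = moment - lookback
    return [e for e in events if start <= e["at"] <= moment and e["kind"] in kinds]


# Suppose the metrics showed a CPU spike at 12:30
spike_at = datetime(2024, 1, 1, 12, 30)
suspects = events_before(events, spike_at, timedelta(minutes=15), {"deploy", "error"})
```

Here `suspects` contains only the error logged at 12:27, which is where an engineer would start looking.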

Release Engineering

To ensure smooth deployments, GitHub also follows best practices around testing and rollouts.

Testing

For testing proposed changes, GitHub uses Scientist, an open-source Ruby library they developed for refactoring critical paths. Scientist lets them run experiments that execute new code alongside the existing implementation, compare the results and measure the performance of each. This lets them confirm a change isn’t causing regressions before they switch over to it.

For example, when optimizing a database query, they might use Scientist to:

  1. Run both the old and new query for the same request

  2. Compare the results to ensure they're identical

  3. Measure the performance difference between the two
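The three steps above can be sketched in a few lines. Scientist itself is a Ruby library; what follows is a simplified Python illustration of the same pattern and not Scientist's API. The `run_experiment` function, the `publish` callback and the two queries are hypothetical.

```python
import random
import time


def run_experiment(name, control, candidate, publish):
    """Run the old (control) and new (candidate) code paths, publish a
    comparison, and always return the control's result to the caller."""
    observations = {}
    branches = [("control", control), ("candidate", candidate)]
    random.shuffle(branches)  # vary the order so neither branch always runs on a warm cache
    for label, fn in branches:
        started = time.perf_counter()
        try:
            value, error = fn(), None
        except Exception as exc:
            if label == "control":
                raise  # failures in the existing code are real failures
            value, error = None, exc  # a broken candidate must not affect users
        observations[label] = {
            "value": value,
            "error": error,
            "seconds": time.perf_counter() - started,
        }
    ctrl, cand = observations["control"], observations["candidate"]
    publish({
        "name": name,
        "matched": cand["error"] is None and cand["value"] == ctrl["value"],
        "control_seconds": ctrl["seconds"],
        "candidate_seconds": cand["seconds"],
    })
    return ctrl["value"]


# Hypothetical data and two ways of answering the same question
ROWS = [{"id": 1, "owner": "a"}, {"id": 2, "owner": "b"}, {"id": 3, "owner": "a"}]
BY_OWNER = {"a": [1, 3], "b": [2]}


def old_query():
    return sorted(r["id"] for r in ROWS if r["owner"] == "a")


def new_query():
    return list(BY_OWNER.get("a", []))


reports = []
result = run_experiment("repo-ids-by-owner", old_query, new_query, reports.append)
```

The key design choice is that the caller only ever sees the control's result. Mismatches and timings go to the `publish` callback (in practice, a metrics or logging system), so the new code can be exercised with real inputs without putting users at risk.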

Gradual Rollouts

For rolling out changes, GitHub uses Flipper, an open-source Ruby library that implements feature flags. This allows them to:

  1. Release new features incrementally

  2. Limit initial exposure of the new feature to early access users

  3. Gradually increase the percentage of users as they gain confidence

  4. Quickly roll back if unexpected issues arise

This approach significantly reduces the risk associated with deploying new features or changes to a system of GitHub's scale.
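Here is a minimal Python sketch of how a percentage-based feature flag can work. Flipper is a Ruby library with its own API and storage adapters; the `FeatureFlag` class below is a hypothetical stand-in that only illustrates the rollout steps listed above.

```python
import zlib


class FeatureFlag:
    """A tiny in-memory feature flag with an allow-list and a percentage rollout."""

    def __init__(self, name):
        self.name = name
        self.early_access = set()  # users who always get the feature
        self.percentage = 0        # share of all other users, 0 to 100

    def enable_for(self, user_id):
        self.early_access.add(user_id)

    def set_percentage(self, percentage):
        if not 0 <= percentage <= 100:
            raise ValueError("percentage must be between 0 and 100")
        self.percentage = percentage

    def disable(self):
        """Roll back: turn the feature off for everyone."""
        self.early_access.clear()
        self.percentage = 0

    def enabled(self, user_id):
        if user_id in self.early_access:
            return True
        # Hash the flag name and user id into a stable bucket from 0 to 99.
        # The same user always lands in the same bucket, so raising the
        # percentage only ever adds users and never flips anyone back off.
        bucket = zlib.crc32(f"{self.name}:{user_id}".encode()) % 100
        return bucket < self.percentage


flag = FeatureFlag("new-dashboard")  # hypothetical feature name
flag.enable_for("staff-user")        # limit initial exposure to early access
flag.set_percentage(10)              # then ramp up: 10%, 50%, 100%
```

Calling `flag.disable()` is the rollback path. Because the flag is checked at runtime, turning a feature off doesn't require a new deploy.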

Real World Example

GitHub shared a recent example of how they used all this tooling to improve availability and reduce latency.

They searched their logs in Splunk for the total request volume, average request latency and max latency of every endpoint in their backend.
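The aggregation behind a search like that looks roughly like the Python sketch below. The log records, endpoint names and `summarize_by_endpoint` function are hypothetical; GitHub ran this as a query in their logging platform, not as application code.

```python
from collections import defaultdict


def summarize_by_endpoint(log_records):
    """Group request logs by endpoint and compute volume, average and max latency."""
    durations = defaultdict(list)
    for record in log_records:
        durations[record["endpoint"]].append(record["duration_ms"])
    rows = [
        {
            "endpoint": endpoint,
            "requests": len(values),
            "avg_ms": sum(values) / len(values),
            "max_ms": max(values),
        }
        for endpoint, values in durations.items()
    ]
    # Busiest endpoints first
    return sorted(rows, key=lambda row: row["requests"], reverse=True)


# Hypothetical request logs
logs = [
    {"endpoint": "/redirect", "duration_ms": 30.0},
    {"endpoint": "/redirect", "duration_ms": 950.0},
    {"endpoint": "/redirect", "duration_ms": 40.0},
    {"endpoint": "/search", "duration_ms": 120.0},
    {"endpoint": "/search", "duration_ms": 80.0},
]
table = summarize_by_endpoint(logs)
```

In this made-up data the busiest endpoint has a max latency far above its average, which is the kind of gap that points at a few very slow requests worth inspecting individually.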

They found that one of the busiest endpoints was a service responsible for a simple redirect. This endpoint’s latency was regularly degrading to the timeout threshold.

GitHub delved into this using their observability tooling to see the individual service’s requests and sort them by how long each request took to process.

After examining the slowest requests, they found that the high latency came from an access check the service was performing, one that wasn’t required to send the redirect response.

GitHub used Flipper to add a feature flag that would skip the unnecessary access check. They configured the feature flag to first skip the check in a small number of requests. As they gained confidence, they gradually ramped up the percentage (while continuing to monitor Datadog and Splunk).

After confirming that P75 and P99 latency had improved, GitHub enabled the feature flag for all requests so that the unnecessary access check would always be skipped.
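If you haven't worked with percentile latencies before: P75 is the latency that 75% of requests finish within, and P99 is the latency that 99% finish within. Below is a small Python sketch using the nearest-rank method. The latency numbers are invented for illustration and are not GitHub's data.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, 100 samples each
before = [20] * 70 + [400] * 25 + [1000] * 5  # with the access check
after = [20] * 70 + [35] * 25 + [60] * 5      # with the check skipped

p75_before, p99_before = percentile(before, 75), percentile(before, 99)
p75_after, p99_after = percentile(after, 75), percentile(after, 99)
improved = p75_after < p75_before and p99_after < p99_before
```

Percentiles are a better fit here than averages. In this example most requests were already fast, so the problem only shows up clearly in the slowest quarter of the traffic.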

Tech Snippets