How GitHub shifted from a Monolith to Microservices

October 21, 2021

Hey Everyone,

Today we’ll be talking about

GitHub’s shift from Monolith to Microservices
- The history of GitHub and why they’re making this change
- Pros of Monolith & Pros of Microservices
- The process GitHub is following to make the change
Plus, a couple of awesome tech snippets on
- Email Authenticity 101: DKIM, DMARC, and SPF - This is a good read if you’re building features that rely on email, like email password resets or email two-factor authentication
- Implementing Hash Tables in C - Topics covered are hash functions, separate chaining (using linked lists, arrays, and Red Black trees), open addressing and more!
- First impressions of Rust for open source dev - Spoiler alert - He likes Rust.

We have a solution to our last coding interview question on backtracking, plus a new question from Microsoft.

Quastor Daily is a free Software Engineering newsletter that sends out FAANG Interview questions (with detailed solutions), Technical Deep Dives and summaries of Engineering Blog Posts.

GitHub's transition from Monolith to Microservices

Sha Ma is the VP of Software Engineering at GitHub.

She gave a talk at QCon 2020 about GitHub’s transition from a Monolith architecture to a Microservices-oriented architecture.

You can view the full talk and transcript here.

Here’s a summary

History

GitHub was founded in 2008 by Chris Wanstrath, P.J. Hyett, Tom Preston-Werner and Scott Chacon.

The founders of the company were open source contributors and influencers in the Ruby community. Because of this, GitHub’s architecture is deeply rooted in Ruby on Rails.

With the Ruby on Rails monolith, GitHub scaled to 50 million developers on the platform, over 100 million repositories and over 1 billion API calls per day.

Over the past 18 months, GitHub has grown rapidly as a company. They’ve doubled the number of engineers at the company, and now have over 2000 employees.

The company is also highly distributed, with over 70% of employees working outside of the headquarters and across all time zones.

Because of this diversity of engineers, GitHub is having trouble scaling the monolith.

Requiring everyone to learn Ruby before they can be productive, and having everyone develop in the same monolithic code base, is no longer the most efficient way to scale GitHub.

Therefore, GitHub engineering took a deep look at a Microservices architecture.

Here are some of the pros GitHub saw for a Monolith and Microservice architecture.

Pros of a Monolith architecture

- Infrastructure Simplicity - A new employee can get GitHub up and running on their local machine within hours.
- Code Simplicity - You don’t have to add extra logic to deal with timeouts or worry about failing gracefully due to network latency and outages.
- Architecture & Organization Simplicity - Everyone has familiarity with the same codebase and it’s easier to move people around to work on different features within the monolith.

Pros of a Microservice architecture

- System Ownership - There are functional boundaries for teams through clearly defined API contracts. This gives teams much more ownership over their feature and also gives them freedom to choose the tech stack that makes the most sense for them. They just have to make sure the API contract is followed.
- Separation of Concerns - Quicker ramp-up time for a new developer joining a team since a developer no longer has to understand all the inner workings of a large monolithic code base in order to be productive.
- Scaling Separately - Services can now be scaled separately based on their individual needs.

Based on these tradeoffs, GitHub decided to shift to a Microservices-oriented architecture.

However, the change isn’t expected to be immediate or rapid. For the foreseeable future, GitHub plans to have a hybrid monolith-microservices environment.

For this reason, it’s important for them to continue to maintain and improve the monolith codebase.

How to break up the Monolith

The first step towards breaking up a monolith is to think about the separation of code and data based on feature functionalities.

This can be done within the monolith before physically separating them into a microservices environment. It’s generally a good architectural practice that makes the codebase more manageable.

Start with the data and pay close attention to how it’s being accessed.

Each service should own and control access to its own data. Data access should only happen through clearly defined API contracts.

If you don’t enforce this, you can fall into a common microservice anti-pattern: the distributed monolith.

This is where you have the inflexibility of a monolith and the complexity of microservices.

Separating Data

Before making the transition to microservices, GitHub made sure they got data separation right. Getting it wrong can lead to the distributed monolith anti-pattern.

They first looked at their monolith and identified the functional boundaries within the database schemas.

Then, they grouped the actual database tables along these functional boundaries.

They grouped together everything related to repositories, everything related to users, and everything related to projects. The resulting functional groups are referred to as schema domains.

The repository schema domain holds all repository information like pull requests, issues, review comments, etc.

Then, GitHub implemented a query watcher in the monolith to detect and alert them anytime a query crosses multiple schema domains.
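
To make the idea concrete, here’s a minimal sketch of what a query watcher could look like. This is not GitHub’s implementation: the table-to-domain mapping, the regex-based table extraction, and the function names are all assumptions for illustration. A real watcher would hook into the ORM or database driver and use a proper SQL parser.

```python
import re

# Hypothetical mapping of tables to schema domains (illustrative only).
SCHEMA_DOMAINS = {
    "repositories": "repositories",
    "pull_requests": "repositories",
    "issues": "repositories",
    "users": "users",
    "projects": "projects",
}

# Naive assumption: a table name is whatever follows FROM or JOIN.
TABLE_PATTERN = re.compile(
    r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE
)


def domains_touched(sql):
    """Return the set of schema domains referenced by a SQL string."""
    tables = (name.lower() for name in TABLE_PATTERN.findall(sql))
    return {SCHEMA_DOMAINS[t] for t in tables if t in SCHEMA_DOMAINS}


def crosses_domains(sql):
    """True when a query references tables from more than one schema domain."""
    return len(domains_touched(sql)) > 1
```

A watcher like this could log or alert whenever `crosses_domains` returns `True`, giving the team a list of queries to rewrite.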

If a query touched more than one schema domain, then they would break the query up and rewrite it into multiple queries that respect the functional boundaries. They would then perform the necessary joins at the application layer.
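
Here’s a rough sketch of an application-layer join, assuming hypothetical `users` and `issues` tables that live in two different schema domains. Two in-memory SQLite databases stand in for the separated domains; the table layout and the `issues_with_authors` function are made up for this example.

```python
import sqlite3

# Two in-memory databases stand in for two schema domains (hypothetical tables).
users_db = sqlite3.connect(":memory:")
users_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT)")
users_db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

repos_db = sqlite3.connect(":memory:")
repos_db.execute(
    "CREATE TABLE issues (id INTEGER PRIMARY KEY, title TEXT, user_id INTEGER)"
)
repos_db.executemany(
    "INSERT INTO issues VALUES (?, ?, ?)",
    [(10, "Fix login bug", 1), (11, "Update docs", 2), (12, "Add tests", 1)],
)


def issues_with_authors():
    """Join issues to their authors in application code, one query per domain."""
    # Query 1: stays inside the repositories domain.
    issues = repos_db.execute(
        "SELECT id, title, user_id FROM issues ORDER BY id"
    ).fetchall()
    if not issues:
        return []

    # Query 2: stays inside the users domain and fetches only the needed rows.
    user_ids = sorted({user_id for _, _, user_id in issues})
    placeholders = ",".join("?" for _ in user_ids)
    users = users_db.execute(
        f"SELECT id, login FROM users WHERE id IN ({placeholders})", user_ids
    ).fetchall()
    login_by_id = dict(users)

    # The "join" happens here, in memory, instead of in SQL.
    return [
        (issue_id, title, login_by_id.get(user_id))
        for issue_id, title, user_id in issues
    ]
```

The tradeoff is an extra round trip and some in-memory work, in exchange for queries that never reach across a domain boundary.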

Separating Services

When separating services out of the monolith to a microservice, you should start with the core services and then work your way out to the feature level.

Dependency direction should always go from inside of the monolith to outside of the monolith, NOT the other way around. If you have dependency directions from microservices to inside the monolith then that can lead to the distributed monolith anti-pattern.

At GitHub, the core service that they extracted first was Authentication and Authorization. The Rails monolith communicated with the microservice using Twirp, a gRPC-like service-to-service communications framework, with an inside-to-outside dependency direction (inside of the monolith to outside of the monolith).
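
As a rough illustration of the inside-to-outside direction, here’s a sketch of how monolith code might build a call to an extracted service. Twirp routes requests as HTTP POSTs to `/twirp/<package>.<Service>/<Method>` and accepts JSON or Protobuf bodies. Everything else here is an assumption: the host, the `example.authz` package, the `Authorizer` service, the `Check` method, and the payload fields are hypothetical, and real Twirp clients are normally generated from a `.proto` file rather than written by hand.

```python
import json
import urllib.request


def build_twirp_request(base_url, package, service, method, payload):
    """Build (but do not send) a JSON request following Twirp's URL scheme."""
    url = f"{base_url.rstrip('/')}/twirp/{package}.{service}/{method}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The monolith depends on the service; the service never calls back in.
# Host, service, and fields below are hypothetical.
request = build_twirp_request(
    "http://authz.internal",
    "example.authz",
    "Authorizer",
    "Check",
    {"actor_id": 1, "action": "read", "repository_id": 42},
)
```

Sending the request (for example with `urllib.request.urlopen`) is left out so the sketch runs without a live service.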

When separating services out of the monolith, you should be on the lookout for things that keep developers working in the monolith.

A common example is shared tooling that is built over time and makes development inside the monolith more convenient. Make those shared resources available to developers outside of the monolith.

An example at GitHub was feature flags that provide monolith developers an easy way to control who sees a new feature.
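
Here’s a minimal sketch of a feature flag that could be shared by code inside and outside the monolith. It is not GitHub’s tooling: the `FeatureFlag` class, the allow-list, and the hash-based percentage rollout are assumptions chosen to show the general idea.

```python
import hashlib


class FeatureFlag:
    """Hypothetical flag: on for allow-listed actors plus a percentage of everyone else."""

    def __init__(self, name, percentage=0, allowed_actors=()):
        self.name = name
        self.percentage = percentage  # 0 to 100
        self.allowed_actors = set(allowed_actors)

    def _bucket(self, actor_id):
        # Hash the flag name and actor so each actor lands in a stable 0-99 bucket.
        key = f"{self.name}:{actor_id}".encode("utf-8")
        return int(hashlib.sha256(key).hexdigest(), 16) % 100

    def enabled_for(self, actor_id):
        if actor_id in self.allowed_actors:
            return True
        return self._bucket(actor_id) < self.percentage


# Only the allow-listed actor sees the feature while the rollout is at 0%.
flag = FeatureFlag("new_dashboard", percentage=0, allowed_actors={"alice"})
```

Because the bucket comes from a hash rather than a random number, the same actor gets the same answer on every request and from every service that evaluates the flag.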

Finally, make sure to remove old code paths once the new services are up and running. Have a plan to move 100% of the traffic over to the new service, so you don’t get stuck supporting two sets of code forever.

For more details, you can view the full talk here.

Tech Snippets

- Email Authenticity 101: DKIM, DMARC, and SPF - When you’re implementing things like password resets or two-factor auth, you’ll usually be relying on email. Therefore, it’s useful to know the basics of email domain security. The three core components are DKIM for signing, SPF for sender verification, and DMARC for stricter enforcement of the other two. Read the full blog post for an explanation of all three components and how they work together to make email more secure.
- Implementing Hash Tables in C - This is a great article that goes into detail on how you can build a hash table. Topics covered are hash functions, separate chaining (using linked lists, arrays, and Red Black trees), open addressing and more!
- First Impressions of Rust - John Millikin is a software engineer at Stripe and open source developer. He wrote an interesting blog post on his first impressions of Rust while using it to build rust-fuse. While building rust-fuse, he took notes on what he found good/bad about Rust and shared them in this blog post.

Interview Question

Given a linked list, swap every two adjacent nodes and return its head.

You must solve the problem without modifying the values in the list’s nodes. Only the nodes themselves can be changed.

Previous Solution

As a reminder, here’s our last question

You are given two integers, n and k.

Return all possible combinations of k numbers out of the range [1, n].

You can return the answer in any order.

Here’s the question in LeetCode

Solution

We can solve this question with backtracking.

We’ll create an array called combinations to keep track of all the different combinations. This array will contain our final answer.

Now, we’ll have a recursive function called _combine that takes in three parameters.

- idx - keeps track of the current index from 1 to n
- cur - keeps track of our current combination
- k - keeps track of the remaining number of elements needed to fill our current combination

If k is equal to 0, then that means our current combination is filled. We can add it to our combinations array.

Otherwise, we’ll iterate through all the integers between idx and n.

For each integer, we’ll put it in our current combination.

Then, we’ll continue down this path by calling _combine.

Afterwards, we backtrack by popping the integer out of our current combination.

After we run our _combine function, we can return the array of combinations.

Here’s the Python 3 code. What’s the time and space complexity of our solution? Reply back with your answer and we’ll tell you if you’re right/wrong.
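
A minimal sketch that follows the steps described above. The names `_combine`, `combinations`, `idx`, `cur`, and `k` come from the description; the outer function name `combine` and the exact structure are assumptions.

```python
def combine(n, k):
    """Return all combinations of k numbers chosen from the range [1, n]."""
    combinations = []

    def _combine(idx, cur, k):
        # The current combination is full, so record a copy of it.
        if k == 0:
            combinations.append(list(cur))
            return

        # Try every remaining integer from idx up to n.
        for i in range(idx, n + 1):
            cur.append(i)                # choose i
            _combine(i + 1, cur, k - 1)  # continue down this path
            cur.pop()                    # backtrack

    _combine(1, [], k)
    return combinations
```

Appending `list(cur)` rather than `cur` itself matters: `cur` is mutated while backtracking, so storing the same list object would leave every entry empty by the end.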