Snapchat's Shift to Microservices

Hey Everyone!

Today we’ll be talking about

  • Snap's Shift from a Monolith to Microservices

    • Why Engineers Wanted to Shift to Microservices

    • Design Goals and Adopting Envoy

    • The Architecture of the New System

  • Tech Snippets

    • A Detailed Guide to CDNs

    • The Ultimate Guide to Onboarding Software Engineers

    • A Curated List of Static Analysis Tools

    • Why I Program in Erlang

Plus, we have a solution to our last coding interview question.


The Architecture of Snapchat's Service Mesh

Snapchat is an instant messaging application with 360 million daily active users from all around the world.

Snap's backend used to be a monolith running on Google App Engine, but the company has since shifted to microservices running on Kubernetes across both AWS and Google Cloud.

They started the shift in 2018 and as of November 2021, they had over 300 production services and handled more than 10 million queries per second of service-to-service requests.

The shift led to a 65% reduction in compute costs while reducing latency, increasing reliability and making it easier for Snap to grow as an organization.

Snap Engineering published a great blog post on why they made the change, how they shifted and the architecture of their new system.

Here’s a Summary

For years, Snap’s backend was a monolithic app running in Google App Engine. Monoliths are great for accommodating rapid growth in features, engineers and customers.

However, they faced some challenges with the monolith as they scaled. Some of the things Snap developers wanted were

  • Clear, explicit relationships between different components

  • The ability for teams to choose their own architecture

  • Capability to iterate quickly and independently on their own data models (having massive shared datastores made this difficult due to tight couplings across teams)

  • A viable path to shift certain workloads to other cloud providers like AWS

Therefore, the Snap team proposed a service-oriented architecture.

Here are some of their design tenets for the new system

  • Clear separation of concerns between a service’s business logic and the infrastructure. Each side should be able to iterate independently.

  • Centralized service discovery and management. All service owners should have the same experience around service management regardless of where the service is running.

  • Minimal friction for creating new services.

  • Abstract the differences between cloud providers where possible. Minimize cloud provider dependencies so it’s possible to shift services between AWS/GCP/Azure, etc.

In order to meet these goals, Snap engineers rallied around Envoy as a core building block.

Envoy is an open source service proxy that was developed at Lyft and has seen huge adoption among companies that use a microservices architecture. Airbnb, Tinder and Reddit are a few examples of companies that use Envoy in their stack.

Envoy is a service proxy, so whenever a microservice sends/receives a request, that request will go through Envoy. Every microservice host will be running an Envoy sidecar container and all inbound/outbound communications will go through that container. Services have no direct interaction with the network and must communicate through Envoy.

Envoy will handle things like service discovery, routing, health checking, circuit breaking, rate limiting, load balancing and a whole bunch of other stuff. Engineers can just focus on the business logic around their particular service and they don’t have to worry about the service-to-service communications logic.

Snapchat engineers picked Envoy for several reasons

  • Compelling Featureset - Envoy supports gRPC and HTTP/2 for communication, hot restarts on configuration changes, client-side load balancing, robust circuit-breaking and much more.

  • Extensible - Envoy supports pluggable filters, so developers can easily inject their own functionality

  • Observability - Envoy offers a broad set of upstream and downstream metrics around latency, connections, retries and much more.

  • Rich Ecosystem - Both AWS and Google Cloud are investing heavily around Envoy.

  • Robust - Envoy can handle North-South traffic as well as East-West traffic. North-South traffic flows between external clients (the Snapchat iOS/Android apps) and the backend, while East-West traffic is the internal service-to-service traffic between backend services.

Snapchat used the Service Mesh design pattern where they had a data plane and a control plane. The data plane consists of all the services in the system and each service’s accompanying Envoy proxy. Each Envoy proxy would also connect to a central control plane that handles service configuration/routing, traffic management, etc. You can read more about the data plane and control plane here.

For the control plane, Snap built an internal web app called Switchboard. This serves as a single control plane for all of Snap’s services across cloud providers, regions and environments. Service owners can use Switchboard to manage their service dependencies, shift traffic between clusters, drain regions and more.

Switchboard provides a simplified configuration model that then interacts with Envoy’s APIs (xDS) to control the data plane. Snap chose not to expose Envoy’s full API surface area, as the number of options could lead to a combinatorial explosion of different configurations and make the system a nightmare to support. Instead, they standardized as much as possible, hiding most of Envoy’s options behind Switchboard’s simplified config.
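To make the idea of a simplified config concrete, here is a rough Python sketch. Snap has not published Switchboard's schema, so the input shape (`name`, `endpoints`), the function name `to_envoy_cluster`, the example service and the chosen defaults are all assumptions made up for illustration. Only the output keys follow the field names of Envoy's v3 Cluster resource.

```python
from typing import Any, Dict, List


def to_envoy_cluster(service: Dict[str, Any]) -> Dict[str, Any]:
    """Expand a hypothetical simplified service entry into an Envoy-style cluster.

    The input keys ("name", "endpoints") are invented for this example.
    The output keys mirror the field names of Envoy's v3 Cluster resource.
    """
    lb_endpoints: List[Dict[str, Any]] = [
        {
            "endpoint": {
                "address": {
                    "socket_address": {"address": host, "port_value": port}
                }
            }
        }
        for host, port in service["endpoints"]
    ]
    return {
        "name": service["name"],
        # These values are fixed by the platform instead of being chosen per
        # service. The specific defaults here are assumptions, not Snap's.
        "type": "STRICT_DNS",
        "connect_timeout": "1s",
        "lb_policy": "ROUND_ROBIN",
        "load_assignment": {
            "cluster_name": service["name"],
            "endpoints": [{"lb_endpoints": lb_endpoints}],
        },
    }


# A service owner only supplies a name and a list of (host, port) pairs.
simple = {"name": "friends-service", "endpoints": [("friends.internal", 8080)]}
cluster = to_envoy_cluster(simple)
print(cluster["lb_policy"])
```

The point of the sketch is the shape of the trade-off: the service owner fills in two fields, and every other knob is decided once by the platform team.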

Envoy is also used for Snap’s API Gateway, the front door for all requests from Snapchat clients. The gateway runs the same Envoy image that the internal microservices run and it connects to the same control plane.

Engineers wrote custom Envoy filters for the gateway to handle things like authentication, stricter rate limiting and load shedding.

Here’s the overall architecture of the system.

For more details, you can read the full blog post here.


Tech Snippets

  • The Ultimate Guide to Onboarding Software Engineers - Leadership Garden is a great blog that helps engineers become better leaders. This post gives a 13-step onboarding framework for helping new developers succeed at your company. The framework is divided into 30 day, 60 day, 90 day and 150 day milestones.

  • A detailed guide to CDNs - Web.dev is a great resource for web developers with a ton of in-depth guides on various topics in building scalable websites. They have a great guide on Content Delivery Networks with information on how to choose a CDN, improving the cache hit ratio and other performance tweaks.

  • Static Analysis Tools - If you're interested in learning more about static analysis tools, this is an awesome list of open source static analyzers for a huge variety of programming languages, build tools, config files and more. It's part of the analysis tools website, which adds rankings, user comments and tutorial videos.

  • Why I Program in Erlang - This is a great blog post on the features of Erlang. The author calls Erlang "the culmination of twenty-five years of correct design decisions in the language and platform". For example, GC pauses (where your system pauses for automatic garbage collection) are a problem in many languages, but Erlang can have thousands of independent heaps which are garbage-collected separately. Another example is how concatenating two strings in Erlang is implemented to be an O(1) operation. Read the full blog post for more benefits/drawbacks.

  • Get a look into what CTOs are reading - If you find Quastor useful, you should check out Pointer.io. It's a reading club for software developers that sends out super high quality engineering-related content. It's read by CTOs, engineering managers and senior engineers so you should definitely sign up if you're on that path (or if you want to go down that path in the future). It's completely free! (cross promo)

Interview Question

You’re given an integer N as input.

Write a function that determines the fewest number of perfect squares that sum up to N.

Examples

Input - 16

Output - 1 (16 is a perfect square)

Input - 21

Output - 3 (1 + 4 + 16 = 21)

Previous Question

As a reminder, here’s our last question

You are given an n x n 2D matrix.

Rotate the matrix by 90 degrees clockwise.

Solution

We can solve this question by breaking it down into two steps.

  1. Find the transpose of the matrix

  2. Reverse each row of the transpose

The transpose of a matrix is an operation that flips the matrix over its main diagonal.

The row and column indices of the matrix are switched.

After finding the transpose of the matrix, you need to iterate through each of the rows and reverse the order of the elements in that row.

These two transformations are equivalent to a 90 degree clockwise rotation.

Here’s the Python 3 code.
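A minimal sketch of the two steps, assuming the matrix is a square list of lists and should be rotated in place (the function name `rotate` and the sample grid are our own choices):

```python
from typing import List


def rotate(matrix: List[List[int]]) -> None:
    """Rotate an n x n matrix 90 degrees clockwise, in place."""
    n = len(matrix)

    # Step 1: transpose by swapping elements across the main diagonal.
    # Starting j at i + 1 visits each off-diagonal pair exactly once.
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j], matrix[j][i] = matrix[j][i], matrix[i][j]

    # Step 2: reverse the order of the elements within each row.
    for row in matrix:
        row.reverse()


grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
rotate(grid)
print(grid)  # [[7, 4, 1], [8, 5, 2], [9, 6, 3]]
```

Both steps touch each element a constant number of times, so the rotation runs in O(n^2) time and uses O(1) extra space.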