How Quora integrated a Service Mesh in their Backend
We'll delve into what a Service Mesh is, why you'd use one and how it's used at Quora. Plus, why new hires often get paid more than existing employees, lessons from successful one-person startups and more.
Hey Everyone!
Today we’ll be talking about
How Quora integrated a Service Mesh in their Backend
What is a Service Mesh and why would you use one
Service Mesh Concepts Explained (Control Plane vs. Data Plane)
Why Quora picked Istio and the other choices they considered
Design challenges they faced with Istio
Final Results
Tech Snippets
10 Lessons from Successful One Person Startups
Why New Hires often get Paid more than Existing Employees
Detecting Traffic Anomalies at Scale
Going from Developer to CEO
How Quora integrated a Service Mesh into their Backend
Quora is a question-answering website with over 400 million monthly active users. You can post questions about anything on the site and other users will respond with long-form answers.
For their infrastructure, Quora uses both Kubernetes clusters for container orchestration and separate EC2 instances for particular services.
Since late 2021, one of their major projects has been building a service mesh to handle communication between all their machines and improve observability, reliability and developer productivity.
The Quora engineering team published a fantastic blog post delving into the background, technical evaluations, implementation and results of the service mesh migration.
We’ll first explain what a service mesh is and what purpose it serves. Then, we’ll delve into how Quora implemented theirs.
What is a Service Mesh
A service mesh is an infrastructure layer that handles communication between the microservices (or machines) in your backend.
As you might imagine, communication between these services can be extremely complicated, so the service mesh will handle tasks like
Service Discovery - For each microservice, instances are constantly being spun up and down. The service mesh keeps track of the IP addresses and port numbers of these instances and routes requests to and from them.
Load Balancing - When one microservice calls another, you want to send that request to an instance that’s not busy (using round robin, least connections, consistent hashing, etc.). The service mesh can handle this for you.
Observability - As all communications get routed through the service mesh, it can keep track of metrics, logs and traces.
Resiliency - The service mesh can handle things like retrying requests, rate limiting, timeouts, etc. to make the backend more resilient.
Security - The mesh layer can encrypt and authenticate service-to-service communications. You can also configure access control policies to set limits on which microservice can talk to whom.
Deployments - You might be rolling out a new version of a microservice and want to run an A/B test on it. You can set the service mesh to route a certain percentage of requests to the old version and the rest to the new version (or use some other deployment pattern).
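To make a couple of these tasks concrete, here is a minimal Python sketch of how a proxy might pick an instance (round robin or least connections) and split traffic between two versions by weight. The names (RoundRobinBalancer, LeastConnectionsBalancer, pick_version) and the addresses are made up for illustration; a real mesh implements this inside its proxies, with far more care around health checks and concurrency.

```python
import itertools
import random


class RoundRobinBalancer:
    """Cycles through the known instances in order."""

    def __init__(self, instances):
        self._cycle = itertools.cycle(list(instances))

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Picks the instance with the fewest in-flight requests."""

    def __init__(self, instances):
        self.active = {instance: 0 for instance in instances}

    def pick(self):
        # Ties go to the instance that was listed first.
        instance = min(self.active, key=self.active.get)
        self.active[instance] += 1
        return instance

    def release(self, instance):
        # Call this when the request to `instance` has finished.
        self.active[instance] -= 1


def pick_version(weights, rng=random):
    """Weighted choice: {"v1": 90, "v2": 10} sends roughly 10% of calls to v2."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


if __name__ == "__main__":
    rr = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080"])
    print([rr.pick() for _ in range(3)])

    lc = LeastConnectionsBalancer(["10.0.0.1:8080", "10.0.0.2:8080"])
    print(lc.pick(), lc.pick())

    print(pick_version({"v1": 90, "v2": 10}))
```

The weighted split is the same idea behind the A/B test and canary patterns mentioned above: the routing decision is made per request, so changing the weights shifts traffic without redeploying either version.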
Architecture of Service Mesh
In practice, a service mesh typically consists of two components
Data Plane
Control Plane
Data Plane
The data plane consists of lightweight proxies that are deployed alongside every instance for all of your microservices (i.e. the sidecar pattern). This service mesh proxy will handle all outbound/inbound communications for the instance.
So, with Istio (a popular service mesh), an Envoy proxy runs as a sidecar next to every instance of every microservice.
Control Plane
The control plane manages and configures all the data plane proxies. So you can configure things like retries, rate limiting policies, health checks, etc. in the control plane.
The control plane will also handle service discovery (keeping track of all the IP addresses for all the instances), deployments, and more.
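The split between the two planes can be sketched in a few lines of Python. This is a toy model under simplifying assumptions, not how Istio or Envoy are actually built: the class names (ProxyConfig, SidecarProxy, ControlPlane), the service names and the addresses are all hypothetical. The control plane holds the desired configuration and pushes it out; each proxy applies it when making calls on behalf of its service instance.

```python
from dataclasses import dataclass, field


@dataclass
class ProxyConfig:
    max_retries: int = 0
    # Service name -> list of instance addresses (the service discovery data).
    endpoints: dict = field(default_factory=dict)


class SidecarProxy:
    """Data plane: sits next to one service instance and handles its outbound calls."""

    def __init__(self, name):
        self.name = name
        self.config = ProxyConfig()

    def apply(self, config):
        self.config = config

    def call(self, service, send):
        """Make up to 1 + max_retries attempts, rotating through the endpoints."""
        endpoints = self.config.endpoints.get(service, [])
        if not endpoints:
            raise LookupError(f"no endpoints known for {service}")
        last_error = None
        for attempt in range(self.config.max_retries + 1):
            address = endpoints[attempt % len(endpoints)]
            try:
                return send(address)
            except ConnectionError as error:
                last_error = error
        raise last_error


class ControlPlane:
    """Control plane: owns the config and pushes it to every registered proxy."""

    def __init__(self):
        self.proxies = []
        self.config = ProxyConfig()

    def register_proxy(self, proxy):
        self.proxies.append(proxy)
        proxy.apply(self.config)

    def update(self, max_retries=None, endpoints=None):
        self.config = ProxyConfig(
            max_retries=self.config.max_retries if max_retries is None else max_retries,
            endpoints=dict(self.config.endpoints if endpoints is None else endpoints),
        )
        for proxy in self.proxies:
            proxy.apply(self.config)


if __name__ == "__main__":
    control_plane = ControlPlane()
    proxy = SidecarProxy("feed-service-1")
    control_plane.register_proxy(proxy)
    control_plane.update(
        max_retries=1,
        endpoints={"answers": ["10.0.0.1:80", "10.0.0.2:80"]},
    )

    def send(address):
        # Pretend the first instance is down.
        if address == "10.0.0.1:80":
            raise ConnectionError("instance down")
        return f"200 OK from {address}"

    print(proxy.call("answers", send))
```

Note that the service code itself never sees the retry policy or the endpoint list. Changing either one is a control plane update, which is the main appeal of the pattern.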
Integrating a Service Mesh at Quora
The Quora team looked at several options for the data plane and the control plane. For the data plane, they looked at Envoy, Linkerd and Nginx. For the control plane, they looked at Istio, Linkerd, Kuma, AWS App Mesh and a potential in-house solution.
They decided to go with Istio because of its large community and ecosystem. One of the downsides is Istio’s reputation for complexity, but the Quora team found that it had become simpler after it deprecated Mixer and unified its control plane components into a single binary (istiod).
Design
When implementing the service mesh in Quora’s hybrid environment, they had several design problems they needed to address.
Connecting EC2 VMs and Kubernetes - Istio was built with a focus on Kubernetes, and the Quora team found that integrating it with EC2 VMs was a bit bumpy. They ended up forking the Istio codebase and making some changes to the agent code that runs on their VMs.
Handling Metrics Collection - For historical/legacy reasons, Quora stored Kubernetes metrics in Prometheus and VM application metrics in Graphite. They ended up migrating to VictoriaMetrics for easier integration.
Configuration and Deployment - Istio configurations are verbose due to its rich feature set. This can make it hard for engineers to ramp up on all the Istio concepts. To improve developer productivity, Quora created high-level abstractions defined in YAML that engineers could use instead.
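This summary doesn't show what Quora's abstractions look like, so the Python sketch below is only a guess at the general idea. A short, hypothetical service spec (the field names service, canary_percent, retries and timeout_seconds are invented) is expanded into the more verbose Istio VirtualService resource. In practice the spec would live in a YAML file; a plain dict stands in for it here.

```python
import json


def to_virtual_service(spec):
    """Expand a short, hypothetical service spec into an Istio VirtualService dict."""
    canary = spec.get("canary_percent", 0)
    if not 0 <= canary <= 100:
        raise ValueError("canary_percent must be between 0 and 100")

    host = spec["service"]
    http_route = {
        "route": [
            {"destination": {"host": host, "subset": "stable"}, "weight": 100 - canary},
            {"destination": {"host": host, "subset": "canary"}, "weight": canary},
        ],
        "timeout": f"{spec.get('timeout_seconds', 5)}s",
    }
    if spec.get("retries", 0) > 0:
        http_route["retries"] = {"attempts": spec["retries"], "retryOn": "5xx"}

    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {"hosts": [host], "http": [http_route]},
    }


if __name__ == "__main__":
    # What an engineer would write: four short fields.
    spec = {"service": "answers", "canary_percent": 10, "retries": 2, "timeout_seconds": 3}
    # What gets generated for Istio.
    print(json.dumps(to_virtual_service(spec), indent=2))
```

The "stable" and "canary" subsets are also assumptions; in Istio they would have to be defined in a matching DestinationRule, which a generator like this could emit as well.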
Results
Quora first deployed the service mesh in late 2021 and has since integrated hundreds of services (using thousands of proxies).
Some features they were able to spin up with the service mesh include
Canary deployments with precise traffic controls
Load Balancing/Rate limiting/Retries
Generic service warm up infrastructure so that new pods can warm up their local cache from live traffic
For more details, read the full blog post here.