How Canva Implemented End-to-End Tracing

Canva processes over 5 billion spans daily. Plus, a detailed guide to CDNs, when feature flags do and don’t make sense, and more.

Hey Everyone!

Today we’ll be talking about

  • How Canva Implemented End-to-End Tracing

    • The Three Pillars of Observability

    • How Distributed Tracing Works

    • Implementing Backend Tracing with OpenTelemetry

    • Implementing Frontend Tracing

    • Insights Gained from Tracing

  • How Tinder Built Their Own API Gateway

    • What is an API Gateway

    • The Design of Tinder's API Gateway TAG

    • How a Request Flows Through TAG and the Middleware Involved

  • Tech Snippets

    • Ways of Generating Income from Open Source

    • A Detailed Guide to CDNs

    • When Feature Flags Do and Don’t Make Sense

    • How Dropbox Selects Data Centers

    • Pytudes

Level Up Your Dev Skills by Mastering the Fundamentals

When you interview at a company like Google or Facebook, they don’t test you on the specifics of any particular language. Instead, they see how well you grasp the fundamental data structures and algorithms.

Similarly, if you want to learn machine learning, then you’ll need a strong foundation in calculus and linear algebra.

Mastering the fundamentals is an underrated but crucial way to learn new tech quickly.

Brilliant makes this extremely easy by providing math, computer science and data analysis lessons using a first-principles approach. The lessons are bite-sized and can be done in less than 15 minutes, making it super convenient to build a daily learning habit.

That’s why Brilliant is used by over 10 million developers, data scientists and researchers to sharpen their analytical skills.

With the link below, you can get a 30-day free trial to check it out. You’ll also get a 20% discount when you subscribe.

sponsored

End-to-End Tracing at Canva

As systems get bigger and more complicated, having good observability in place is crucial.

You’ll commonly hear about the Three Pillars of Observability

  1. Logs - specific, timestamped events. Your web server might log an event whenever there’s a configuration change or a restart. If an application crashes, the error message and timestamp will be written in the logs.

  2. Metrics - Quantifiable statistics about your system like CPU utilization, memory usage, network latencies, etc.

  3. Traces - A representation of all the events that happen as a request flows through your system. For example, a file upload trace would capture all the different backend services that get called when a user uploads a file to your service.

In this post, we’ll delve into traces and how Canva implemented end-to-end distributed tracing.

Canva is an online graphic design platform that lets you create presentations, social media banners, infographics, logos and more. They have over 100 million monthly active users.

Ian Slesser is the Engineering Manager in charge of Observability and he recently published a fantastic blog post that delves into how Canva implemented End-to-End Tracing. He talks about Backend and Frontend tracing, insights Canva has gotten from the data and design choices the team made.

Example of End-to-End Tracing

With distributed tracing, you have

  • Traces - a record of the lifecycle of a request as it moves through a distributed system.

  • Spans - a single operation within a trace, with a start time and a duration; a trace is made up of multiple spans. You could have a span for executing a database query or for calling a microservice. Spans can also be nested, so a span for calling the metadata retrieval backend service might contain another span for querying DynamoDB.

  • Tags - key-value pairs that are attached to spans. They provide context about the span, such as the HTTP method of a request, status code, URL requested, etc.

If you’d prefer a theoretical view, you can think of the trace as being a tree of spans, where each span can contain other spans (representing sub-operations). Each span node has associated tags for storing metadata on that span.

In the diagram below, you have the Get Document trace, which contains the Find Document and Get Videos spans (representing specific backend services). Both of these spans contain additional spans for sub-operations (representing queries to different data stores).
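To make that concrete, here's a minimal sketch of what this tree of spans could look like when instrumented with the OpenTelemetry Java API (more on OpenTelemetry below). The span names mirror the example above, while the DocumentHandler class and the doc.id tag are made up for illustration; this isn't Canva's actual code.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class DocumentHandler {
    // A Tracer produces spans; the name identifies the instrumented component.
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("document-service");

    public void getDocument(String docId) {
        // Root span for the whole "Get Document" trace.
        Span getDocument = tracer.spanBuilder("Get Document").startSpan();
        try (Scope ignored = getDocument.makeCurrent()) {
            getDocument.setAttribute("doc.id", docId); // a tag: illustrative key-value metadata

            // Child span: looking the document up in the document store.
            Span findDocument = tracer.spanBuilder("Find Document").startSpan();
            try (Scope inner = findDocument.makeCurrent()) {
                // ... query the document database here ...
            } finally {
                findDocument.end();
            }

            // Sibling child span: fetching the videos embedded in the document.
            Span getVideos = tracer.spanBuilder("Get Videos").startSpan();
            try (Scope inner = getVideos.makeCurrent()) {
                // ... call the video service here ...
            } finally {
                getVideos.end();
            }
        } finally {
            getDocument.end();
        }
    }
}

Because each child span starts while its parent is the "current" span, the tracing backend can reassemble them into exactly the kind of tree described above.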

Backend Tracing

Canva first started using distributed tracing in 2017. They used the OpenTracing project to add instrumentation and record telemetry (metrics, logs, traces) from various backend services.

The OpenTracing project was created to provide vendor-neutral APIs and instrumentation for tracing. Later, OpenTracing merged with another distributed tracing project (OpenCensus) to form OpenTelemetry (known as OTEL). This project has become the industry standard for implementing distributed tracing and has native support in most distributed tracing frameworks and Observability SaaS products (Datadog, New Relic, Honeycomb, etc.).

Canva also switched over to OTEL and they’ve found success with it. They’ve fully embraced OTEL and have only needed to instrument their codebase once. This is done by importing the OpenTelemetry library and adding code to API routes (or whatever else you want telemetry on) that tracks a specific span and records its name, timestamps and any tags (key-value pairs with additional metadata). These spans get sent to a telemetry backend, which can be implemented with Jaeger, Datadog, Zipkin, etc.
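To give a rough idea of what that one-time setup involves, here's a hedged sketch of configuring the OpenTelemetry Java SDK so that recorded spans are batched and exported over OTLP to a collector, which can then forward them to Jaeger, Datadog, Zipkin or another backend. The service name and collector endpoint below are placeholders; Canva's post doesn't describe their configuration at this level of detail.

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class Telemetry {
    public static OpenTelemetry init() {
        // Attach a service name so every span is attributed to this service.
        Resource resource = Resource.getDefault().merge(Resource.create(
                Attributes.of(AttributeKey.stringKey("service.name"), "document-service")));

        // Batch spans in memory and export them over OTLP to a collector.
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .setResource(resource)
                .addSpanProcessor(BatchSpanProcessor.builder(
                        OtlpGrpcSpanExporter.builder()
                                .setEndpoint("http://otel-collector:4317") // placeholder endpoint
                                .build())
                        .build())
                .build();

        // Register globally so GlobalOpenTelemetry.getTracer(...) returns this SDK.
        return OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .buildAndRegisterGlobal();
    }
}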

They use Jaeger, but also have Datadog and Kibana running at the company. OTEL is integrated with those products as well, so all teams work from the same underlying observability data.

Their system generates over 5 billion spans per day, creating a wealth of data that engineers can use to understand how the system is performing.

Frontend Tracing

Although Backend Tracing has gained wide adoption, Frontend Tracing is still relatively new.

OpenTelemetry provides a JavaScript SDK for collecting traces from browser applications, so Canva initially used this.

However, this massively increased the entry bundle size of the Canva app. The entry bundle is all the JavaScript, CSS and HTML that has to be sent over to a user’s browser when they first request the website. Having a large bundle size means a slower page load time and it can also have negative effects on SEO.

The OTEL library added 107 KB to Canva’s entry bundle, which is comparable to the bundle size of ReactJS.

Therefore, the Canva team decided to implement their own SDK according to the OTEL specification, tailored to exactly what they wanted to trace in Canva’s frontend codebase. With this, they were able to reduce the size to 16 KB.

Insights Gained

When Canva originally implemented tracing, they did so for things like finding bottlenecks, preventing future failures and debugging faster. However, they also gained additional data on user experience and how reliable the various user flows were (uploading a file, sharing a design, etc.).

In the future, they plan on collaborating with other teams at Canva and using the trace data for things like

  • Improving their understanding of infrastructure costs and projecting a feature’s cost

  • Risk analysis of dependencies

  • Constant monitoring of latency & availability

For more details, you can read the full blog post here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.



Tech Snippets

  • Ways of Generating Income from Open Source

  • A Detailed Guide to CDNs

  • When Feature Flags Do and Don’t Make Sense

  • How Dropbox Selects Data Centers

  • Pytudes

How Tinder Built Their Own API Gateway

Tinder is the most popular dating app in the world with over 75 million monthly active users in over 190 countries. The app is owned by the Match Group, a conglomerate that also owns Match.com, OkCupid, Hinge and over 40 other dating apps.

Tinder’s backend consists of over 500 microservices, which talk to each other using a service mesh built with Envoy. Envoy is an open source service proxy: an Envoy process runs alongside every microservice, and the service does all of its inbound/outbound communication through that proxy.

For the entry point to their backend, Tinder needed an API gateway. They tried several third-party solutions like Amazon API Gateway, Apigee, Kong and others, but none were able to meet all of their needs.

Instead, they built Tinder Application Gateway (TAG), a highly scalable and configurable solution. It’s JVM-based and is built on top of Spring Cloud Gateway.

Tinder Engineering published a great blog post that delves into why they built TAG and how TAG works under the hood.

We’ll be summarizing this post and adding more context.

What is an API Gateway

The API Gateway is the “front door” to your application and it sits between your users and all your backend services. When a client sends a request to your backend, the request first goes to the API gateway, which acts as a reverse proxy.

The gateway service will handle things like

  • Authenticating the request and handling Session Management

  • Checking Authorization (making sure the client is allowed to do whatever they’re requesting)

  • Rate Limiting

  • Load balancing

  • Keeping track of the backend services and routing the request to whichever service handles it (this may involve converting an HTTP request from the client to a gRPC call to the backend service)

  • Caching (to speed up future requests for the same resource)

  • Logging

And much more.

The Gateway applies filters and middleware to the request to handle the tasks listed above. Then, it makes calls to the internal backend services to execute the request.

After getting the response, the gateway applies another set of filters (for adding response headers, monitoring, logging, etc.) and sends the response back to the client’s phone, tablet or computer.
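Since TAG (covered below) is built on Spring Cloud Gateway, here's a small illustrative sketch of that pre/post filtering pattern using a Spring Cloud Gateway global filter: it rejects requests that are missing a session token before routing them, and adds a response header after the backend has replied. The header names are hypothetical and this isn't Tinder's actual middleware.

import org.springframework.cloud.gateway.filter.GatewayFilterChain;
import org.springframework.cloud.gateway.filter.GlobalFilter;
import org.springframework.core.Ordered;
import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

@Component
public class SessionAndHeaderFilter implements GlobalFilter, Ordered {

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        // Pre-filter work: runs before the request is routed to a backend service.
        // Reject requests missing a session token (header name is hypothetical).
        String token = exchange.getRequest().getHeaders().getFirst("X-Session-Token");
        if (token == null) {
            exchange.getResponse().setStatusCode(HttpStatus.UNAUTHORIZED);
            return exchange.getResponse().setComplete();
        }

        // Post-filter work: runs after the backend service has produced a response.
        return chain.filter(exchange).then(Mono.fromRunnable(() ->
                exchange.getResponse().getHeaders().add("X-Handled-By", "gateway")));
    }

    @Override
    public int getOrder() {
        return -1; // run early, before route-specific filters
    }
}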

Tinder’s Prior Challenges with API Gateways

Prior to building TAG, the Tinder team used multiple API gateway solutions, with each application team picking its own service.

Each gateway was built on a different tech stack, which made managing all the different services challenging and led to compatibility issues when sharing reusable components across gateways. This had downstream effects, such as inconsistent handling of Session Management (managing user sign-ins) across APIs.

Therefore, the Tinder team had the goal of finding a solution to bring all these services under one umbrella.

They were looking for something that

  • Supports easy modification of backend service routes

  • Allows Tinder to add custom middleware logic for features like bot detection, schema registry and more

  • Allows easy Request/Response transformations (adding/modifying headers for the request/response)

The engineering team considered existing solutions like Amazon API Gateway, Apigee, Tyk.io, Kong, Express API Gateway and others. However, they couldn’t find one that met all of their needs and easily integrated into their system.

Some of the solutions were not well integrated with Envoy, the service proxy that Tinder uses for their service mesh. Others required too much configuration and a steep learning curve. The team wanted more flexibility to build their own plugins and filters quickly.

Tinder Application Gateway

The Tinder team decided to build their own API Gateway on top of Spring Cloud Gateway, which is part of the Java Spring framework.

Here’s an overview of the architecture of Tinder Application Gateway (TAG)

The components are

  • Routes - Developers list their API endpoints in a YAML file. TAG parses that YAML file and uses it to preconfigure all the routes in the API (a route sketch follows this list).

  • Service Discovery - Tinder’s backend is made up of hundreds of microservices, and they use Envoy to manage the service mesh. An Envoy proxy runs alongside every microservice and handles that service’s inbound/outbound communication, and Envoy’s control plane keeps track of all these services. TAG uses the Envoy control plane to look up the backend service for each route.

  • Pre Filters - Filters that you can configure in TAG to be applied to the request before it’s sent to the backend service. You can create filters to do things like modify request headers, convert from HTTP to gRPC, handle authentication and more.

  • Post Filters - Filters that can be applied on the response before it’s sent back to the client. You might configure filters to look at any errors (from the backend services) and store them in Elasticsearch, modify response headers and more.

  • Custom/Global Filters - These Pre and Post filters can be custom or global. Custom filters can be written by application teams if they need their own special logic and are applied at the route level. Global filters are applied to all routes automatically.
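In TAG, routes come from those YAML files and service discovery goes through Envoy's control plane. As a rough sketch of what a route with route-level pre and post filters looks like, here's the equivalent in the plain Spring Cloud Gateway Java DSL; the /v1/geoip path and geoip-service URI are placeholders rather than TAG's actual configuration.

import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class GatewayRoutes {

    @Bean
    public RouteLocator routes(RouteLocatorBuilder builder) {
        return builder.routes()
                // Match requests for /v1/geoip and forward them to the geoip backend.
                .route("reverse-geo-ip", r -> r
                        .path("/v1/geoip")
                        .filters(f -> f
                                .addRequestHeader("X-Gateway-Route", "reverse-geo-ip") // pre filter
                                .addResponseHeader("Cache-Control", "no-store"))       // post filter
                        .uri("http://geoip-service:8080")) // in TAG, resolved via Envoy instead
                .build();
    }
}

The hardcoded URI is only for illustration; in TAG, the service discovery module resolves the backend service for each route through the Envoy control plane.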

Real World Usage of TAG at Tinder

Here’s an example of how TAG handles a request for reverse geo IP lookup (where the IP address of a user is mapped to their country).

  1. The client sends an HTTP Request to Tinder’s backend calling the reverse geo IP lookup route.

  2. A global filter captures the request semantics (IP address, route, User-Agent, etc.) and that data is streamed through Amazon MSK (Amazon Managed Streaming for Apache Kafka). It can be consumed by applications downstream for things like bot detection, logging, etc.

  3. Another global filter will authenticate the request and handle session management.

  4. The path of the request (for example, /v1/geoip) is matched against the routes deployed in the API.

  5. The service discovery module in TAG will use Envoy to look up the backend service for the matched API route.

  6. Once the backend service is identified, the request goes through a chain of pre-filters configured for that route. These filters will handle things like HTTP to gRPC conversion, trimming request headers and more (see the custom filter sketch after this list).

  7. The request is sent to the backend service and executed, and the backend service sends its response back to the API gateway.

  8. The response will go through a chain of post-filters configured for that route. Post filters handle things like checking for any errors and logging them to Elasticsearch, adding/trimming response headers and more.

  9. The final response is returned to the client.
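For the route-level filtering in steps 6 and 8, Spring Cloud Gateway lets teams package custom logic as filter factories that individual routes then reference by name. Below is a hedged sketch of a custom pre-filter that strips a configured request header before the request reaches the backend; the factory and its config are invented for illustration and aren't taken from Tinder's codebase.

import org.springframework.cloud.gateway.filter.GatewayFilter;
import org.springframework.cloud.gateway.filter.factory.AbstractGatewayFilterFactory;
import org.springframework.stereotype.Component;

// Hypothetical route-level pre-filter: removes a configured header before the
// request is forwarded to the backend service.
@Component
public class TrimHeadersGatewayFilterFactory
        extends AbstractGatewayFilterFactory<TrimHeadersGatewayFilterFactory.Config> {

    public TrimHeadersGatewayFilterFactory() {
        super(Config.class);
    }

    @Override
    public GatewayFilter apply(Config config) {
        return (exchange, chain) -> chain.filter(
                exchange.mutate()
                        .request(request -> request.headers(headers ->
                                headers.remove(config.getHeaderName())))
                        .build());
    }

    public static class Config {
        private String headerName; // e.g. an internal-only header to strip
        public String getHeaderName() { return headerName; }
        public void setHeaderName(String headerName) { this.headerName = headerName; }
    }
}

A route definition would attach this filter by its name (TrimHeaders), alongside any global filters that apply to every route.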

The Match Group also owns apps like Hinge, OkCupid and PlentyOfFish, and all of these brands use TAG in production.

For more details, you can read the full blog post here.