How Canva Implemented End-to-End Tracing

Canva processes over 5 billion spans daily. Plus, a detailed guide to CDNs, when feature flags do and don't make sense and more.

December 05, 2023

Hey Everyone!

Today we’ll be talking about

How Canva Implemented End-to-End Tracing
- The Three Pillars of Observability
- How Distributed Tracing Works
- Implementing Backend Tracing with OpenTelemetry
- Implementing Frontend Tracing
- Insights Gained from Tracing
Tech Snippets
- System Design Newsletter
- How Google does Code Review
- Ways of Generating Income from Open Source
- A Detailed Guide to CDNs
- When Feature Flags Do and Don’t Make Sense
- How Dropbox Selects Data Centers
- Pytudes

End-to-End Tracing at Canva

As systems get bigger and more complicated, having good observability in place is crucial.

You’ll commonly hear about the Three Pillars of Observability

Logs - specific, timestamped events. Your web server might log an event whenever there’s a configuration change or a restart. If an application crashes, the error message and timestamp will be written in the logs.
Metrics - Quantifiable statistics about your system like CPU utilization, memory usage, network latencies, etc.
Traces - A representation of all the events that happened as a request flows through your system. For example, the user file upload trace would contain all the different backend services that get called when a user uploads a file to your service.

In this post, we’ll delve into traces and how Canva implemented end-to-end distributed tracing.

Canva is an online graphic design platform that lets you create presentations, social media banners, infographics, logos and more. They have over 100 million monthly active users.

Ian Slesser is the Engineering Manager in charge of Observability and he published a fantastic blog post that delves into how Canva implemented End-to-End Tracing. He talks about Backend and Frontend tracing, insights Canva has gotten from the data and design choices the team made.

Example of End to End Tracing

With distributed tracing, you have

Traces - a record of the lifecycle of a request as it moves through a distributed system.
Spans - a single operation within the trace. Each span has a start time and a duration, and a trace is made up of multiple spans. You could have a span for executing a database query or calling a microservice. Spans can also be nested, so you might have a span in the trace for calling the metadata retrieval backend service, which contains another span for calling DynamoDB.
Tags - key-value pairs that are attached to spans. They provide context about the span, such as the HTTP method of a request, status code, URL requested, etc.

If you’d prefer a theoretical view, you can think of the trace as being a tree of spans, where each span can contain other spans (representing sub-operations). Each span node has associated tags for storing metadata on that span.

In the diagram below, you have the Get Document trace, which contains the Find Document and Get Videos spans (representing specific backend services). Both of these spans contain additional spans for sub-operations (representing queries to different data stores).
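To make the tree idea concrete, here’s a minimal Python sketch of the Get Document trace. The Span class, the durations, the tags and the data store span names are all made up for illustration; they aren’t taken from Canva’s post.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One operation in a trace: a name, timing, tags, and child spans."""
    name: str
    start_ms: int       # offset from the start of the trace, in milliseconds
    duration_ms: int
    tags: dict[str, str] = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)


def render(span: Span, depth: int = 0) -> list[str]:
    """Walk the span tree depth-first and return one indented line per span."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms} ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical timings, tags, and data store names, for illustration only.
trace = Span("Get Document", 0, 120, {"http.method": "GET"}, [
    Span("Find Document", 5, 40, children=[
        Span("query document store", 10, 30),
    ]),
    Span("Get Videos", 50, 60, children=[
        Span("query video store", 55, 45),
    ]),
])

print("\n".join(render(trace)))
```

Printing the tree gives one line per span, indented by depth, which is roughly what the waterfall view in a tracing UI shows.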

Backend Tracing

Canva first started using distributed tracing in 2017. They used the OpenTracing project to add instrumentation and record trace data from various backend services.

The OpenTracing project was started to provide vendor-neutral APIs and instrumentation for tracing. Later, OpenTracing merged with another distributed tracing project (OpenCensus) to form OpenTelemetry (known as OTEL). This project has become the industry standard for implementing distributed tracing and has native support in most distributed tracing frameworks and Observability SaaS products (Datadog, New Relic, Honeycomb, etc.).

Canva also switched over to OTEL and they’ve found success with it. They’ve fully embraced OTEL in their codebase and have only needed to instrument their codebase once. This is done through importing the OpenTelemetry library and adding code to API routes (or whatever you want to get telemetry on) that will track a specific span and record its name, timestamp and any tags (key-value pairs with additional metadata). This gets sent to a telemetry backend, which can be implemented with Jaeger, Datadog, Zipkin etc.
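Here’s a rough sketch of what that instrumentation boils down to. It uses a toy tracer built on the Python standard library, not the real OpenTelemetry API (the actual Python library exposes a similar context manager called start_as_current_span). SketchTracer, get_document and the tag names are all hypothetical.

```python
import time
from contextlib import contextmanager


class SketchTracer:
    """Toy tracer (NOT the OpenTelemetry API): collects finished spans in a list."""

    def __init__(self):
        self.finished = []   # stands in for the telemetry backend
        self._stack = []     # spans currently open, innermost last

    @contextmanager
    def start_span(self, name, tags=None):
        span = {
            "name": name,
            "parent": self._stack[-1]["name"] if self._stack else None,
            "timestamp": time.time(),
            "tags": dict(tags or {}),
        }
        started = time.perf_counter()
        self._stack.append(span)
        try:
            yield span
        except Exception as exc:
            # Record the failure as a tag, then let the error propagate.
            span["tags"]["error"] = type(exc).__name__
            raise
        finally:
            span["duration"] = time.perf_counter() - started
            self._stack.pop()
            self.finished.append(span)


tracer = SketchTracer()


def get_document(doc_id):
    """Hypothetical API route handler, instrumented with two nested spans."""
    with tracer.start_span("GET /documents", {"http.method": "GET"}) as span:
        with tracer.start_span("load_document", {"doc_id": doc_id}):
            document = {"id": doc_id, "title": "Untitled"}  # pretend DB call
        span["tags"]["http.status_code"] = 200
        return document


get_document("doc-123")
for span in tracer.finished:
    print(span["name"], span["parent"], span["tags"])
```

The inner span finishes first, so it shows up first in the list. In a real setup, the finished spans would be batched and exported to a backend like Jaeger instead of being kept in memory.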

They use Jaeger, but also have Datadog and Kibana running at the company. OTEL is integrated with those products as well so all teams are using the same underlying observability data.

Their system generates over 5 billion spans per day (roughly 58,000 per second on average), which gives engineers a wealth of data for understanding how the system is performing.

Frontend Tracing

Although Backend Tracing has gained wide adoption, Frontend Tracing is still relatively new.

OpenTelemetry provides a JavaScript SDK for collecting traces from browser applications, so Canva initially used this.

However, this massively increased the entry bundle size of the Canva app. The entry bundle is all the JavaScript, CSS and HTML that has to be sent over to a user’s browser when they first request the website. Having a large bundle size means a slower page load time and it can also have negative effects on SEO.

The OTEL library added 107 KB to Canva’s entry bundle, which is comparable to the bundle size of ReactJS.

Therefore, the Canva team decided to implement their own SDK according to OTEL specifications, tailored to what they wanted to trace in Canva’s frontend codebase. With this, they were able to reduce the size to 16 KB, about 85% smaller than the 107 KB the OTEL library added.

Insights Gained

When Canva originally implemented tracing, they did so for things like finding bottlenecks, preventing future failures and faster debugging. However, they also gained additional data on user experience and how reliable the various user flows were (uploading a file, sharing a design, etc.).

In the future, they plan on collaborating with other teams at Canva and using the trace data for things like

Improving their understanding of infrastructure costs and projecting a feature’s cost
Risk analysis of dependencies
Constant monitoring of latency & availability

For more details, you can read the full blog post here.

Tech Snippets

System Design Newsletter

The System Design Newsletter is a fantastic read if you’re looking to up your knowledge of building large scale systems. Past articles include dives into micro frontends, microservice lessons, caching patterns and more.

When you sign up, you’ll also get a powerful system design template for free!

Join 18,001+ software engineers who read the newsletter! (cross promo)

systemdesign.one

How Google takes the pain out of code reviews

This is a fantastic dive into how Google does code review and the tooling they use. Critique is Google’s code review tool and it has a 97% satisfaction score amongst engineers at Google.

This article delves into Critique, the features it provides and what makes it so popular.

engineercodex.substack.com/p/how-google-takes-the-pain-out-of

How to Generate Income from Open Source

This is an interesting blog post that delves into how open source projects are making money. One option is commercial licenses.

Another is having a limited open source version and then offering additional features for paid plans. Sidekiq is an open source library in the Ruby ecosystem that follows this approach.

A third option is open sourcing the code and offering a hosted version for a monthly price. Plausible Analytics, a privacy-focused alternative to Google Analytics, does this.

vadimdemedes.com/posts/generating-income-from-open-source

A Detailed Guide to CDNs

Web.dev is a great resource for web developers with a ton of in-depth guides on various topics in building scalable websites. This is a great guide on Content Delivery Networks with information on how to choose a CDN, improving the cache hit ratio and other performance tweaks.

http://web.dev/content-delivery-networks

Pytudes

This is an awesome GitHub repo by Peter Norvig with short, interesting snippets of Python for things like solving a sudoku board, ranking poker hands, finding scrabble words and more. They’re formatted as Jupyter Notebooks, so each piece of code also has a great explanation delving into the code and design choices.

https://github.com/norvig/pytudes

When Feature Flags Do And Don’t Make Sense

Feature flags are great for A/B testing and for coordinating between many subtasks. They are also very useful for hiding a feature until a specific launch date while still being able to deploy it and test it out.

However, some teams can become excessively risk averse and mandate that too many code changes should be behind feature flags. Having too many flags can increase complexity significantly.
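As a rough illustration of the two good use cases above, here’s a minimal Python sketch of a launch-date gate combined with a percentage rollout for A/B testing. The function names, flag name, user ID and dates are all hypothetical and aren’t taken from the linked post.

```python
import hashlib
from datetime import date


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent


def is_launched(launch_date: date, today: date) -> bool:
    """Keep a deployed feature hidden until its launch date."""
    return today >= launch_date


# Hypothetical flag name, user ID, and dates.
show_new_editor = (
    is_launched(date(2024, 1, 15), today=date(2024, 1, 20))
    and in_rollout("new-editor", "user-42", percent=50)
)
print(show_new_editor)
```

Hashing the flag name together with the user ID means a given user always lands in the same bucket for a flag, so they don’t flip between variants from one request to the next.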

software.rajivprab.com/2019/12/19/when-feature-flags-do-and-dont-make-sense