How Canva Implemented End-to-End Tracing

Canva processes over 5 billion spans daily. Plus, a detailed guide to CDNs, when feature flags do and don't make sense and more.

Hey Everyone!

Today we’ll be talking about

  • How Canva Implemented End-to-End Tracing

    • The Three Pillars of Observability

    • How Distributed Tracing Works

    • Implementing Backend Tracing with OpenTelemetry

    • Implementing Frontend Tracing

    • Insights Gained from Tracing

  • Tech Snippets

    • System Design Newsletter

    • How Google does Code Review

    • Ways of Generating Income from Open Source

    • A Detailed Guide to CDNs

    • When Feature Flags Do and Don’t Make Sense

    • How Dropbox Selects Data Centers

    • Pytudes

End-to-End Tracing at Canva

As systems get bigger and more complicated, having good observability in place is crucial.

You’ll commonly hear about the Three Pillars of Observability

  1. Logs - specific, timestamped events. Your web server might log an event whenever there’s a configuration change or a restart. If an application crashes, the error message and timestamp will be written in the logs.

  2. Metrics - Quantifiable statistics about your system like CPU utilization, memory usage, network latencies, etc.

  3. Traces - A representation of all the events that happened as a request flows through your system. For example, the user file upload trace would contain all the different backend services that get called when a user uploads a file to your service.

In this post, we’ll delve into traces and how Canva implemented end-to-end distributed tracing.

Canva is an online graphic design platform that lets you create presentations, social media banners, infographics, logos and more. They have over 100 million monthly active users.

Ian Slesser is the Engineering Manager in charge of Observability and he published a fantastic blog post that delves into how Canva implemented End-to-End Tracing. He talks about Backend and Frontend tracing, insights Canva has gotten from the data and design choices the team made.

Example of End-to-End Tracing

With distributed tracing, you have

  • Traces - a record of the lifecycle of a request as it moves through a distributed system.

  • Spans - a single operation within a trace. Each span has a start time and a duration, and a trace is made up of multiple spans. You could have a span for executing a database query or calling a microservice. Spans can also be nested, so a span for calling the metadata retrieval backend service might contain another span for calling DynamoDB.

  • Tags - key-value pairs that are attached to spans. They provide context about the span, such as the HTTP method of a request, status code, URL requested, etc.

If you’d prefer a theoretical view, you can think of the trace as being a tree of spans, where each span can contain other spans (representing sub-operations). Each span node has associated tags for storing metadata on that span.

In the diagram below, you have the Get Document trace, which contains the Find Document and Get Videos spans (representing specific backend services). Both of these spans contain additional spans for sub-operations (representing queries to different data stores).
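The same tree structure can be sketched in a few lines of Python. This is only an illustration of the idea: the Span class, the render helper, the data store span names, the tags and all of the timings are made up for the example, and only the Get Document, Find Document and Get Videos names come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """One operation in a trace: a name, timing, tags and child spans."""
    name: str
    start_ms: int       # offset from the start of the trace
    duration_ms: int
    tags: dict[str, str] = field(default_factory=dict)
    children: list[Span] = field(default_factory=list)


def render(span: Span, depth: int = 0) -> list[str]:
    """Return one indented line per span, walking the tree depth-first."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms} ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical trace: the leaf span names, tags and timings are invented.
trace = Span("Get Document", 0, 120, {"http.method": "GET"}, [
    Span("Find Document", 5, 40, children=[
        Span("query document store", 10, 30, {"db.system": "example"}),
    ]),
    Span("Get Videos", 50, 60, children=[
        Span("query video store", 55, 45, {"db.system": "example"}),
    ]),
])

print("\n".join(render(trace)))
```

The root span is the trace itself, and each level of indentation in the printed output is one level of nesting in the tree.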

Backend Tracing

Canva first started using distributed tracing in 2017. They used the OpenTracing project to add instrumentation and record trace data from various backend services.

The OpenTracing project was started to provide vendor-neutral APIs and instrumentation for tracing. Later, OpenTracing merged with another distributed tracing project (OpenCensus) to form OpenTelemetry (known as OTEL). This project has become the industry standard for implementing distributed tracing and is supported by most distributed tracing frameworks and Observability SaaS products (Datadog, New Relic, Honeycomb, etc.).

Canva also switched over to OTEL and they’ve found success with it. They’ve fully embraced OTEL and have only needed to instrument their codebase once. Instrumenting means importing the OpenTelemetry library and adding code to API routes (or whatever you want telemetry on) that tracks a specific span and records its name, timestamp and any tags (key-value pairs with additional metadata). The spans are then sent to a telemetry backend, which can be implemented with Jaeger, Datadog, Zipkin, etc.
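To make the instrumentation step concrete, here is a minimal sketch of what "track a span and record its name, timestamp and tags" can look like. It deliberately uses only the Python standard library and is not the OpenTelemetry API or Canva's code: start_span, EXPORTED and get_document are hypothetical names, and the in-memory list stands in for a real telemetry backend.

```python
from __future__ import annotations

import contextvars
import time
import uuid
from contextlib import contextmanager

# Stand-in for a telemetry backend: finished spans are appended here.
EXPORTED: list[dict] = []

# Tracks the active span so nested spans can record their parent.
_current_span: contextvars.ContextVar[dict | None] = contextvars.ContextVar(
    "current_span", default=None
)


@contextmanager
def start_span(name: str, **tags: str):
    """Record name, start timestamp, duration and tags for the wrapped code."""
    parent = _current_span.get()
    span = {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "tags": dict(tags),
        "start_time": time.time(),
    }
    token = _current_span.set(span)
    started = time.perf_counter()
    try:
        yield span
    except Exception:
        span["tags"]["error"] = "true"  # mark failed operations
        raise
    finally:
        span["duration_s"] = time.perf_counter() - started
        _current_span.reset(token)
        EXPORTED.append(span)  # "send" the finished span to the backend


def get_document(doc_id: str) -> str:
    """Hypothetical API route handler, instrumented with start_span."""
    with start_span("GET /documents", doc_id=doc_id):
        with start_span("find_document", store="example-db"):
            return f"document {doc_id}"


get_document("42")
for span in EXPORTED:
    print(span["name"], span["parent_id"], span["tags"])
```

Inner spans finish first, so find_document is exported before GET /documents. Both share one trace_id, and the inner span's parent_id points at the outer span, which is how a backend can rebuild the tree later.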

They use Jaeger, but also have Datadog and Kibana running at the company. OTEL is integrated with those products as well so all teams are using the same underlying observability data.

Their system generates over 5 billion spans per day (roughly 58,000 per second on average), which gives engineers a wealth of data for understanding how the system is performing.

Frontend Tracing

Although Backend Tracing has gained wide adoption, Frontend Tracing is still relatively new.

OpenTelemetry provides a JavaScript SDK for collecting telemetry from browser applications, so Canva initially used this.

However, this massively increased the entry bundle size of the Canva app. The entry bundle is all the JavaScript, CSS and HTML that has to be sent over to a user’s browser when they first request the website. Having a large bundle size means a slower page load time and it can also have negative effects on SEO.

The OTEL library added 107 KB to Canva’s entry bundle, which is comparable to the bundle size of ReactJS.

Therefore, the Canva team decided to implement their own SDK according to the OTEL specifications and tailored it to what they wanted to trace in Canva’s frontend codebase. With this, they were able to reduce the size from 107 KB to 16 KB, a reduction of roughly 85%.

Insights Gained

When Canva originally implemented tracing, they did so for things like finding bottlenecks, preventing future failures and debugging faster. However, they also gained additional data on user experience and how reliable the various user flows were (uploading a file, sharing a design, etc.).

In the future, they plan on collaborating with other teams at Canva and using the trace data for things like

  • Improving their understanding of infrastructure costs and projecting a feature’s cost

  • Risk analysis of dependencies

  • Constant monitoring of latency & availability
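As a rough illustration of how latency and availability numbers can fall out of trace data, the sketch below groups root spans by user flow and computes a success rate and a median duration for each. Everything in it is an assumption made for the example: the ROOT_SPANS records, the flow names and the summarize function are hypothetical, and the post doesn't describe how Canva actually computes these figures.

```python
from collections import defaultdict
from statistics import median

# Hypothetical root spans: (user flow, duration in ms, succeeded?)
ROOT_SPANS = [
    ("upload_file", 120, True),
    ("upload_file", 340, True),
    ("upload_file", 95, False),
    ("upload_file", 210, True),
    ("share_design", 60, True),
    ("share_design", 75, True),
]


def summarize(spans):
    """Group root spans by user flow; report availability and median latency."""
    by_flow = defaultdict(list)
    for flow, duration_ms, ok in spans:
        by_flow[flow].append((duration_ms, ok))

    summary = {}
    for flow, rows in by_flow.items():
        summary[flow] = {
            "requests": len(rows),
            # Share of requests in this flow that succeeded.
            "availability": sum(ok for _, ok in rows) / len(rows),
            "median_ms": median(d for d, _ in rows),
        }
    return summary


for flow, stats in summarize(ROOT_SPANS).items():
    print(flow, stats)
```

In a real setup these aggregates would typically be computed by the telemetry backend rather than in application code, but the inputs are the same: one span per request, with a duration and a success or error tag.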

For more details, you can read the full blog post here.

Tech Snippets