How Canva Collects 25 Billion Events Per Day

An overview of AWS Kinesis and how Canva uses it to collect and process 25 billion events per day. Plus, the art of good code review, how to find coachable employees and more.

Arpan KG
August 20, 2024

Hey Everyone!

Today we’ll be talking about

How Canva Collects 25 Billion Events Per Day
- Brief Overview of AWS Kinesis
- Architecture of Canva’s Data Pipeline
- Why Canva picked Kinesis over AWS SQS and techniques Canva uses to minimize costs
Tech Snippets
- Go is my hammer, and everything is a nail
- Coachability: The Prerequisite To Growth
- The art of good code review

How Canva Collects 25 Billion Events Per Day

Canva is an online graphic design platform that lets you create presentations, social media banners, infographics, logos and more. They have over 175 million monthly users and are valued at $26 billion.

In order to understand how people are using the platform, Canva’s mobile, web and desktop apps collect a wide range of events on user clicks, views, scrolls, etc.

Every day, Canva needs to collect and process over 25 billion events (roughly 800 billion events per month). This needs to be done with 99.999% uptime.
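To get a feel for these numbers, here is a quick back-of-the-envelope calculation. It only uses the figures quoted above; the averages it prints are illustrative, since real traffic is bursty and the quoted figures are rounded.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_YEAR = 365 * 24 * 60

events_per_day = 25_000_000_000  # figure quoted in the article
uptime_target = 0.99999          # "five nines", as quoted in the article

# Average ingest rate if events were spread evenly across the day.
avg_events_per_second = events_per_day / SECONDS_PER_DAY

# Downtime budget implied by the uptime target.
downtime_minutes_per_year = MINUTES_PER_YEAR * (1 - uptime_target)

print(f"average rate: {avg_events_per_second:,.0f} events/second")
print(f"downtime budget: {downtime_minutes_per_year:.2f} minutes/year")
```

In other words, the pipeline averages close to 290,000 events per second and can be unavailable for only about five minutes per year.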

Last month, they published a fantastic blog post on how they built a data pipeline to handle this.

They talk about why they built the pipeline on AWS Kinesis and the specific techniques they use to minimize costs and latency.

Brief Overview of AWS Kinesis

AWS Kinesis is a family of services for processing and analyzing streaming data in real-time. It was launched in late 2013 and is composed of four main services: Data Streams, Data Firehose, Data Analytics and Video Streams.

Here’s a brief overview of the four services:

Data Streams - This service is responsible for ingesting and storing streaming data in real time with sub-second latency. Kinesis Data Streams does not handle data processing, so you’ll need to use another tool (Apache Flink, Kinesis Data Analytics, Spark, etc.) for transformations and analytics. Kinesis Data Firehose can then be used to send the processed data to destinations like AWS S3, MongoDB, etc.
Data Firehose - Firehose is primarily used for loading streaming data into data lakes, databases and analytics services. You can deliver your data to AWS S3, Redshift, Elasticsearch, Splunk and other data stores. However, Firehose can also handle data ingestion and basic transformations. A few months ago, Firehose was rebranded from Kinesis Data Firehose to Amazon Data Firehose (but Firehose’s API and other functionality weren’t changed).
Data Analytics - If you’d like to run complex transformations on the streaming data that’s been ingested through Data Streams, then you can do that with Kinesis Data Analytics. Under the hood, Data Analytics uses Apache Flink, so Amazon has also rebranded Kinesis Data Analytics to “Amazon Managed Service for Apache Flink” (but the core capabilities and purpose haven’t changed).
Video Streams - In addition to data, Kinesis can also be used for ingesting and storing live video. Kinesis Video Streams gives you the infrastructure to ingest and store video data. You can integrate it with other services to process and distribute the stored video.

Canva uses Kinesis Data Streams to ingest 25 billion events per day. From Kinesis, Canva sends the event data to Snowflake for processing.

Here’s how the data pipeline works…

Canva’s Data Pipeline for Collecting Events

Canva has iOS, Android, web and desktop applications. Each of these apps is instrumented to collect events and send them to Canva’s backend.

Canva’s servers will first validate the events and make sure that they conform to a predefined schema.
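As a rough illustration of this validation step, here is a minimal sketch. The schema, field names and `validate_event` helper are all hypothetical; the article does not describe Canva's actual schema format or validation code.

```python
from typing import Any

# Hypothetical schema: field name -> required Python type.
EVENT_SCHEMA: dict[str, type] = {
    "event_type": str,
    "user_id": str,
    "timestamp_ms": int,
}


def validate_event(event: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the event is valid."""
    errors = []
    for field, expected_type in EVENT_SCHEMA.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"{field} should be {expected_type.__name__}")
    return errors


good = {"event_type": "click", "user_id": "u1", "timestamp_ms": 1724112000000}
bad = {"event_type": "click", "timestamp_ms": "yesterday"}

print(validate_event(good))
print(validate_event(bad))
```

Rejecting malformed events at the collection endpoint keeps bad data from reaching the stream and the downstream consumers.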

They will then batch the events together (with a few hundred events per batch) and apply zstd compression. Then, Canva’s servers will send the events to a Kinesis Data Stream.
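The batch-then-compress step could look something like the sketch below. Two loud assumptions: the batch size of 300 is just a guess at "a few hundred", and the code uses `zlib` from the Python standard library as a stand-in for zstd so that it runs without extra dependencies. The newline-delimited JSON encoding and the helper names are also hypothetical.

```python
import json
import zlib

BATCH_SIZE = 300  # assumed; the article only says "a few hundred"


def make_batches(events, batch_size=BATCH_SIZE):
    """Yield consecutive slices of at most batch_size events."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]


def encode_batch(batch):
    """Serialize a batch as newline-delimited JSON and compress it."""
    raw = "\n".join(json.dumps(e, separators=(",", ":")) for e in batch)
    return zlib.compress(raw.encode("utf-8"))  # stand-in for zstd


def decode_batch(blob):
    """Inverse of encode_batch, as a consumer would run it."""
    text = zlib.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.split("\n")]


events = [{"event_type": "click", "seq": i} for i in range(650)]
batches = list(make_batches(events))
blob = encode_batch(batches[0])

print([len(b) for b in batches])
print(len(json.dumps(batches[0])), "bytes raw ->", len(blob), "bytes compressed")
```

Each compressed blob would then be written to the stream as a single record, so one record carries hundreds of events.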

An ingestion worker then reads the events from Kinesis and enriches them with additional data. This worker will do things like:

Add country-level geolocation data
Add user device details
Correct any timestamp issues
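Here is a minimal sketch of what these three enrichment steps might look like. Everything in it is hypothetical: the IP prefix table stands in for a real geolocation database, the user agent check is deliberately naive, and the timestamp rule (clamping client timestamps that are in the future) is just one plausible kind of correction, since the article doesn't say which issues Canva fixes.

```python
# Hypothetical lookup table standing in for a real IP geolocation database.
COUNTRY_BY_IP_PREFIX = {"203.0.113.": "AU", "198.51.100.": "US"}


def lookup_country(ip):
    for prefix, country in COUNTRY_BY_IP_PREFIX.items():
        if ip.startswith(prefix):
            return country
    return "unknown"


def device_from_user_agent(user_agent):
    """Very naive device detection, for illustration only."""
    ua = user_agent.lower()
    if "iphone" in ua or "ipad" in ua:
        return "ios"
    if "android" in ua:
        return "android"
    return "other"


def enrich(event, received_at_ms):
    """Return a copy of the event with country, device and a sane timestamp."""
    enriched = dict(event)
    enriched["country"] = lookup_country(event.get("ip", ""))
    enriched["device"] = device_from_user_agent(event.get("user_agent", ""))
    # Assumed rule: a client clock can't be ahead of the server receive time.
    client_ts = event.get("timestamp_ms", received_at_ms)
    enriched["timestamp_ms"] = min(client_ts, received_at_ms)
    return enriched


event = {"ip": "203.0.113.7", "user_agent": "Mozilla/5.0 (iPhone)", "timestamp_ms": 2_000}
print(enrich(event, received_at_ms=1_500))
```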

Canva has a separate ingestion worker do this processing because they wanted to minimize the latency of the collection endpoint in the server. Decoupling the event collection and the event enrichment helps them scale to 25 billion events per day.

After enrichment, the events are sent back to Kinesis. Canva’s router then routes the events to Snowflake. Canva runs their ML models, dashboards and data analytics with Snowflake as the data store.

Some of the event types are also sent to AWS SQS so they can be consumed by other backend services at Canva (that need to process the event data in real time).
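The routing logic described above can be sketched as a simple fan-out. The event type names and the `route` helper are hypothetical, and the destinations are plain in-memory lists rather than real Snowflake or SQS clients.

```python
from collections import defaultdict

# Hypothetical set of event types that other backend services consume.
REALTIME_EVENT_TYPES = {"design_published", "payment_completed"}


def destinations_for(event):
    """Every event goes to Snowflake; some types are also sent to SQS."""
    targets = ["snowflake"]
    if event["event_type"] in REALTIME_EVENT_TYPES:
        targets.append("sqs")
    return targets


def route(events):
    """Group events by destination name."""
    outbox = defaultdict(list)
    for event in events:
        for target in destinations_for(event):
            outbox[target].append(event)
    return dict(outbox)


events = [
    {"event_type": "button_click"},
    {"event_type": "design_published"},
    {"event_type": "page_view"},
]
outbox = route(events)
print({name: len(items) for name, items in outbox.items()})
```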

Minimizing AWS Costs

AWS Kinesis over SQS - In the first version of the data pipeline, Canva used AWS SQS and SNS instead of Kinesis. These were easier to set up; however, the pricing was significantly higher. By switching to Kinesis Data Streams, Canva saw costs drop by 85%.
Event Compression - Canva’s servers will first batch the events (in groups of a few hundred events per batch) and apply zstd compression. These compressed batches will then be sent to Kinesis. Using this strategy (instead of sending each event as a separate record) saves Canva $600k every year in AWS costs.
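To see why batching matters for cost, here is a rough calculation. In provisioned mode, Kinesis Data Streams bills writes in "PUT payload units" of 25 KB, and a record smaller than 25 KB still counts as one full unit. The average event size, batch size and compression ratio below are made-up assumptions for illustration, so this sketch shows the shape of the saving rather than reproducing Canva's $600k figure.

```python
import math

PUT_PAYLOAD_UNIT_BYTES = 25 * 1024  # writes are billed in 25 KB chunks


def put_payload_units(record_bytes):
    """Number of billed units for one record (always at least one)."""
    return max(1, math.ceil(record_bytes / PUT_PAYLOAD_UNIT_BYTES))


events_per_day = 25_000_000_000  # figure quoted in the article
avg_event_bytes = 1_000          # assumed
batch_size = 300                 # assumed ("a few hundred")
compression_ratio = 10           # assumed

# One record per event: every small event is billed as a full unit.
units_unbatched = events_per_day * put_payload_units(avg_event_bytes)

# One record per compressed batch.
batch_bytes = batch_size * avg_event_bytes // compression_ratio
batches_per_day = -(-events_per_day // batch_size)  # integer ceiling division
units_batched = batches_per_day * put_payload_units(batch_bytes)

print(f"unbatched: {units_unbatched:,} units/day")
print(f"batched:   {units_batched:,} units/day")
print(f"reduction: ~{units_unbatched / units_batched:.0f}x")
```

Under these assumptions, batching and compressing cuts the number of billed units by roughly two orders of magnitude, because small events no longer each pay for a mostly empty 25 KB unit.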

Tech Snippets

The art of good code review

It’s a red flag when a software engineering team doesn’t have any code review process set up.

Code review helps prevent bugs and design issues, but it’s also great for reducing knowledge silos and promoting shared ownership of the codebase.

Phil Booth is an ex-Mozilla engineer and he wrote a terrific blog post delving into the art of a good code review process.

Good code review starts with a good PR. The PR should have a clear and detailed description. The change itself should only contain the minimal code necessary to implement the description. There shouldn’t be tangential refactorings.

Code reviewers should carve out time for the review and read through the PR token by token. They should be asking themselves, “Am I happy if I have to maintain this code?”

Read the full blog post for the rest of Phil’s thoughts.

philbooth.me/blog/the-art-of-good-code-review

Go is my hammer, and everything is a nail

Markus is a solo developer who builds digital products. He has written an excellent blog post on why he simplified his tech stack and opted for Go for all his needs.

While you should pick your tech stack based on the problem you need to solve, there are several upsides to sticking to a single programming language, such as:

1. Any modern programming language, including Go, can be used to build everything from games to GUI apps and cloud infrastructure.
2. You can focus on delivering what matters the most instead of context-switching and struggling with multiple different workflows.
3. Deep knowledge of a programming language lets you stay updated with its ecosystem and use its strengths to your full advantage.

Finally, achieving mastery of a growing programming language opens up a massive number of career options.

www.maragu.dev/blog/go-is-my-hammer-and-everything-is-a-nail

Coachability: The Prerequisite To Growth

Cate Huston wrote an insightful blog post giving managers the tools to spot and encourage coachable employees.

Employees can be sorted into different quadrants based on their receptiveness to feedback and their ability to implement actionable changes.

Those who rank high on both metrics offer the least friction.

For employees with high actionability but low receptiveness, managers need to build up trust and resolve any conflicts.

Employees with low actionability but high receptiveness need clear, achievable goals and frequent check-ins.

Finally, employees with low actionability and low receptiveness may be overwhelmed with their responsibilities or bullied by others on the team.

Managers need to understand their employees and break down their resistance for the team’s success.

leaddev.com/mentoring-coaching-feedback/coachability-overlooked-factor-people-development