How Canva Stores Tens of Billions of User Generated Media

Canva is an online design platform with over 100 million monthly active users. They wrote a blog post detailing their migration from MySQL to DynamoDB.

Hey Everyone!

Today we’ll be talking about

  • How Canva Scaled their Media Storage Service

    • Canva is an online design platform and they store tens of billions of media items (pictures, videos, icons, etc.)

    • Previously, they relied on MySQL on AWS RDS for storing data but they faced scaling issues. They evaluated several solutions (sharding, switching to a different service, etc.) and decided on migrating to DynamoDB

    • Their engineering team wrote a great blog post delving into what scaling problems they faced, why they made the switch, and how they did the migration.

  • Tech Snippets

    • Upgrading Lichess to Scala 3

    • How Grab uses Graphs for Fraud Detection

    • Using the Type System Effectively

Learn How Teams Use Feature Flags to Ship Products More Confidently

LaunchDarkly surveyed over 1,000 software teams to get the key statistics shaping how CTOs view the value of feature management.

They published the results in an extensive report on the state of Feature Management in 2022.

The report covers

  • how IT departments use feature flags in their codebase

  • how much teams spend on feature management tools

  • the Mean Time To Resolution for issues for those with/without feature management

and much more!

sponsored

Scaling Media Storage at Canva

Canva is an online design platform that allows users to create posters, social media ads, presentations, and other types of graphics. The company has more than 100 million monthly active users, who collectively upload more than 50 million new pieces of media (pictures, videos, logos, etc.) every day.

Canva aggregates all this media to provide a massive marketplace of user-generated content that you can use in your own designs. They now store over 25 billion pieces of user-uploaded media.

Previously, Canva used MySQL hosted on AWS RDS to store media metadata. However, as the platform continued to grow, they faced numerous scaling pains with their setup.

To address their challenges, Canva switched to DynamoDB.

Robert Sharp and Jacky Chen are software engineers at Canva, and they wrote a great blog post detailing their MySQL scaling pains, why they picked DynamoDB, and how they handled the migration.

Here's a summary

Canva uses a microservices architecture, and the media service is responsible for managing operations on the state of media resources uploaded to the service.

For each piece of media, the media service manages

  • media ID

  • ID of the user who uploaded it

  • status (active, trashed, pending deletion)

  • title, keywords, color information, etc.

and more.
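
To make that concrete, a media metadata record might look roughly like the sketch below. The field names and values are illustrative, not Canva's actual schema.

```python
# Illustrative shape of a media metadata record -- not Canva's actual schema.
media_item = {
    "mediaId": "M12345",          # unique media ID (used as the partition key)
    "userId": "U67890",           # ID of the user who uploaded it
    "status": "ACTIVE",           # ACTIVE | TRASHED | PENDING_DELETION
    "title": "Summer sale banner",
    "keywords": ["summer", "sale", "banner"],
    "colors": ["#FF6F61", "#FFFFFF"],
    "createdAt": "2022-08-01T10:15:00Z",
    "updatedAt": "2022-08-01T10:15:00Z",
}
```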

The media service serves far more reads than writes. Most of the media is rarely modified after it's created and most media reads are of content that was created recently.

Previously, the Canva team used MySQL on AWS RDS to store this data. They handled growth by first scaling vertically (using larger instances) and then by scaling horizontally (introducing eventually consistent reads that went to MySQL read replicas).

However, they started facing some issues

  • schema change operations on the largest media tables took days

  • they approached the limits of RDS MySQL's Elastic Block Store (EBS) volume size (~16 TB at the time of the migration; the limit is now ~64 TB)

  • each increase in EBS volume size resulted in a small increase in I/O latency

  • servicing normal production traffic required a hot buffer pool (caching data and indexes in RAM) so restarting/upgrading machines was not possible without significant downtime for cache warming.

and more (check out the full blog post for a more extensive list).

The engineering team began to look for alternatives, with a strong preference for approaches that allowed them to migrate incrementally and not place all their bets on a single unproven technology choice.

They also took several steps to extend the lifetime of the MySQL solution, including

  • Denormalizing tables to reduce lock contention and joins

  • Rewriting code to minimize the number of metadata updates

  • Removing foreign key constraints

They also implemented a basic sharding solution, where they split the database into multiple SQL databases based on the media ID (the partition key).
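
As a rough illustration of that kind of sharding (not Canva's actual implementation), routing by partition key can be as simple as hashing the media ID to pick a shard:

```python
import hashlib

# Hypothetical shard connection strings.
SHARDS = [
    "mysql://media-shard-0.internal/media",
    "mysql://media-shard-1.internal/media",
    "mysql://media-shard-2.internal/media",
    "mysql://media-shard-3.internal/media",
]

def shard_for(media_id: str) -> str:
    """Deterministically map a media ID (the partition key) to one shard."""
    digest = hashlib.sha1(media_id.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]
```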

In parallel, the team also investigated and prototyped different long-term solutions. They looked at sharding MySQL (either DIY or using a service like Vitess), Cassandra, DynamoDB, CockroachDB, Cloud Spanner and more. If you're subscribed to Quastor Pro, check out our past explanation on Database Replication and Sharding.

They created a comparison table of these options to illustrate their thinking process (you can view the full table in the original blog post).

Based on the tradeoffs above, they picked DynamoDB as their tentative target. They had previous experience running DynamoDB services at Canva, so that would help make the ramp up faster.

Migration

For the migration process, the engineering team wanted to shed load from the MySQL cluster as soon as possible.

They wanted an approach that would

  • Migrate recently created/updated/read media first, so they could shed load from MySQL

  • Progressively migrate

  • Have complete control over how they mapped data into DynamoDB

Updates to the media are infrequent and the Canva team found that they didn't need to go through the difficulty of producing an ordered log of changes (like using a MySQL binlog parser).

Instead, they made changes to the media service to enqueue any changes to an AWS SQS queue. These enqueued messages would identify the particular media items that were created/updated/read.

Then, a worker instance would process these messages and read the current state from the MySQL primary for that media item and write it to DynamoDB.
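
A stripped-down version of that worker might look like the sketch below. It's a generic illustration using boto3; the queue URL, table name, and fetch_media_from_mysql helper are hypothetical, not Canva's actual code.

```python
import json
import boto3

sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")
media_table = dynamodb.Table("media")  # hypothetical table name

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/media-replication"  # hypothetical

def fetch_media_from_mysql(media_id):
    """Hypothetical helper: read the current row for this media item from the MySQL primary."""
    raise NotImplementedError

def run_worker():
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            media_id = json.loads(msg["Body"])["mediaId"]
            # Read the latest state from the source of truth (the MySQL primary)...
            item = fetch_media_from_mysql(media_id)
            # ...and upsert it into DynamoDB. Copying the whole item keeps the operation idempotent.
            media_table.put_item(Item=item)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```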

Most media reads are of recently created media, so this was the best way to quickly populate DynamoDB with the most frequently read metadata. It was also easy to pause or slow down the worker instances if the MySQL primary came under high load from Canva users.

They also had a low priority SQS queue that they used to backfill previous media data over to DynamoDB. A scanning process would scan through MySQL and publish messages with the media content ID to the low priority queue.

Worker instances would process these messages at a slower rate than the high priority queue.
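
The backfill side can be sketched the same way: a scanner pages through the MySQL media table in ID order and publishes each ID to the low priority queue (again, the queue URL and helper below are hypothetical).

```python
import json
import boto3

sqs = boto3.client("sqs")
LOW_PRIORITY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/media-backfill"  # hypothetical

def fetch_media_ids_after(last_id, limit):
    """Hypothetical helper: return the next `limit` media IDs after last_id, in primary key order."""
    raise NotImplementedError

def scan_and_enqueue_backfill(batch_size=1000):
    """Walk the MySQL media table and enqueue every ID for low-priority replication."""
    last_id = ""
    while True:
        ids = fetch_media_ids_after(last_id, batch_size)
        if not ids:
            break
        for media_id in ids:
            sqs.send_message(
                QueueUrl=LOW_PRIORITY_QUEUE_URL,
                MessageBody=json.dumps({"mediaId": media_id}),
            )
        last_id = ids[-1]
```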

Testing

To test the AWS SQS replication process, they implemented a dual read and comparison process to compare results from MySQL and DynamoDB and identify any discrepancies.
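
A dual read check like that can be sketched as reading from both stores and logging any mismatch. The read helpers below are hypothetical stand-ins, not Canva's code.

```python
import logging

logger = logging.getLogger("dual_read")

def read_media_from_mysql(media_id):
    """Hypothetical helper: read the media item from MySQL."""
    raise NotImplementedError

def read_media_from_dynamodb(media_id):
    """Hypothetical helper: read the media item from DynamoDB."""
    raise NotImplementedError

def dual_read(media_id):
    """Serve from MySQL (still the source of truth) while comparing against DynamoDB."""
    mysql_item = read_media_from_mysql(media_id)
    dynamo_item = read_media_from_dynamodb(media_id)
    if mysql_item != dynamo_item:
        # Surface replication bugs so they can be tracked down and fixed.
        logger.warning("media %s differs between MySQL and DynamoDB", media_id)
    return mysql_item
```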

After resolving replication bugs, they began serving eventually consistent reads of media metadata from DynamoDB.

The final part of the migration was switching over all writes to be done on DynamoDB. This required writing new service code to handle the create/update requests.

To mitigate any risks, the team first did extensive testing to ensure that their service code was functioning properly.

They wrote a runbook for the cutover and used a toggle to allow switching back to MySQL within seconds if required. They rehearsed the runbooks in a staging environment before rolling out the changes.
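
The toggle can be as simple as a write-path switch that's checked on every request, so rolling back doesn't require a deploy. The sketch below is a generic illustration (the flag lookup and helpers are hypothetical), not Canva's implementation.

```python
def write_media_to_mysql(item):
    """Hypothetical helper: the original MySQL write path."""
    raise NotImplementedError

def write_media_to_dynamodb(item):
    """Hypothetical helper: the new DynamoDB write path."""
    raise NotImplementedError

def writes_go_to_dynamodb() -> bool:
    """Hypothetical flag lookup; in practice this would hit a feature flag or config service."""
    return True

def save_media(item):
    # The toggle is evaluated per request, so flipping it back takes effect within seconds.
    if writes_go_to_dynamodb():
        write_media_to_dynamodb(item)
    else:
        write_media_to_mysql(item)
```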

The cutover ended up being seamless, with no downtime or errors.

Results

Canva's monthly active users have more than tripled since the migration, and DynamoDB has been rock solid for them. It has autoscaled as they've grown and costs less than the AWS RDS clusters it replaced.

The media service now stores more than 25 billion user-uploaded media items, with another 50 million uploaded daily.

How did you like this summary?

Your feedback really helps me improve curation for future emails.



Tech Snippets

  • Upgrading Lichess to Scala 3 - Lichess is a free, open-source chess website with tens of millions of users. They recently completed a major upgrade by migrating to Scala 3, which brought improvements around Scala's type system, syntax, JVM performance, and more. The migration had a brief hiccup where JVM CPU usage spiked with no apparent cause, but that was solved with some tuning. Lichess is now much faster, and the entire upgrade took a month.

  • How Grab uses Graph Data Structures for Fraud Detection - Grab operates ride hailing, grocery delivery, and financial services businesses in Southeast Asia. To identify fraudsters, they rely on graph neural network (GNN) models, which can spot complicated fraud patterns effectively. GNNs require far less feature engineering than rule engines or decision trees while still capturing complex relationships. The authors go into detail on why they picked GNNs, how they train them, and the benefits compared to other approaches.

  • The Type System is a Programmer's Best Friend - This is an interesting blog post on using type systems more effectively. If you're storing a user's email address, don't just rely on a string. Use a dedicated type and add useful functionality, such as a .Domain() method that returns the email's domain (gmail, outlook, etc.). Good types can prevent many future bugs and make your team much more efficient (see the quick sketch below).
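
As a quick illustration of the idea (in Python, with made-up names rather than anything from the original post):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmailAddress:
    value: str

    def __post_init__(self):
        # Validate once at construction so every EmailAddress is known to be well-formed.
        if "@" not in self.value:
            raise ValueError(f"not an email address: {self.value!r}")

    def domain(self) -> str:
        """Return the part after the '@', e.g. 'gmail.com'."""
        return self.value.rsplit("@", 1)[1]

print(EmailAddress("alice@gmail.com").domain())  # gmail.com
```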

System Design Articles

If you'd like weekly articles on System Design concepts (in addition to the Quastor summaries), consider upgrading to Quastor Pro!

Past articles cover topics like

  • Load Balancers - L4 vs. L7 load balancers and load balancing strategies like Round Robin, Least Connections, Hashing, etc.

  • Database Storage Engines - Different types of Storage Engines and the tradeoffs made. Plus a brief intro to MyISAM, InnoDB, LevelDB and RocksDB.

  • Log Structured Merge Trees - Log Structured Storage Engines and how the LSM Tree data structure works and the tradeoffs it makes.

  • Chaos Engineering - The principles behind Chaos Engineering and how to implement. Plus case studies from Facebook, LinkedIn, Amazon, and more.

  • Backend Caching Strategies - Ways of caching data and their tradeoffs plus methods of doing cache eviction.

  • Row vs. Column Oriented Databases - How OLTP and OLAP databases store data on disk and different file formats for storing and transferring data.

  • API Paradigms - Request-Response APIs with REST and GraphQL vs. the Event Driven paradigm with Webhooks and Websockets.

It's $12 a month or $10 per month if you pay yearly. Thanks a ton for your support.

I'd highly recommend using your Learning & Development budget for Quastor Pro!

Interview Question

Given an integer array nums that may contain duplicates, return all possible subsets.

The solution set must not contain duplicate subsets. Return the solution in any order.

Previous Question

As a reminder, here’s our last question

You are given a binary tree.

Each node in the tree has a next pointer that should point to the node immediately to its right on the same level (or stay Null if there is no such node), but all the next pointers are currently set to Null.

Write a function that sets each next pointer to the correct node.

Solution

The “brute force” way of solving this question is with a level-order traversal.

We can use a queue to keep track of our nodes and then iterate through the tree level by level.

While we’re iterating through each level, we’ll also set the next pointers.

Here’s the Python 3 code.
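
One way to write that traversal, assuming the standard Node definition for this problem (val, left, right, and next fields):

```python
from collections import deque

class Node:
    def __init__(self, val=0, left=None, right=None, next=None):
        self.val = val
        self.left = left
        self.right = right
        self.next = next

def connect(root):
    """Set each node's next pointer using a level-order (BFS) traversal."""
    if not root:
        return root
    queue = deque([root])
    while queue:
        prev = None
        # Process exactly one level per pass; prev trails the node just popped.
        for _ in range(len(queue)):
            node = queue.popleft()
            if prev:
                prev.next = node
            prev = node
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
        # The last node of each level keeps next = None.
    return root
```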

This solution runs in O(n) time, but the queue also takes O(n) space.

Can we improve on space complexity?

We can! We can solve this problem with constant space complexity.

Rather than maintaining a queue, we’ll use 3 pointers.

cur will point to the current node.

head will point to the head node of the next level.

prev will point to the last node we’ve connected so far on the next level.

We’ll be iterating through each level and setting the next pointers on the following level.

The key is, instead of using a queue to iterate through each level, we can just use the next pointers that we’ve been setting!

Here’s the Python 3 code.
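
One way to write the constant-space version, reusing the Node class from the snippet above:

```python
def connect(root):
    """Set next pointers level by level using only three pointers (O(1) extra space)."""
    cur = root       # node we're walking on the current level
    head = None      # first node of the next level
    prev = None      # last node linked so far on the next level

    while cur:
        while cur:
            # Link this node's children into the next level's chain.
            if cur.left:
                if prev:
                    prev.next = cur.left
                else:
                    head = cur.left
                prev = cur.left
            if cur.right:
                if prev:
                    prev.next = cur.right
                else:
                    head = cur.right
                prev = cur.right
            # Move along the current level using next pointers set on a previous pass.
            cur = cur.next
        # Drop down to the next level and reset the helpers.
        cur = head
        head = None
        prev = None
    return root
```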

Time complexity is O(n) and space complexity is O(1).