How Canva Saved Millions on Data Storage

An overview of AWS S3 and how Canva changed their setup. Plus, the problem with LangChain, no more Postgres VACUUM, Distributed Systems for Fun and Profit and more.

July 17, 2023

Hey Everyone!

Today we’ll be talking about

How Canva Saved Millions on Data Storage
- A brief overview of AWS S3
- How Canva uses S3
- Analyzing Canva’s data access patterns and storage classes
- Transitioning data to Glacier Instant Retrieval and the cost savings
Tech Snippets
- The Problem with LangChain
- No More Postgres VACUUM
- Distributed Systems for Fun and Profit
- LazyVim
- There is No Data Engineering Roadmap

How Canva Saved Millions on Data Storage

Canva is an online platform that lets you easily create presentations, diagrams, social media posters, flyers and other graphics.

They have a ton of pre-built templates, stock photos/videos, fonts, etc. so you can quickly create a presentation that doesn’t look like it was designed by a 9-year-old.

Canva has over 100 million monthly active users with tens of billions of designs created on the platform. They have over 75 million stock photos and graphics on the site.

They run most of their production workloads on AWS and are heavy users of services like

S3 - for storing graphics, photos, videos, etc.
ECS (Elastic Container Service) - for compute. For example, they use ECS for handling GPU-intensive tasks like image processing.
RDS (Relational Database Service) - for storing data on users and more.
DynamoDB - key-value store that Canva uses for storing media metadata (title, artist, keywords) and more

In November of 2021, AWS launched a new storage tier of S3 called Glacier Instant Retrieval. This offered low-cost archive storage that also had low latency (milliseconds).

Canva analyzed their data storage/access patterns and estimated the cost savings of switching to this tier. They also looked at the cost of switching and calculated the ROI.

The company was able to save $3.6 million annually by migrating over a hundred petabytes to S3 Glacier Instant Retrieval.

Josh Smith is an Engineering Manager at Canva and he wrote a fantastic blog post delving into how Canva tracked their data access patterns, estimated the ROI of switching, and the migration process.

Here’s a summary of the blog post with additional context

Brief Overview of AWS S3

(You might want to skim/skip over this section if you’re experienced with AWS)

AWS S3 (Simple Storage Service) is one of the first cloud services Amazon launched (back in 2006).

It’s an object storage service, so you can use it to store any type of data. It’s commonly used to store things like images, videos, log files and backups. You can store any file on S3 as long as it’s no larger than 5 terabytes.

S3 provides

Cost Effective - S3 can be a cheap way to store large amounts of data. There are different storage tiers based on your latency requirements (discussed below) and the pricing is a couple of cents to store a gigabyte per month.
High Durability - AWS provides 11 9’s of durability, so it’s very safe and it’s extremely unlikely that you’ll lose data. That being said, you should still have backups.
Reliable - AWS provides at least 3 9’s of availability (99.9%), which equates to about 43 minutes of downtime per month. If they’re down for longer than that, then Amazon will write you a heartfelt apology and compensate you for any losses your business incurred from their mistake. Just kidding, they’ll give you a small fraction of your bill back as AWS credits.
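
The downtime figure is just arithmetic on the availability percentage. Here’s a quick sketch, assuming a 30-day month (the function name is made up for illustration).

```python
def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability)


for label, availability in [("three nines", 0.999), ("four nines", 0.9999)]:
    minutes = allowed_downtime_minutes(availability)
    print(f"{label}: {minutes:.1f} minutes per 30 days")
```

Each extra nine cuts the allowed downtime by a factor of ten.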

With S3, you create a bucket (like a folder in a file system) and upload your files there. Each file is given a key and, if versioning is enabled on the bucket, a version ID. You use the file’s bucket, key and (optionally) version ID to access it.

The file is immutable, so if you want to change it then you’ll have to upload the entire changed file again.
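
To make the key/version ID/immutability idea concrete, here’s a toy in-memory model. This is not the S3 API; the ToyBucket class and its methods are invented purely for illustration, and it behaves as if versioning is always on.

```python
import uuid
from typing import Optional


class ToyBucket:
    """In-memory stand-in for a versioned bucket (NOT the real S3 API)."""

    def __init__(self) -> None:
        # key -> list of (version_id, data), oldest first
        self._versions: dict = {}

    def put(self, key: str, data: bytes) -> str:
        """Store a complete object and return its new version ID.

        Nothing is edited in place: every put adds a whole new version.
        """
        version_id = uuid.uuid4().hex
        self._versions.setdefault(key, []).append((version_id, bytes(data)))
        return version_id

    def get(self, key: str, version_id: Optional[str] = None) -> bytes:
        """Return the latest version, or a specific one if version_id is given."""
        versions = self._versions[key]  # raises KeyError for unknown keys
        if version_id is None:
            return versions[-1][1]
        for vid, data in versions:
            if vid == version_id:
                return data
        raise KeyError(f"{key} has no version {version_id}")


bucket = ToyBucket()
first = bucket.put("designs/poster.png", b"draft")
second = bucket.put("designs/poster.png", b"final")  # full re-upload, new version
print(bucket.get("designs/poster.png"))          # latest version
print(bucket.get("designs/poster.png", first))   # older version is still intact
```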

Pricing is mainly based on

Storage - you’re charged per month per gigabyte you use
Requests - AWS charges you for each GET and PUT request made. Frequently uploading or accessing data will increase your costs.
Data Transfer - Charges apply when you move data out of S3. You’re billed per gigabyte that you move out. This can be very high.

AWS provides many different storage classes for storing your data. Each storage class has tradeoffs in terms of latency and pricing.

Some of the classes are

S3 Standard - general purpose option for storing data that is frequently accessed. Storage costs a couple of cents per gigabyte per month and GETs/PUTs are a fraction of a cent per 1,000 requests.
S3 Standard-Infrequent Access - This is for when you need to store data that isn’t accessed as frequently. Compared to S3 Standard, the cost per gigabyte per month of storage is cheaper, but the cost of uploading and accessing the data is more expensive.
S3 Glacier Instant Retrieval - Data storage per month is significantly cheaper compared to S3 Standard, but uploading and accessing data is also significantly more expensive.
S3 Glacier Deep Archive - This is the lowest cost storage option but retrieving the data can have a latency of hours.
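
Here’s a rough sketch of that tradeoff in numbers. The prices below are assumptions (approximate us-east-1 list prices that may be out of date), and the model ignores PUT costs, minimum storage durations and data transfer, so treat it as an illustration rather than a pricing calculator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassPricing:
    storage_per_gb_month: float  # USD per GB stored per month
    get_per_1000: float          # USD per 1,000 GET requests
    retrieval_per_gb: float      # USD per GB retrieved


# Assumed, approximate prices; check the AWS pricing page before relying on them.
PRICING = {
    "STANDARD": ClassPricing(0.023, 0.0004, 0.0),
    "STANDARD_IA": ClassPricing(0.0125, 0.001, 0.01),
    "GLACIER_IR": ClassPricing(0.004, 0.01, 0.03),
}


def monthly_cost(storage_class: str, stored_gb: float, gets: int, retrieved_gb: float) -> float:
    """Storage + GET request + retrieval cost for one month."""
    p = PRICING[storage_class]
    return (
        stored_gb * p.storage_per_gb_month
        + gets / 1000 * p.get_per_1000
        + retrieved_gb * p.retrieval_per_gb
    )


scenarios = {
    "cold (rarely read)": dict(stored_gb=1000, gets=1_000, retrieved_gb=1),
    "hot (read constantly)": dict(stored_gb=1000, gets=10_000_000, retrieved_gb=5000),
}
for name, usage in scenarios.items():
    print(name)
    for storage_class in PRICING:
        print(f"  {storage_class}: ${monthly_cost(storage_class, **usage):,.2f}")
```

With these assumed prices, the cold data is cheapest in Glacier Instant Retrieval and the hot data is cheapest in S3 Standard, which is the whole point of having multiple classes.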

Read more about the storage classes here.

You can create rules to automatically move data between storage classes using S3 lifecycle policies.
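
As a sketch, a lifecycle policy is just a small configuration document. The snippet below builds one in the shape that boto3’s put_bucket_lifecycle_configuration call accepts. The rule ID, bucket name and day thresholds are made-up examples, not anything from Canva’s setup.

```python
import json


def build_lifecycle_config(days_to_ia: int = 30, days_to_glacier_ir: int = 90) -> dict:
    """Build a lifecycle configuration that tiers objects down as they age."""
    return {
        "Rules": [
            {
                "ID": "tier-down-old-objects",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix = every object in the bucket
                "Transitions": [
                    {"Days": days_to_ia, "StorageClass": "STANDARD_IA"},
                    {"Days": days_to_glacier_ir, "StorageClass": "GLACIER_IR"},
                ],
            }
        ]
    }


config = build_lifecycle_config()
print(json.dumps(config, indent=2))

# Applying it needs boto3 and AWS credentials, so it is left commented out:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-example-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=config,
# )
```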

Alrighty, back to Canva.

How Canva uses S3

Canva stores over 230 petabytes in S3, with their largest bucket coming in at 45 petabytes.

They use many different storage tiers to minimize their cost.

S3 Standard - Canva stores stock photos/videos and templates in this storage tier. The data is accessed many times per day, so they need to minimize the cost and latency of PUTs/GETs.
S3 Standard-Infrequent Access - Canva uses this to store old user-created projects, images and media. A user will access their project very frequently when it’s first created. After a few weeks, the user will finish the project and rarely open it again. Therefore, the project will first be in S3 Standard and will be moved to S3 Standard-IA after a few weeks by an S3 lifecycle policy.
S3 Glacier Flexible Retrieval - Canva also archives logs and backups on S3. They rarely access this data and latency doesn’t matter so they use Glacier Flexible Retrieval. They still get the data within minutes/hours and it’s very cheap to store.

Migrating to S3 Glacier Instant Retrieval

In November 2021, AWS launched S3 Glacier Instant Retrieval. This gives you an extremely low storage cost per gigabyte per month. In addition, data in Glacier Instant Retrieval can be retrieved instantly (within milliseconds), whereas Glacier Flexible Retrieval can take hours. The downside is that retrieval for this storage class is extremely expensive (around 25 times more expensive compared to S3 Standard).

Canva had to figure out whether it would make financial sense to migrate data to the Glacier Instant Retrieval class. To do this, they used S3 Storage Class Analysis, which you can turn on at a per-bucket level.

With this, Canva made several observations

Retrieval of user project data fell dramatically after the first 15 days, which suggests that users finished up their projects within the first 2 weeks
The rate of retrieval for data in S3 Standard-Infrequent Access class didn’t change. Users were equally likely to open up a past project a month after they finished it versus a year after they finished it.
For a typical bucket, around 10% of the data was stored in S3 Standard, whereas 90% was stored in S3 Standard-IA. However, 70% of all accessed data for that bucket came from S3 Standard.
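
To see how lopsided that last observation is, you can compare each class’s share of accesses to its share of stored data. This is a back-of-the-envelope sketch that assumes the bucket only holds these two classes, so the remaining 30% of accesses all hit Standard-IA.

```python
# Shares for a "typical bucket", taken from the observations above.
storage_share = {"STANDARD": 0.10, "STANDARD_IA": 0.90}
# Assumption: the 30% of accesses not served by Standard all go to Standard-IA.
access_share = {"STANDARD": 0.70, "STANDARD_IA": 0.30}

# Access density = share of accesses per unit share of stored data.
density = {cls: access_share[cls] / storage_share[cls] for cls in storage_share}
ratio = density["STANDARD"] / density["STANDARD_IA"]

for cls, value in density.items():
    print(f"{cls}: {value:.2f} units of access per unit of storage")
print(f"Standard data is accessed about {ratio:.0f}x as often, byte for byte")
```

Under that assumption, a byte in S3 Standard gets accessed roughly 21 times as often as a byte in Standard-IA, so the bulk of the bucket is cold data paying for access performance it barely uses.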

Based on this (and some more data crunching), Canva decided that it would be cost effective to shift low-access data to Glacier Instant Retrieval.

Unfortunately, shifting S3 data from one storage class to another isn’t free. In fact, moving all of Canva’s 300 billion objects from other storage classes to the Glacier Instant Retrieval class would cost over $6 million. Not fun.

However, the cost of transferring data between storage classes is billed per 1,000 objects. The size of the objects doesn’t matter, so you can get the biggest bang for your buck by transferring over the largest objects.
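
Here’s a quick sketch of that math. The $0.02 per 1,000 objects transition price is an assumption (an approximate list price, not a number from Canva’s post), and the function names are made up.

```python
# Assumed price for lifecycle transitions into Glacier Instant Retrieval,
# in USD per 1,000 objects. Check current AWS pricing for your region.
TRANSITION_PRICE_PER_1000 = 0.02


def transition_cost(object_count: float, price_per_1000: float = TRANSITION_PRICE_PER_1000) -> float:
    """Cost of transitioning `object_count` objects, regardless of their size."""
    return object_count / 1000 * price_per_1000


def cost_per_tb_moved(avg_object_kb: float, price_per_1000: float = TRANSITION_PRICE_PER_1000) -> float:
    """Transition cost per terabyte moved for a given average object size."""
    objects_per_tb = 1_000_000_000 / avg_object_kb  # 1 TB = 1e9 KB (decimal units)
    return transition_cost(objects_per_tb, price_per_1000)


print(f"300 billion objects: ${transition_cost(300_000_000_000):,.0f}")
for size_kb in (40, 400, 4000):
    print(f"avg {size_kb} KB objects: ${cost_per_tb_moved(size_kb):,.2f} per TB moved")
```

At the assumed price, 300 billion objects works out to roughly $6 million. The per-terabyte cost drops tenfold every time the average object gets ten times bigger, which is why large objects are the ones worth moving.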

Based on this, Canva decided to target buckets with an average object size of 400 KB or more. This would show a positive return on investment (the storage class transfer costs) within 6 months or less.
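
A threshold like this comes from a break-even calculation: how many months of storage savings does it take to pay back the one-time transition fee for an object? Below is a simplified sketch. It assumes the data moves from Standard-IA to Glacier Instant Retrieval, uses approximate list prices ($0.0125 and $0.004 per GB per month, $0.02 per 1,000 transitions), uses decimal units, and ignores retrieval and request costs. It is not Canva’s actual model.

```python
def payback_months(
    avg_object_kb: float,
    transition_price_per_1000: float = 0.02,        # assumed USD per 1,000 objects
    saving_per_gb_month: float = 0.0125 - 0.004,    # assumed Standard-IA minus Glacier IR
) -> float:
    """Months of storage savings needed to recoup one object's transition fee."""
    transition_cost_per_object = transition_price_per_1000 / 1000
    object_gb = avg_object_kb * 1_000 / 1_000_000_000  # KB -> GB, decimal units
    monthly_saving_per_object = object_gb * saving_per_gb_month
    return transition_cost_per_object / monthly_saving_per_object


for size_kb in (100, 400, 1000):
    print(f"avg {size_kb} KB: pays back in {payback_months(size_kb):.1f} months")
```

Under these assumptions, a 400 KB object pays back in just under 6 months, while smaller objects take proportionally longer.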

Conclusion

Canva has already transferred over 130 petabytes to S3 Glacier Instant Retrieval. It cost them $1.6 million to transition, but it’ll save them $3.6 million a year.
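
As a sanity check on those two numbers, here is the overall payback period, assuming the savings accrue evenly over the year.

```python
transition_cost_usd = 1_600_000   # one-time cost, from the figures above
annual_savings_usd = 3_600_000    # yearly savings, from the figures above

# Assumes savings are spread evenly across the 12 months.
payback_in_months = transition_cost_usd / annual_savings_usd * 12
first_year_net_usd = annual_savings_usd - transition_cost_usd

print(f"Payback period: {payback_in_months:.1f} months")
print(f"Net savings in the first year: ${first_year_net_usd:,}")
```

That comes out to a bit over 5 months, consistent with the 6-months-or-less target.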

For more details, please read the full blog post here.

Tech Snippets

The Problem With LangChain

If you’re building applications with Large Language Models, then you’ve probably heard of LangChain. This library makes it easy to hook up to common LLM APIs and it’ll handle common tasks (splitting up your text so it fits in the context window, managing prompts, creating workflows and more).

Max Woolf tried to build a chat app with LangChain and he quickly ran into problems. After he stopped using LangChain in his workflow, he had a significantly easier time building the app. He wrote a blog post delving into some of the issues with the library.

minimaxir.com/2023/07/langchain-problem

PostgreSQL: No More VACUUM

When you use Postgres, you need to purge old, unneeded data through a VACUUM process. This can consume substantial system resources and can also lead to issues with very large Postgres databases.

OrioleDB is a new open source engine for Postgres that aims to eliminate the need for the VACUUM process. This blog post delves into how OrioleDB does this.

www.orioledata.com/blog/no-more-vacuum-in-postgresql

Distributed Systems for Fun and Profit

This is a great series of in-depth blog posts that give an introduction to working with distributed systems.

Topics covered include a discussion around time (vector clocks, global vs. local vs. no-clock assumptions), replication (synchronous, asynchronous, partition tolerance), CRDTs and more!

book.mixu.net/distsys/

LazyVim

If you’ve played around with vim and are looking to incorporate it into your workflow, then LazyVim could be a great way to get started.

You’ll quickly get set up with an IDE-like environment that you can easily extend and modify. It’s also very fast!

www.lazyvim.org

There is No Data Engineering Roadmap

You’ll frequently see posts on social media with some type of “roadmap” of what you need to learn to get into Data Engineering.

This is a great blog post that argues the only skill you really need to get started is SQL. Then, play around with different databases like Postgres, MySQL, etc.

Don’t worry too much about things like Python, Pandas, dbt, Airflow, Spark, etc. Learn those things on the job if they’re necessary.