How Canva Built a Reverse Image Search System

Hey Everyone!

Today we’ll be talking about

  • How Canva Built a Reverse Image Search System

    • Canva has a massive library of images and they wanted to build a reverse image search system that could take in an image and return similar images in the dataset.

    • They used Perceptual Hashing and Hamming Distance to build the system.

    • The system was built on DynamoDB with Multi-Index Hashing

  • Tech Snippets

    • Lessons Learned while Migrating from Java to Kotlin at Facebok

    • An Introduction to JIT Compilers

    • How Monzo Deploys to Production Over 100 Times a Day

    • Short, Interesting Snippets of Python Code

How to make your Multi-Factor Authentication Unphishable

Last month, Uber had a massive data breach where a hacker gained complete access to internal systems including several databases, employee email, backend source code and more.

The attack was caused by the use of phishable user authentication factors, a flaw in how multi-factor authentication (MFA) is usually implemented. Although still more secure than single-factor passwords, MFA flows like SMS passcodes, email magic links, push notifications, etc. can be susceptible to social engineering hacks.

The Web Authentication API (WebAuthn) is a W3C standard that’s meant to solve this vulnerability.

It uses public-key cryptography, where a user submits built-in biometrics like FaceID to create a digital signature using the private key stored on their device. This digital signature is then sent to the web server where it’s verified using the public key.

Adoption of WebAuthn is ramping up quickly with biometrics growing more prevalent (like Apple introducing Passkeys) and the rise of developer-friendly auth platforms (like Stytch).

sponsored

How Canva Built a Reverse Image Search System

Canva develops a web app that you can use to create social media graphics, posters, brochures and other visual content. The company is valued at $26 billion and has over 75 million monthly active users.

Their web app has a huge library of images that you can use when creating your graphics. Many of these images are contributed by Canva users, so the company has to moderate this library and make sure there isn't any inappropriate content.

Additionally, they also want to remove any duplicate images to minimize storage costs. You might have an almost-identical pair of images in the library where one of the images has a watermark. Engineers would like to identify that pair and remove one of the images.

In order to do this, Canva built a Reverse Image Search platform, where you can give the service an image and it’ll search the image library for the other images that are the most visually similar.

If these two images below were in Canva’s image library, the service would be able to return the watermarked image given the first one as a prompt (or vice versa).

Canva wrote a great blog post on how they built the system, which you can check out here.

Some of the technologies/algorithms Canva used are

  • Perceptual Hashing

  • Hamming Distance

  • Multi-index Hashing

We’ll go through each of these and talk about how Canva used it to build their system.

Hashing

Hash functions are commonly used when comparing different files. You’ll see cryptographic hash functions like MD5 and SHA used to checksum files, where you generate unique strings for all the different files in your system by hashing the raw bytes.

However, this strategy doesn’t work for reverse image search. An important property of cryptographic hash functions is the Avalanche effect, where a small change in the input will result in a completely different hash function output.

If one pixel changes color slightly in the image, then the resulting cryptographic hash function output will be completely different (using something like MD5 or SHA). For reverse image search, Canva wanted the hash output to be extremely similar for images that are visually close.

Perceptual Hashing

Perceptual Hash functions are part of a field of hashing that is locality-sensitive (similar inputs will receive similar hash function outputs). These hash functions generate their output based on the actual pixel values in the image so the output depends on the visual characteristics of the image. pHash is one popular Perceptual Hash function that is used in the industry.

Here's a good blog post that delves into how pHash works; a TLDR is that it relies on Discrete Cosine Transforms (DCTs). DCT forms the basis for many compression algorithms; Computerphile made a great video on how the DCT is used for JPEG.

If you want an algorithm that checks if two images are visually similar, you can do so by first calculating the pHash of both images and then finding the Hamming distance between the two pHashes.

The Hamming distance between two strings calculates the minimum number of character substitutions that are required to change one string into the other. So “cat” and “bat” have a Hamming distance of 1. This is a great way for calculating how similar two strings are.

Matching Perceptual Hashes

Canva’s goal was to create an image matching system which can take a perceptual hash and a Hamming distance as input.

Then, the system would compare that hash and Hamming distance to all the stored image hashes in the database and output all the stored images within the Hamming distance.

In order to build this, Canva implemented an algorithm called multi-index hashing.

This is where you take the stored perceptual hashes and split each hash up into n segments.

Here’s an example where the perceptual hash of each image is split into 4 segments.

Then, you can store each of these segments in different database shards (where all the first segments go on one shard, all the second segments go on another, and so on).

When you want to run a reverse image search, you find the perceptual hash of the input image and split that hash up into the same number of segments.

For each segment in the input hash, you’ll query the corresponding database shard and find any stored hash segments that match. You can run these queries in parallel and then combine the results to find images within the total Hamming Distance.

Results

Canva has this system deployed in production and they store hashes for over 10 billion images in DynamoDB.

They have an average query time of 40 milliseconds, with 95% of the queries being resolved within 60 milliseconds. This is with a peak workload in excess of 2000 queries per second.

They’re able to catch watermarking, quality changes and other minor modifications.

For more details, you can read the full post here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.

Login or Subscribe to participate in polls.

How to make your Multi-Factor Authentication Unphishable

Last month, Uber had a massive data breach where a hacker gained complete access to internal systems including several databases, employee email, backend source code and more.

The attack was caused by the use of phishable user authentication factors, a flaw in how multi-factor authentication (MFA) is usually implemented. Although still more secure than single-factor passwords, MFA flows like SMS passcodes, email magic links, push notifications, etc. can be susceptible to social engineering hacks.

The Web Authentication API (WebAuthn) is a W3C standard that’s meant to solve this vulnerability.

It uses public-key cryptography, where a user submits built-in biometrics like FaceID to create a digital signature using the private key stored on their device. This digital signature is then sent to the web server where it’s verified using the public key.

Adoption of WebAuthn is ramping up quickly with biometrics growing more prevalent (like Apple introducing Passkeys) and the rise of developer-friendly auth platforms (like Stytch).

sponsored

Tech Snippets

  • Pytudes - This is an awesome Github repo by Peter Norvig with short, interesting snippets of Python for things like solving a sudoku board, ranking poker hands, finding scrabble words and more . They’re formatted as Jupyter Notebooks, so each piece of code also has a great explanation delving into the code and design choices. Going through a few of the posts a week is a great way to become a better programmer.

  • Lessons learned at Facebook when shifting from Java to Kotlin - Facebook shifted their Android dev from Java to Kotlin and have now passed over 10 million lines of Kotlin. They published a fantastic blog post going over some of the strengths and weaknesses of the language compared to Java. They also talked about how they did the migration and how they addressed challenges like longer build times and interoperability with Java.

  • An Introduction to Just-In-Time Compilers - Noah Gibbs is a software engineer at Shopify and he works on YJIT, a JIT compiler for Ruby. He wrote a great post giving an introduction to Just-In-Time compilers and how they compare to your traditional Ahead-Of-Time compilers.

  • How Monzo deploys to Production over 100 times a day - Monzo is an online bank based in the UK with nearly 6 million customers. As a startup, it's extremely important for them to iterate quickly and ship fast. To do so, they've optimized their developer workflow for rapid delivery and built a culture of shipping small, reversible changes. This is a great article that delves into how they've done this.

Interview Question

Given an array nums of distinct integers, return all the possible permutations. You can return the answer in any order.

Previous Question

As a reminder, here’s our last question

Given an integer array nums sorted in non-decreasing order, return an array of the squares of each number sorted in non-decreasing order.

Solution

This problem is tricky because we have negative and positive integers in our array. If we just square it and return the result then it won't be sorted.

[-4, -2, 0, 1, 3] squared is [16, 4, 0, 1, 9]

Instead, we can solve this question in O(n) time using the Two Pointer Technique.

The idea is that we'll have two pointers, one at the start of the array (p1) and one at the end (p2).

Then, we'll compare the squares of the numbers at both pointers to see which square is greater.

If p1's square is greater, then we increment p1 otherwise we decrement p2.

When p1 is greater than p2, we'll terminate the loop.

The squares list will be sorted from greatest to least, so we can reverse the list and return it.

Here's the Python 3 code.