How Dropbox implemented their Image Search feature

Using CNNs, Word Embeddings and an Inverted Index to build Image Search

Hey Everyone!

Today we'll be talking about

  • How Image Search works at Dropbox

    • Using Convolutional Neural Networks with EfficientNet

    • Creating Category Space Vectors to encode information about what an image contains

    • Employing Word Embeddings that support multiple languages

    • Using an Inverted Index data structure and a Forward Index data structure

  • Row vs. Column-Oriented Databases

    • OLTP vs OLAP Workloads

    • Storing Data in a Row vs. Column Format and How They Differ on Disk

    • Performance Trade-Offs Between Row vs. Column

    • Serialization Formats Like CSV, JSON, XML, Avro, Parquet and More

  • Tech Snippets

    • Tips on Writing Good Documentation

    • A Collection of Debugging Stories

    • Bringing HDR Video to Instagram Reels

    • Finding the Right Company to Reach Staff Engineer at

Boost your Analytical Thinking with Byte-Sized Math and CS Lessons

Brilliant is the best way to level up fast in CS, math, data science and more. They have thousands of interactive lessons that get you hands-on with everything from linear algebra and logic to neural networks, algorithmic thinking and beyond.

They just introduced a ton of new content in their data science track with courses on statistics, data visualization and analysis. You’ll work with actual data sets in these courses and get hands-on experience with real world tasks.

That’s why Brilliant is used by over 10 million developers, data scientists, researchers and lifelong learners.

With the link below, you can get a 30-day free trial to check it out. You’ll also get a 20% discount when you subscribe.

sponsored

How Image Search works at Dropbox

Dropbox is a file hosting and sharing company with over 700 million registered users.

One of the most common types of files uploaded to Dropbox is photos. As I’m sure you’re aware, searching for a specific photo is a huge pain.

You probably have your photos named “IMG_20220323.png” or something. Finding the photo from that time you were at the beach would require quite a bit of scrolling.

Instead, Dropbox wanted to ship a feature where you could type picnic or golden retriever and have pictures related to that term show up.

Prior to the last decade, building a feature like this would’ve been a massive task. Taking a photo and recognizing what items were in it relied on hand-engineered features and patterns.

You’d write rules based on the color, size, texture and relation between different parts of the image or you might use techniques like decision trees or support vector machines (these still rely on hand-crafted features).

However, with the huge amount of research in Deep Learning over the last decade, building a feature like this has become something you can do quite quickly with a couple of engineers.

Thomas Berg is a machine learning engineer at Dropbox and he wrote an awesome blog post on how the team built their image search feature.

Here’s a summary with some added context.

Tech Dropbox Used

Here’s some of the important concepts/tech for how Dropbox built their image search feature.

  • Convolutional Neural Network (CNN) - A class of neural network models that perform incredibly well at processing and recognizing objects in images. To learn more, I’d highly recommend checking out the fastai book (it’s free).

  • EfficientNet - At Dropbox, they used a convolutional neural network from EfficientNet, a family of CNN models that are optimized for fast inference without significantly sacrificing accuracy. These models also perform really well on image classification transfer learning tasks, so they can be adapted to your specific use case. Etsy also talked about how they use EfficientNet for their image search feature.

  • Word Vectors - a technique used in machine learning to represent words as multi-dimensional vectors. Words with similar meanings will be assigned similar vectors; dog, puppy, pitbull, perro (Spanish), kutta (Hindi) will all be close in vector space (measured by cosine similarity).

    I’d highly suggest watching 3Blue1Brown’s playlist on linear algebra if you need a refresher (or a fun way to spend your Sunday night).
    Dropbox uses ConceptNet Numberbatch, a pre-computed set of word embeddings, to transform a user’s search query into a word vector. These word vectors are then used to find the closest images. ConceptNet Numberbatch is multilingual, so users across the world can use the image search feature.

  • Category Space Vectors - These are vectors in “category space”. A category space is a vector space where the dimensions represent different things that can be in the image (these are called categories).
    You can have a category (dimension) for dogs, cars, ice cream, etc.
    Dropbox’s CNN model identifies which categories are in the image and generates the category space vector for that image. The vector’s magnitude along each dimension represents how significant that category is in the image. Dropbox uses several thousand categories/dimensions for their category space.

  • Inverted Index - A data structure that’s commonly used to build full-text search systems. It’s very similar to the index section at the back of a textbook. You have all the words in alphabetic order and you also have the page numbers where those words appear. This allows you to quickly find all the passages in the book where the word appears.
    Dropbox maintains an inverted index that keeps track of all the different categories (potential objects). For each category, the inverted index has a list of all the images that contain that category.

  • Forward Index - A data structure that is the opposite of an inverted index. For each document, you keep a list of all the words that appear in that document.
    Dropbox has a forward index that keeps track of all the images. For each image, the forward index stores the category space vector, which encodes which categories are in the image.
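Taken together, these building blocks are easy to sketch in code. Here’s a minimal toy illustration (all vectors, category names, and image IDs are made up, and the 3-dimensional spaces stand in for Numberbatch’s 300-dimensional embeddings and Dropbox’s several-thousand-dimensional category space):

```python
import numpy as np
from collections import defaultdict

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 = same direction, ~0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy word vectors: similar meanings get similar vectors.
dog = np.array([0.9, 0.8, 0.1])
puppy = np.array([0.85, 0.75, 0.2])
car = np.array([0.1, 0.2, 0.95])
assert cosine_similarity(dog, puppy) > cosine_similarity(dog, car)

# Toy category space: 3 dimensions instead of several thousand.
CATEGORIES = ["dog", "beach", "picnic"]

# Forward index: image id -> its category space vector (from the CNN).
forward_index = {
    "img_001.png": np.array([0.0, 0.9, 0.7]),  # beach + picnic
    "img_002.png": np.array([0.8, 0.6, 0.0]),  # dog + beach
}

# Inverted index: category -> images that scored highly for it.
inverted_index = defaultdict(set)
for image_id, vec in forward_index.items():
    for i, score in enumerate(vec):
        if score > 0.5:  # only index significant categories
            inverted_index[CATEGORIES[i]].add(image_id)

print(sorted(inverted_index["beach"]))  # ['img_001.png', 'img_002.png']
```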

Image Search Architecture

Here’s what happens when you search for an image on the Dropbox website.

  1. Dropbox takes your query and uses ConceptNet Numberbatch (a set of semantic vectors/word embeddings) to turn it into a word vector. Remember that this vector captures the meaning of your query, so dog, golden retriever, puppy, perro (Spanish), kutta (Hindi), etc. will all be translated to similar vectors.

  2. This word vector is then projected into Dropbox’s category space using matrix multiplication. This converts the semantic meaning of your search query into a representation that matches the category space vectors of the images. This way, we can compare the projected vector with the image’s category space vectors. Dropbox measures how close two vectors are using cosine similarity.
    Read the Dropbox article for more details on how they do this. If you want to learn about projection matrices, then again, I’d highly recommend the 3Blue1Brown series.

  3. Dropbox has an inverted index that keeps track of all the categories and which images rank high in each category. They match the categories in your query against the inverted index to find all the images that contain items from your query.

  4. Now, Dropbox has to rank these images by how well they match your query. For this, they use the forward index to get the category space vector for each matching image. They measure how close each of these vectors is to your query’s category space vector to get a similarity score.

  5. Finally, they return the images with a similarity score above a certain threshold. They also rank the images by that similarity score.
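The five steps above can be sketched end-to-end. This is a toy illustration, not Dropbox’s actual code: the projection matrix, threshold, vectors and image IDs are all made-up stand-ins (the real projection maps 300-dimensional embeddings into a several-thousand-dimensional category space):

```python
import numpy as np

CATEGORIES = ["dog", "beach", "picnic"]

# Step 2's projection from word-embedding space into category space.
# A made-up identity matrix here; the real one is derived from the
# category names' own embeddings.
projection = np.eye(3)

# Toy indexes from the previous section's data structures.
inverted_index = {"beach": {"img_001.png", "img_002.png"}, "picnic": {"img_001.png"}}
forward_index = {
    "img_001.png": np.array([0.0, 0.9, 0.7]),
    "img_002.png": np.array([0.8, 0.6, 0.0]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_embedding, threshold=0.5):
    # Step 2: project the query's word vector into category space.
    query_cat = projection @ query_embedding
    # Step 3: pull candidate images from the inverted index for each
    # category the query activates.
    candidates = set()
    for i, score in enumerate(query_cat):
        if score > 0:
            candidates |= inverted_index.get(CATEGORIES[i], set())
    # Step 4: score each candidate's category space vector (from the
    # forward index) against the projected query vector.
    scored = [(img, cosine(query_cat, forward_index[img])) for img in candidates]
    # Step 5: keep images above the threshold, ranked by similarity.
    return [img for img, s in sorted(scored, key=lambda t: -t[1]) if s >= threshold]

print(search(np.array([0.0, 1.0, 0.2])))  # a "beach"-leaning query
```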

Dropbox’s forward index and inverted index are both part of Nautilus, their in-house search engine. You can read about the architecture of Nautilus here.

For more details about scaling issues Dropbox faced with their Image Search system, read the full article.

How did you like this summary?

Your feedback really helps me improve curation for future emails.



Tech Snippets

This is part of a past article from Quastor Pro. You can check out our full archive of content with this 1 week free trial.

Row vs. Column Oriented Databases

Over the last 40 years, databases have traditionally been row-oriented (you can think of DBMSs like Postgres or MySQL). This has worked extremely well for your traditional Online Transaction Processing (OLTP) workloads, where you might be keeping track of basic customer information (name, age, address, etc.), e-commerce orders, warehouse inventory, etc.

OLTP workloads have small, online transactions where you have to perform frequent CRUD operations on different parts of the data.

However, over the last 20 years, we’ve seen a new type of workload. Companies are storing massive datasets tracking things like sensor data, financial trades, user actions, etc. These databases must be able to ingest massive amounts of data quickly and perform large computations efficiently.

This is known as an Online Analytical Processing (OLAP) workload, and there’s been a ton of research in this area over the last two decades. To solve this problem, you’ve seen the rise of cloud data warehouses like AWS Redshift and Snowflake. You’ve also seen databases like Apache Pinot, Apache Druid, InfluxDB and more.

One key difference between databases optimized for OLTP vs. OLAP workloads is how they store data on disk. Databases designed for OLTP workloads traditionally use a row-oriented layout, while OLAP databases tend to favor a column-oriented one.

AWS Redshift, Google BigQuery, Snowflake, Pinot, Druid and InfluxDB are a few examples of column-oriented databases.

We’ll go through both paradigms and talk about the tradeoffs.
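Before diving in, here’s a quick sketch of the difference. The table and layouts below are illustrative, not any particular engine’s on-disk format:

```python
# One table of (id, name, age) records.
rows = [
    (1, "Ada", 37),
    (2, "Bob", 29),
    (3, "Cleo", 44),
]

# Row-oriented layout: each record's fields sit together on disk,
# which makes reading or updating a single record cheap (OLTP).
row_layout = [field for record in rows for field in record]

# Column-oriented layout: each column's values sit together, which
# makes scanning one column cheap (OLAP) and compresses well since
# all values in a column share a type.
column_layout = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "age": [r[2] for r in rows],
}

# An analytical query like AVG(age) touches only one column here:
avg_age = sum(column_layout["age"]) / len(column_layout["age"])
print(avg_age)  # ~36.67
```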

Quastor Pro

In addition to our emails, you can also get weekly articles on system design and technical dives on cool tech!

Past articles include system design articles, tech dives and database concepts.

It’s $12 per month and I’d highly recommend using your job’s Learning & Development stipend to pay for it!