How Dropbox implemented their Image Search feature

Using CNNs, Word Embeddings and an Inverted Index to build Image Search

Hey Everyone!

Today we'll be talking about

  • How Image Search works at Dropbox

    • Using Convolutional Neural Networks with the EfficientNet network

    • Employing Word Embeddings that support multiple languages

    • Using the Inverted Index data structure to optimize searches

  • Tech Snippets

    • A collection of Post-Mortems published by companies on why their systems went down

    • Good hacks to use when giving an engineering estimate

    • Which Programming Paradigm Gives the Most Expressive Code?


How Image Search works at Dropbox

Dropbox is a file hosting and sharing company with over 700 million registered users.

Photos are among the most common types of files uploaded to Dropbox. The Dropbox app allows users to set up camera sync so that any photo they take on their smart phone will automatically get synced and stored in their Dropbox account.

To make it easier for people to find their photos, Dropbox built an image search feature where you can search for objects/scenery/action and Dropbox will find images that contain what you searched for.

For example, if you search for picnic, then Dropbox will find your images that contain a picnic.

Thomas Berg is a Machine Learning Engineer at Dropbox and he wrote a great blog post summarizing how this feature works.

Here's a summary

Dropbox’s Approach

To build this, Dropbox relies on two areas of machine learning

  • Image Classification

  • Word Vectors

Image Classification

An image classifier takes in the pixel values of an image and outputs a list of things that the image contains (where each of these things are categories that the classifier is trained to recognize).

There has been tremendous progress over the last 10 years in image classification with the innovations in deep learning, specifically convolutional neural networks. Model architecture improvements, better training methods, large datasets, faster GPUs and more have resulted in image classifiers that can recognize thousands of different categories with extremely high accuracy.

For Dropbox’s image search, their image classifier is an EfficientNet network trained on the OpenImages dataset. This produces classification scores for ~8500 categories ranging from grapes to telephone to picnic and much more.

However, an issue that comes up with image classification is synonyms. What if a user searches for seashore but the image classifier is trained on the term beach?

To solve this, Dropbox used word vectors.

Word Vectors

Word vectors are an extremely important technique in natural language processing that really took off in 2013 with Word2vec.

The idea is that you have a vector space with hundreds of dimensions (your standard X-Y cartesian coordinate system could be viewed as an example of a 2 dimensional vector space).

Then, you use a neural network to map every word to a vector in the vector space. For each word, the neural network will assign a number for each of the hundreds of different dimensions.

You train the neural network so that words with similar meanings will be assigned vectors that are close to each other in vector space. Here's a great article that dives deeper into Word2vec.

Dropbox used the ConceptNet Numberbatch pre-computed word embeddings. This gave them good results in their testing and the word embeddings also support multiple languages, so they return close vectors for words in different languages with similar meanings. This makes supporting image search in multiple languages much easier.

If there’s a multi-word image search, Dropbox parses the query as an AND of the individual words. They also maintain a list of multi-word terms like beach ball that they can match for.

Production Architecture

When a user submits a search query, it’s obviously not possible to immediately run image classification on all of their images. Users can have tens of thousands of images in their dropbox account, so that solution would be way too slow.

Instead, Dropbox uses an Inverted Index data structure (also used by many Full-Text search engines like Elasticsearch). You can think of an Inverted Index as very similar to the Index section at the back of the textbook where it contains a list of all the words in the book and the corresponding page numbers where each word is mentioned. An Inverted Index contains a mapping from all the unique words/phrases in the text to their locations in the documents.

Dropbox will scan through all the images in a user’s account and run their image classification algorithm to find all the categories (things) that appear in that image. They convert whatever categories are found into the corresponding word embedding vectors.

Then, they create an Inverted Index where for each category, they have a list of images that contain that thing.

When a user searches for a word, Dropbox will first find the word vector for that term.

Then, they'll find the closest word vectors that are categories for the image classifier.

They'll query the inverted index to find the matching images for these categories and then rank the matching images based on how strong the classifier ranked each category in the image.

For more details and some additional optimizations Dropbox made, you can view the full article here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.

Login or Subscribe to participate in polls.

Tech Snippets

  • A Collection of Post Mortems- This is a really cool list of post mortem blog posts where companies talk about why their systems went down. Learning from other people’s mistakes is far cheaper than learning from your own, so it’s great to read some of these to understand all the different things that can go wrong.

  • Hacks for giving engineering estimates - Shubhro Saha is an Engineering Manager at Facebook. He wrote a great blog post on how you can give better estimates on how long a project will take to complete.One great tip is to always clearly separate the estimate vs. the goal vs. the plan. The goal may be the most optimistic while the plan is the most pessimistic (it accounts for any hiccups that may be encountered).

  • Which Programming Paradigm Gives the Most Expressive Code? - Jonathan Boccara is a Principal Engineer at Doctolib (the largest e-health company in Europe). He wrote a very interesting blog post talking about trends in programming paradigms. He discusses the increasing popularity of functional programming and the declarative paradigm trend.

Interview Question

Given a string s, return the longest palindromic substring in s. A palindromic substring is a substring that is a palindrome.

We’ll send a detailed solution in our next email, so make sure you move our emails to primary, so you don’t miss them! Unfortunately, Gmail filters our emails into "Promotions"

Gmail users—move us to your primary inbox

  • On your phone? Hit the 3 dots at the top right corner, click "Move to" then "Primary"

  • On desktop? Back out of this email then drag and drop this email into the "Primary" tab near the top left of your screen

Apple mail users—tap on our email address at the top of this email (next to "From:" on mobile) and click “Add to VIPs”