How Instacart Built Their Autocomplete System
Hey Everyone!
Today we’ll be talking about
How Instacart Built Their Autocomplete System
Handling misspellings with Levenshtein Edit Distance and checking for semantic duplication with an embeddings-based model
Solving the Cold Start problem using query expansion
Training a Learning-to-Rank model to rank autocomplete suggestions
A Dive Into Load Balancers (Quastor Pro)
Purpose of Load Balancers and the differences versus an API Gateway
L4 vs. L7 Load Balancers
Load Balancing Strategies (Round Robin, Least Connections, Hashing and more)
Get a discount on Quastor Pro here.
Tech Snippets
How Databricks uses Scala
Spellbook of Modern JavaScript Web Dev
Vertical Scaling can be a lot cheaper than you might think
Plus, we have a solution to our last coding interview question on string matching.
Free Training on Managing Time Series Data
One of the biggest trends in data engineering is the explosive growth in time series data.
Enterprises are storing petabytes of data on things like user actions/events, sensor readings, financial trades and more. Nearly every sector is being transformed by the ingestion and analysis of large-scale time series data.
With this, engineers are quickly realizing that relational databases and many standard NoSQL offerings lack the ability to efficiently store and access large amounts of time series data.
If you’d like to learn more about how time series data is stored and accessed, InfluxDB is hosting a free, live training on what time series data is, how the problem domain differs from traditional data workloads and how InfluxDB is differentiated from other time series databases.
sponsored
How Instacart Built their Autocomplete System
Instacart is a delivery platform with more than 10 million users. You can use the app/website to order items from local stores and have an Instacart driver drop them off at your home. Groceries, medicine, clothing and more are sold on the platform.
With such a wide range of products sold, having a high quality autocomplete feature is extremely important. Autocomplete not only saves customers time, but it can also recommend products that a customer didn’t even know he was interested in. This can increase how much the customer spends on the app and boost Instacart’s revenue.
Esther Vasiete is a senior Machine Learning Engineer at Instacart and she wrote a great blog post diving into how they built their autocomplete feature.
Here’s a summary
When a user enters an input in the Instacart search box, this input is referred to as the prefix. Given a prefix, Instacart wants to generate potential suggestions for what the user is thinking of.
For the prefix “ice c”, Instacart might suggest “ice cream”, “ice cream sandwich”, “ice coffee”, etc.
In order to generate search suggestions, Instacart relies on their massive dataset of previous customer searches. Their vocabulary consists of 57,000 words extracted from 11.3 million products and brands. From this, they’ve extracted ~800k distinct autocomplete terms across all retailers.
These terms are loaded into Elasticsearch and Instacart queries it when generating autocomplete suggestions.
Some challenges that Instacart had to deal with around search suggestions were
Handling User Misspellings - A user might accidentally think “Mozzarella” is spelled as “Mozarella”. Therefore, when he types “Mozare”, he won’t see any helpful autocomplete suggestions.
Semantic Deduplication - “Mac and Cheese” and “Macaroni and Cheese” are two phrases for the same thing. If a user types in “Mac” in the search box, recommending both phrases would be a confusing experience. Instacart needs to make sure each phrase recommended in autocomplete refers to a distinct item.
Cold Start Problem - When a customer searches for an item, she searches for it at a specific retailer. She might search for “bananas” at her local grocery store or “basketball” at her local sporting goods store. Therefore, Instacart’s autocomplete gives suggestions based on the specific retailer. This presents a cold start problem when a new retailer joins the platform, since Instacart doesn’t have any past search data to generate autocomplete suggestions from.
We’ll talk about how they solved each of these…
Handling Misspellings
If a customer accidentally thinks "Mozzarella" is spelled as "Mozarella", Instacart’s autocomplete should still suggest the correct term when the user types in "Mozar" into the search box.
To accomplish this, Instacart relies on fuzzy matching with Elasticsearch’s fuzziness parameter. Under the hood, the parameter uses the Levenshtein edit distance algorithm, and you can specify the maximum number of edits a match is allowed to differ by.
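As a rough illustration, here’s what a fuzzy query against an autocomplete index could look like with the Python Elasticsearch client. This is a minimal sketch assuming the 8.x elasticsearch-py client; the index and field names are made up for the example and aren’t Instacart’s actual schema.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fuzzy match the (misspelled) user input against stored autocomplete terms,
# allowing up to 2 character edits (Levenshtein distance).
resp = es.search(
    index="autocomplete_terms",      # illustrative index name
    query={
        "match": {
            "suggestion": {          # illustrative field name
                "query": "mozarella",
                "fuzziness": 2
            }
        }
    },
    size=10,
)

for hit in resp["hits"]["hits"]:
    print(hit["_source"]["suggestion"])
```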
Semantic Deduplication
Many times, the same item will be called multiple things. Macaroni and cheese can also be called Mac and cheese.
The autocomplete suggestion should not display both "Macaroni and Cheese" and "Mac and Cheese" when the user types in “Mac”. This would be confusing to the user.
Therefore, Instacart trained an embeddings-based model that learns the relationship between queries and products. They can use this model to generate a similarity score between two queries by taking the dot product of their embeddings.
Terms like Mayo and Mayonnaise get a very high semantic similarity score, and the same applies to cheese slices and sliced cheese.
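To make the idea concrete, here’s a minimal sketch of scoring two queries with a dot product. The embed function stands in for Instacart’s trained query embedding model (which isn’t public), and the threshold is made up for illustration.

```python
import numpy as np

def is_semantic_duplicate(query_a, query_b, embed, threshold=0.9):
    # embed() is a hypothetical stand-in for the trained query embedding model,
    # assumed to return unit-normalized vectors.
    vec_a = embed(query_a)
    vec_b = embed(query_b)
    # For unit vectors, the dot product equals cosine similarity.
    similarity = float(np.dot(vec_a, vec_b))
    return similarity >= threshold

# e.g. is_semantic_duplicate("mayo", "mayonnaise", embed) would come out True,
# so only one of the two terms would be shown in autocomplete.
```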
Instacart published a paper on how they trained this model, which you can read here.
Cold Start Problem
Instacart’s autocomplete system gives completely different autocomplete suggestions based on the retailer. A grocery store will not get the same autocomplete suggestions as a sporting goods store.
Therefore, if a new retailer joins the Instacart platform, the app has to generate autocomplete suggestions for them from scratch. This creates a cold start problem for new or smaller retailers, where Instacart doesn’t have a good dataset of past searches.
Instacart solves this by using a neural generative language model that can look at the retailer’s product catalog (list of all the items the retailer is selling) and extract potential autocomplete suggestions from that.
This is part of the Query Expansion area of Information Retrieval systems. Instacart uses doc2query to handle this.
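For a flavor of how doc2query works, here’s a sketch that generates candidate queries from a single product description using a publicly available doc2query T5 checkpoint from the castorini project. Instacart’s actual model and catalog pipeline are internal; the checkpoint, product text, and generation settings below are just for illustration.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Public doc2query checkpoint (not Instacart's model).
model_name = "castorini/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# An item from a (hypothetical) new retailer's product catalog.
product = "Low moisture part-skim mozzarella cheese, 16 oz block"
inputs = tokenizer(product, return_tensors="pt")

# Sample a few queries a shopper might plausibly type for this product.
outputs = model.generate(
    inputs.input_ids,
    max_length=32,
    do_sample=True,
    top_k=10,
    num_return_sequences=3,
)

for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```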
Ranking Autocomplete Suggestions
In the paragraphs above, we talked about how Instacart generates possible autocomplete suggestions for a user’s prefix.
However, these also need to be ranked so that the most relevant suggestion is at the top. To do this, they trained a Machine-Learned Ranking model.
This model predicts the probability that an autocomplete suggestion will result in the user adding something to their online shopping cart. The higher the probability of an add-to-cart, the higher that autocomplete suggestion would be ranked.
The model can be broken down into two parts: an autocomplete conversion model and an add to cart model.
The autocomplete conversion model takes in the user’s prefix and then predicts the probability that the user will select a certain autocomplete suggestion. It’s a binary classification task where some examples of features are
Is the suggestion a fuzzy match (slightly different spelling)
How popular is this suggestion
How often is this suggestion clicked on
The add to cart model calculates a conditional probability: given that the user has clicked on a certain autocomplete suggestion, what is the probability that he adds one of the resulting products to his cart? This is another binary classification task where some of the features are
The add to cart rate for that autocomplete suggestion
The rate at which that autocomplete suggestion returns zero or a low number of results
These two models are then combined to generate the probability that a certain autocomplete suggestion will result in the user adding something to her cart.
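In other words, the two probabilities are chained together. Here’s a minimal sketch of that ranking step, with p_conversion and p_add_to_cart as hypothetical stand-ins for the two trained models.

```python
def rank_suggestions(prefix, candidates, p_conversion, p_add_to_cart):
    scored = []
    for suggestion in candidates:
        # P(user selects this suggestion | prefix)
        p_select = p_conversion(prefix, suggestion)
        # P(user adds a product to their cart | they selected this suggestion)
        p_atc = p_add_to_cart(suggestion)
        # Combined: P(select this suggestion AND add to cart | prefix)
        scored.append((suggestion, p_select * p_atc))
    # Highest combined probability ranks first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```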
The combined model is used to rank autocomplete suggestion terms. It resulted in a 2.7% increase in autocomplete engagement and an increase in gross transaction value per user.
For more details, you can read the full blog post here.
Tech Dive - Load Balancers (Quastor Pro)
In this tech dive, we delve into load balancers and talk about the purpose they serve.
We go through layer 4 and layer 7 load balancers and talk about the pros/cons and differences between each.
We also talk about load balancing strategies like Round Robin, Least Connections, Hashing, Consistent Hashing and more. We talk about the strengths and weaknesses of each.
This is part of Quastor Pro, a section with explainers on the backend concepts that we talk about in our Quastor summaries.
You can get 25% off below.
I'd highly recommend using your Learning & Development budget for Quastor Pro.
Tech Snippets
Scala at Scale at Databricks - Databricks is a software company that makes data engineering easier. It was founded by the creators of Apache Spark. Databricks is one of the largest Scala shops around, with millions of lines of Scala being used in production. This post goes into why they picked Scala, the tooling they use, issues they’ve come across and more.
Spellbook of Modern JavaScript Web Dev - This is an awesome GitHub repo that aggregates a ton of resources on web dev. It has resources on JavaScript and Node, Web Assembly, Web APIs, CSS and PostCSS, Web Design and some resources on Server Side programming. The resources are structured quite well, so it’s a good reference if you’re a web developer.
Use One Big Server - We’ve done a lot of articles on “how company XYZ broke up their monolith”, but a microservices architecture is usually way too complicated/unnecessary for the vast majority of teams. Nima Badizadegan was a software engineer at Google (working on Colossus, their internal distributed file system) and he wrote a great blog post on how vertical scaling can actually be a lot cheaper than you might think. He talks about the most powerful options on various hosting providers, how much they cost and goes through some benchmarks.
Previous Question
As a reminder, here’s our last question
You are given a list of strings called words and a string pattern.
Return a list of the strings inside words that match the pattern.
A word matches the pattern if the letters in the word can be mapped one-to-one to characters in the pattern.
Solution
This question is a bit of an extension of another common interview question: checking whether two strings are isomorphic.
An isomorphism is a one-to-one mapping (bijection) between two structures that preserves their structure; the term comes up most often with graphs.
In this question, the structures are strings of characters.
So, we need to find out whether a one-to-one mapping can be formed between the characters of the two strings.
For an example, let’s look at the strings “abb” and “mee”.
We can form a one-to-one mapping with a maps to m and b maps to e.
For “apple” and “zaapy”, we can form a one-to-one mapping with a maps to z, p maps to a, l maps to p and e maps to y.
Each character in the first string maps to only one character in the second string and vice-versa.
How can we calculate if two strings are isomorphic in O(n) time?
One way is to just iterate through both strings and try to create a mapping from the first string to the second and from the second string to the first.
If creating the mapping fails at any point, then you return False.
Here’s a Python 3 sketch implementing this idea (the helper name is_isomorphic is our own).
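```python
def is_isomorphic(s, t):
    if len(s) != len(t):
        return False
    # Build character mappings in both directions as we scan the strings.
    s_to_t, t_to_s = {}, {}
    for a, b in zip(s, t):
        # If either mapping already exists and disagrees, the strings
        # can't be isomorphic.
        if s_to_t.get(a, b) != b or t_to_s.get(b, a) != a:
            return False
        s_to_t[a] = b
        t_to_s[b] = a
    return True
```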
Now, to answer the original question, we can iterate through the list of words and check each word to see if it’s isomorphic with our pattern.
If it is, we add it to an isomorphicWords array and then return that array after the loop terminates.
Here’s a Python 3 sketch of that (the function name find_matching_words is our own).
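```python
def find_matching_words(words, pattern):
    isomorphicWords = []
    for word in words:
        if is_isomorphic(word, pattern):
            isomorphicWords.append(word)
    return isomorphicWords

# Example: only "mee" and "aqq" can be mapped one-to-one onto "abb".
print(find_matching_words(["abc", "deq", "mee", "aqq"], "abb"))  # ['mee', 'aqq']
```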