How DoorDash labels Millions of Items with Large Language Models

How DoorDash uses GPT-4, RAG and LLM agents for labeling data. Plus, why you don't always need indexes, bloom filters explained visually and more.

May 03, 2024

Hey Everyone!

Today we’ll be talking about

How DoorDash uses LLMs for labeling their products
- Issues DoorDash faces with Data Labeling
- Solving the “Cold Start problem” with LLMs
- Labeling Organic Products with LLM agents
- Solving Entity Resolution with Retrieval Augmented Generation
Tech Snippets
- Become a better communicator by focusing on the “kernel”
- You don’t always need indexes
- Why working quickly is more important than it seems
- Bloom filters explained visually

How DoorDash uses LLMs for Data Labeling

Currently, Large Language Models have a ton of “FOMO” (fear of missing out) around them, so it can be tempting to dismiss them as another hype-train that’ll fizzle out.

There are definitely some aspects that are over-hyped, but it’s also important to note that LLMs have fundamentally changed how we’re building ML models and they’re now being used in NLP tasks far beyond ChatGPT.

DoorDash is the largest food delivery service in the US and they let you order items from restaurants, convenience stores, grocery stores and more.

Their engineering team published a fantastic blog post about how they use GPT-4 to generate labels and attributes for all the different items they have to list.

We’ll talk about the architecture of their system and how they use OpenAI embeddings, GPT-4, LLM agents, retrieval augmented generation and more to power their data labeling (we’ll define all of these terms).

We’ll cover a ton of concepts on LLMs in this article.

Data Labeling at DoorDash

As we mentioned earlier, DoorDash doesn’t just deliver food from restaurants. They also deliver groceries, medical items, beauty products, alcohol and much more.

With each of these items, the app needs to track specific attributes in order to properly identify the product.

For example, a can of Coke will have attributes like

Size: 12 fluid ounces
Flavor: Cherry
Type: Diet

On the other hand, a bottle of shampoo will have attributes like

Brand: Dove
Type: Shampoo
Keyword: Anti-Dandruff
Size: 500 ml

Every product has different attribute/value pairs based on what the item is. DoorDash needs to generate and maintain these for millions of items across their app.
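To make the idea of per-item attribute/value pairs concrete, here is a minimal Python sketch. The `Product` class, its field names and the two product names are hypothetical and only illustrate the shape of the data; this is not DoorDash's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Product:
    """Hypothetical catalog item; not DoorDash's real data model."""
    name: str
    description: str = ""
    # Attribute names differ per product type, so a free-form mapping is used.
    attributes: Dict[str, str] = field(default_factory=dict)


coke = Product(
    name="Diet Cherry Coke, 12 fl oz can",  # made-up listing name
    attributes={"Size": "12 fluid ounces", "Flavor": "Cherry", "Type": "Diet"},
)
shampoo = Product(
    name="Dove Anti-Dandruff Shampoo, 500 ml",  # made-up listing name
    attributes={
        "Brand": "Dove",
        "Type": "Shampoo",
        "Keyword": "Anti-Dandruff",
        "Size": "500 ml",
    },
)


def shared_attribute_names(a: Product, b: Product) -> List[str]:
    """Return the attribute names that both products define, sorted."""
    return sorted(set(a.attributes) & set(b.attributes))
```

Here the two items only share the `Size` and `Type` attribute names, which is why a fixed set of columns doesn't work well for a catalog like this.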

To do this, they need an intelligent, robust ML system that can handle creation and maintenance of these attribute/value pairs based on a product’s name and description.

We’ll talk about 3 specific problems DoorDash faced when building this system and how LLMs have helped them address the issues.

- Cold Start Problem
- Labeling Organic Products
- Entity Resolution

Solving the Cold Start Problem with LLMs

One big issue DoorDash faced with building this attribute/value creation system was the cold start problem (a classic issue with ML systems).

This happens when DoorDash onboards a new merchant and there are a bunch of new items that they’ve never seen before.

For example, what would DoorDash do if Costco joined the platform?

Costco sells a bunch of their own products (under the Kirkland brand), so a traditional NLP system wouldn’t recognize any of the Kirkland branded items (they weren’t in the training set).

However, Large Language Models are already trained on vast amounts of data. GPT-4 has knowledge of Costco’s products and even understands memes about Costco’s Rotisserie chicken.

With this base knowledge, LLMs can perform extremely well without requiring any labeled examples (zero-shot prompting) or with just a few labeled examples included in the prompt (few-shot prompting).
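Here is a rough sketch of the difference between the two prompting styles for a brand-extraction task. The prompt wording and the function names are assumptions made up for illustration; the source doesn't show DoorDash's actual prompts. The functions only build the prompt text and don't call any model.

```python
from typing import List, Tuple


def zero_shot_prompt(name: str, description: str) -> str:
    """Build a prompt that contains instructions only, with no labeled examples."""
    return (
        "Identify the brand of the following product. "
        "Reply with the brand name only.\n"
        f"Product name: {name}\n"
        f"Description: {description}\n"
        "Brand:"
    )


def few_shot_prompt(
    name: str, description: str, examples: List[Tuple[str, str, str]]
) -> str:
    """Build a prompt that shows a few (name, description, brand) examples first."""
    parts = ["Identify the brand of each product. Reply with the brand name only.\n"]
    for ex_name, ex_desc, ex_brand in examples:
        parts.append(
            f"Product name: {ex_name}\nDescription: {ex_desc}\nBrand: {ex_brand}\n"
        )
    # The item to label goes last, with the answer left blank for the model.
    parts.append(f"Product name: {name}\nDescription: {description}\nBrand:")
    return "\n".join(parts)
```

The only difference is whether labeled examples are placed in front of the item you want labeled.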

Here’s the process DoorDash uses for dealing with the Cold Start problem:

1. Traditional Techniques - The product name and description are passed to DoorDash’s in-house classifier. This is built with traditional NLP techniques for Named Entity Recognition.
2. Use LLM for Brand Recognition - Items that cannot be tagged confidently are passed to an LLM. The LLM takes in the item’s name and description and is tasked with figuring out the brand of the item. For the shampoo bottle example, the LLM would return Dove.
3. RAG to find Overlapping Products - DoorDash takes the brand name and product name/description and queries an internal knowledge graph to find similar items. The brand name, product name/description and the results of this knowledge graph query are then given to an LLM (retrieval augmented generation). The LLM’s task is to decide whether the product being analyzed is a duplicate of any of the products found in the internal knowledge graph.
4. Adding to the Knowledge Base - If the LLM determines that the product is unique, it enters the DoorDash knowledge graph. Their in-house classifier (from step 1) is re-trained with the new product attribute/value pairs.
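The flow above can be sketched in a few lines of Python. Everything here is a simplification under stated assumptions: `classify` and `ask_llm` are hypothetical callables you would supply, the 0.8 confidence cutoff is made up, the knowledge graph is reduced to a plain dict of brand to product names, and the re-training step is left out.

```python
from typing import Callable, Dict, List, Tuple

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cutoff, not a value from the source


def tag_new_item(
    name: str,
    description: str,
    classify: Callable[[str, str], Tuple[str, float]],  # returns (brand, confidence)
    ask_llm: Callable[[str], str],                      # returns the model's reply
    knowledge_graph: Dict[str, List[str]],              # brand -> known product names
) -> Dict[str, object]:
    # Step 1: try the in-house classifier first.
    brand, confidence = classify(name, description)

    # Step 2: fall back to the LLM when the classifier is not confident.
    if confidence < CONFIDENCE_THRESHOLD:
        brand = ask_llm(
            "What brand makes this product? Reply with the brand only.\n"
            f"Name: {name}\nDescription: {description}"
        ).strip()

    # Step 3: retrieve known products for the brand and ask about duplicates.
    candidates = knowledge_graph.get(brand, [])
    is_duplicate = False
    if candidates:
        listing = "\n".join(f"- {c}" for c in candidates)
        answer = ask_llm(
            "Is the new product a duplicate of any known product? "
            "Answer yes or no.\n"
            f"New product: {name} ({description})\n"
            f"Known products for brand {brand}:\n{listing}"
        )
        is_duplicate = answer.strip().lower().startswith("yes")

    # Step 4: unique products are added to the knowledge base.
    if not is_duplicate:
        knowledge_graph.setdefault(brand, []).append(name)
    return {"brand": brand, "duplicate": is_duplicate}
```

Passing the model in as a plain callable keeps the sketch testable with a fake LLM and makes it easy to swap in a real client later.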

Organic Product Labeling with LLM agents

Another issue that DoorDash needs to solve is properly labeling organic products. One of their goals was to create a “Fresh & Organic” section in the app for customers who prefer those types of products.

Here are the steps DoorDash uses to figure out if a product is organic:

1. String Matching - Look for the keyword “organic” in the product name/description. However, the product names/descriptions aren’t perfect and organic could be misspelled or it could go under a different name (“natural”, “non-GMO”, “hormone-free”, “unprocessed”, etc.). This is where LLMs come into play.
2. LLM Reasoning - DoorDash uses LLMs to read the available product information and determine whether the product could be organic. This has massively improved coverage and addressed the challenges faced with only doing string matching.
3. LLM Agent - LLMs also conduct online searches for product information and send the search results to another LLM for reasoning. Having LLMs use external tools (like web search) and make their own decisions along the way is what’s referred to as “agent-powered LLMs”. I’d highly recommend checking out LlamaIndex to learn more about this.
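Here is a small sketch of the first two steps: cheap string matching first, then an LLM only when the keyword isn't found. The fuzzy matching with Python's `difflib` (to catch misspellings), the 0.8 cutoff, the prompt wording and the `ask_llm` callable are all assumptions for illustration. The web-search agent step is not shown.

```python
import difflib
import re
from typing import Callable


def mentions_organic(text: str, cutoff: float = 0.8) -> bool:
    """True if any word in the text is 'organic' or a close misspelling of it."""
    words = re.findall(r"[a-z]+", text.lower())
    return bool(difflib.get_close_matches("organic", words, n=1, cutoff=cutoff))


def is_organic(name: str, description: str, ask_llm: Callable[[str], str]) -> bool:
    """String matching first; ask the (hypothetical) LLM only when that fails."""
    text = f"{name} {description}"
    if mentions_organic(text):
        return True
    answer = ask_llm(
        "Based on the product information below, is this product likely "
        "to be organic? Answer yes or no.\n"
        f"Name: {name}\nDescription: {description}"
    )
    return answer.strip().lower().startswith("yes")
```

Running the cheap check first means the LLM is only called for items the keyword match can't settle.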

Solving the Entity Resolution Problem with Retrieval Augmented Generation

Entity Resolution is where you take the product name/description of two items and figure out whether they’re referring to the same thing.

For example, does “Corona Extra Mexican Lager (12 oz x 12 ct)” refer to the same product as “Corona Extra Mexican Lager Beer Bottles, 12 pk, 12 fl oz”?

In order to accomplish this, DoorDash uses LLMs and Retrieval Augmented Generation (RAG).

RAG is a commonly used way to use language models like GPT-4 on your own data.

With RAG, you first take your input prompt and use it to query an external data source (a popular choice is a vector database) for relevant context/documents. You then add the retrieved context/documents to your input prompt and feed the combined prompt to the LLM.

Adding this context from your own dataset helps personalize the LLM’s results to your own use case.
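As a minimal sketch of the retrieve-then-augment pattern, here is a toy version in Python. The retriever uses simple word overlap as a stand-in for a real vector database, and the function names and prompt template are made up for illustration.

```python
from typing import List


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents sharing the most words with the query.

    Word overlap is a toy stand-in for a vector database lookup.
    """
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_augmented_prompt(query: str, documents: List[str], k: int = 2) -> str:
    """Prepend the retrieved context to the original question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Use the context below to answer the question.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

The string returned by `build_augmented_prompt` is what you would send to the LLM in place of the bare question.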

Here’s how DoorDash does this for Entity Resolution.

They’ll take a product name/description and run it through this process:

1. Generate Embeddings Vector - A common way to compare strings is to use an embedding model to turn each string into a vector (a list of numbers). This vector encodes meaning and knowledge about the original text, so strings like “queen” and “beyonce” will map to vectors that are “similar” in certain dimensions. 3Blue1Brown has an amazing video delving into word embeddings in a visual way. DoorDash uses OpenAI Embeddings to do this with the product’s name.
2. Query the Vector Database - Once they generate a vector from the product name, they query a vector database that stores the embedding vectors of all the other product names in DoorDash’s app. Then, they use approximate nearest neighbors search to retrieve the most similar products.
3. Pass Augmented Prompt to GPT-4 - They take the most similar product names and feed them to GPT-4. GPT-4 is instructed to read the product names and figure out if they’re referring to the same underlying product.
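The three steps can be sketched end to end with toy components. To be clear about the assumptions: `embed` below counts character trigrams as a stand-in for a real embedding model (DoorDash uses OpenAI Embeddings), `nearest` does an exact brute-force search instead of approximate nearest neighbors against a vector database, and the prompt wording is made up.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: counts of character trigrams (stand-in for a real model)."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query: str, catalog: List[str], k: int = 2) -> List[str]:
    """Exact brute-force search; a vector database would use approximate search."""
    q = embed(query)
    return sorted(catalog, key=lambda name: cosine(q, embed(name)), reverse=True)[:k]


def entity_resolution_prompt(query: str, catalog: List[str], k: int = 2) -> str:
    """Build the augmented prompt that would be sent to the LLM."""
    candidates = "\n".join(f"- {name}" for name in nearest(query, catalog, k))
    return (
        "Does the new product refer to the same underlying product as any "
        "of the candidates? Answer with the matching candidate or 'none'.\n"
        f"New product: {query}\n"
        f"Candidates:\n{candidates}"
    )
```

With a catalog containing the Corona listing from the example above plus a few unrelated items, `nearest` puts the Corona listing first for the other Corona name, and that is the candidate the LLM gets asked about.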

With this approach, DoorDash has been able to generate annotations in less than ten percent of the time it previously took them.

Tech Snippets

Bloom Filters

Bloom Filters are a probabilistic data structure with an API that’s similar to a Set. They allow you to add items and check for their presence. The catch is that a presence check can return a false positive (but never a false negative).

They’re widely used in the industry and help you check for the presence of an item when you have a very large number of items (they’re super space-efficient).
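If you want a feel for how this works, here is a minimal Bloom filter sketch in Python. The sizes (1024 bits, 3 hash functions) and the choice of seeded SHA-256 for hashing are arbitrary picks for illustration, not tuned values.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # "Possibly present" only if every one of the item's bits is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bf = BloomFilter()
bf.add("coke")
print("coke" in bf)  # an added item is always reported as present
```

Note that the filter never stores the items themselves, only bits, which is where the space savings come from. It also means a lookup for an item that was never added can occasionally come back as present.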

This is a fantastic blog post that delves into bloom filters and explains them in a visual way.

samwho.dev/bloom-filters

Become a better communicator by focusing on the “Kernel”

This is a fantastic blog post by Will Larson on how to improve your communication by “extracting the kernel”.

You should avoid communicating too literally and instead try to understand the underlying message.

If an executive suggests using ChatGPT instead of building a custom model, then getting into a technology choice debate might be the wrong move. Instead, the executive’s real concern might be that the timeline is too slow.

You should focus on understanding the insight/perspective within a question and focus on answering that.

lethain.com/extract-the-kernel

The Importance of Working Quickly

This is a terrific article that talks about why working quickly can be more important than it seems.

When you’re able to get things done efficiently, then the perceived “cost” of starting new tasks decreases and it encourages you to take on more challenges.

The speed of your response also affects the behavior of others. If you respond to emails promptly, that encourages more communication. If you have a fast website, then users are more likely to engage.

Being very slow can have the opposite effect, where it makes things seem overly daunting and discourages you from starting.

jsomers.net/blog/speed-matters

You Don't Always Need Indexes

When you have a large dataset, indexing is a common approach to support quick searches on the data. However, there are cases where full scans can be a better engineering choice.

This is an interesting blog post that delves into some examples where full scans are preferable.

When trying to search, a good approach is to start with simple scans and then add indexing if acceptable performance can’t be achieved.
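As a toy illustration of the trade-off (this example is not from the linked post), here is a plain scan next to a simple word index in Python. The function names and sample data are made up.

```python
from typing import Dict, List


def scan(products: List[str], term: str) -> List[str]:
    """Full scan: check every product; supports arbitrary substring matches."""
    term = term.lower()
    return [p for p in products if term in p.lower()]


def build_index(products: List[str]) -> Dict[str, List[str]]:
    """Word index: fast exact-word lookups, but it must be built and kept in sync."""
    index: Dict[str, List[str]] = {}
    for p in products:
        for word in p.lower().split():
            index.setdefault(word, []).append(p)
    return index


products = ["Dove Shampoo", "Diet Coke", "Corona Extra"]
index = build_index(products)
```

The scan needs no setup and handles a partial term like "sham", while the index only answers whole-word lookups such as `index["coke"]` and has to be updated whenever the data changes.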