How DoorDash labels Millions of Items with Large Language Models

How DoorDash uses GPT-4, RAG and LLM agents for labeling data. Plus, why you don't always need indexes, bloom filters explained visually and more.

Hey Everyone!

Today we’ll be talking about

  • How DoorDash uses LLMs for labeling their products

    • Issues DoorDash faces with Data Labeling

    • Solving the “Cold Start problem” with LLMs

    • Labeling Organic Products with LLM agents

    • Solving Entity Resolution with Retrieval Augmented Generation

  • Tech Snippets

    • Become a better communicator by focusing on the “kernel”

    • You don’t always need indexes

    • Why working quickly is more important than it seems

    • Bloom filters explained visually

How DoorDash uses LLMs for Data Labeling

Large Language Models currently come with a ton of hype and “FOMO” (fear of missing out), so it can be tempting to dismiss them as another hype train that’ll fizzle out.

There are definitely some aspects that are over-hyped, but it’s also important to note that LLMs have fundamentally changed how we’re building ML models and they’re now being used in NLP tasks far beyond ChatGPT.

DoorDash is the largest food delivery service in the US and they let you order items from restaurants, convenience stores, grocery stores and more.

Their engineering team published a fantastic blog post about how they use GPT-4 to generate labels and attributes for all the different items they have to list.

We’ll talk about the architecture of their system and how they use OpenAI embeddings, GPT-4, LLM agents, retrieval augmented generation and more to power their data labeling (we’ll define all these terms).

We’ll cover a ton of concepts on LLMs in this article.

If you’d like Spaced Repetition Flashcards (Anki) on all the concepts discussed in Quastor, check out Quastor Pro.

When you join, you’ll also get an up-to-date PDF with our past articles.

Data Labeling at DoorDash

As we mentioned earlier, DoorDash doesn’t just deliver food from restaurants. They also deliver groceries, medical items, beauty products, alcohol and much more.

With each of these items, the app needs to track specific attributes in order to properly identify the product.

For example, a can of Coke will have attributes like

Size: 12 fluid ounces
Flavor: Cherry
Type: Diet

On the other hand, a bottle of shampoo will have attributes like

Brand: Dove
Type: Shampoo
Keyword: Anti-Dandruff
Size: 500 ml

Every product has different attribute/value pairs based on what the item is. DoorDash needs to generate and maintain these for millions of items across their app.
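
One way to picture these attribute/value pairs is as a small record per item. The sketch below is purely illustrative: the `Product` class, its field names and the product names are assumptions for this example, not DoorDash’s actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    """Hypothetical catalog item; not DoorDash's real schema."""
    name: str
    attributes: dict[str, str] = field(default_factory=dict)


coke = Product(
    name="Diet Cherry Coke, 12 fl oz can",
    attributes={"Size": "12 fluid ounces", "Flavor": "Cherry", "Type": "Diet"},
)
shampoo = Product(
    name="Dove Anti-Dandruff Shampoo, 500 ml",
    attributes={
        "Brand": "Dove",
        "Type": "Shampoo",
        "Keyword": "Anti-Dandruff",
        "Size": "500 ml",
    },
)

# Different kinds of items carry different attribute keys;
# only some keys are shared between the two examples.
shared_keys = set(coke.attributes) & set(shampoo.attributes)
```

Because the set of keys varies by item type, a free-form mapping is a more natural fit here than a fixed set of columns.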

To do this, they need an intelligent, robust ML system that can handle creation and maintenance of these attribute/value pairs based on a product’s name and description.

We’ll talk about 3 specific problems DoorDash faced when building this system and how LLMs have helped them address the issues.

  1. Cold Start Problem

  2. Labeling Organic Products

  3. Entity Resolution

Solving the Cold Start Problem with LLMs

One big issue DoorDash faced with building this attribute/value creation system was the cold start problem (a classic issue with ML systems).

This happens when DoorDash onboards a new merchant and there are a bunch of new items that they’ve never seen before.

For example, what would DoorDash do if Costco joined the platform?

Costco sells a bunch of their own products (under the Kirkland brand), so a traditional NLP system wouldn’t recognize any of the Kirkland branded items (they weren’t in the training set).

However, Large Language Models are already trained on vast amounts of data. GPT-4 has knowledge of Costco’s products and even understands memes about Costco’s Rotisserie chicken.

With this base knowledge, LLMs can perform extremely well without any labeled examples (zero-shot prompting) or with just a few (few-shot prompting).
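
To make the difference concrete, here’s a minimal sketch of how a zero-shot and a few-shot prompt might be assembled for brand extraction. The instruction wording, the `build_prompt` helper and the example products are all assumptions for illustration; the source doesn’t show DoorDash’s actual prompts.

```python
from typing import Optional

# Assumed instruction text; not taken from DoorDash's system.
INSTRUCTION = "Extract the brand of the product. Reply with the brand name only."


def build_prompt(
    item_text: str,
    examples: Optional[list[tuple[str, str]]] = None,
) -> str:
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples)."""
    lines = [INSTRUCTION, ""]
    # Few-shot: show labeled examples before the item we want labeled.
    for example_text, example_brand in examples or []:
        lines += [f"Product: {example_text}", f"Brand: {example_brand}", ""]
    # The final "Brand:" is left open for the model to complete.
    lines += [f"Product: {item_text}", "Brand:"]
    return "\n".join(lines)


item = "Kirkland Signature Organic Extra Virgin Olive Oil, 2 L"
zero_shot = build_prompt(item)
few_shot = build_prompt(
    item,
    examples=[("Dove Anti-Dandruff Shampoo, 500 ml", "Dove")],
)
```

The only difference between the two prompts is the handful of worked examples placed ahead of the real question.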

Here’s the process DoorDash uses for dealing with the Cold Start problem:

  1. Traditional Techniques - The product name and description are passed to DoorDash’s in-house classifier. This is built with traditional NLP techniques for Named Entity Recognition.

  2. Use LLM for Brand Recognition - Items that cannot be tagged confidently are passed to an LLM. The LLM will take in the item’s name and description and is tasked with figuring out the brand of the item. For the shampoo bottle example, the LLM would return Dove.

  3. RAG to find Overlapping Products - DoorDash takes the brand name and product name/description and then queries an internal knowledge graph to find similar items. The brand name, product name/description and the results of this knowledge graph query are then passed together to an LLM (retrieval augmented generation). The LLM’s task is to see if the product being analyzed is a duplicate of any other products found in the internal knowledge graph.

  4. Adding to the Knowledge Base - If the LLM determines that the product is unique then it enters the DoorDash knowledge graph. Their in-house classifier (from step 1) is re-trained with the new product attribute/value pairs.
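
The four steps above can be sketched as a small routing function. Everything here is a hypothetical stand-in: the function names, the 0.8 confidence cutoff and the stubbed classifier/LLM callables are assumptions, since the source doesn’t describe DoorDash’s interfaces or thresholds. Re-training the classifier (the second half of step 4) is left out.

```python
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; the source gives no number


def label_new_item(
    text: str,
    classify: Callable[[str], tuple[Optional[str], float]],
    llm_brand: Callable[[str], str],
    find_similar: Callable[[str, str], list[str]],
    llm_is_duplicate: Callable[[str, list[str]], bool],
    knowledge_graph: list[dict],
) -> dict:
    # Step 1: try the in-house classifier first.
    brand, confidence = classify(text)
    # Step 2: fall back to the LLM when the classifier is not confident.
    if brand is None or confidence < CONFIDENCE_THRESHOLD:
        brand = llm_brand(text)
    # Step 3: retrieve similar items and ask an LLM whether this is a duplicate.
    candidates = find_similar(brand, text)
    duplicate = llm_is_duplicate(text, candidates)
    # Step 4: only unique products are added to the knowledge graph.
    if not duplicate:
        knowledge_graph.append({"brand": brand, "text": text})
    return {"brand": brand, "duplicate": duplicate}


graph: list[dict] = []
result = label_new_item(
    "Kirkland Signature Paper Towels, 12 rolls",
    classify=lambda text: (None, 0.0),            # stub: classifier has never seen the brand
    llm_brand=lambda text: "Kirkland Signature",  # stub standing in for an LLM call
    find_similar=lambda brand, text: [],          # stub: nothing similar in the graph
    llm_is_duplicate=lambda text, candidates: False,
    knowledge_graph=graph,
)
```

Passing the classifier and LLM calls in as plain callables keeps the routing logic testable without touching a real model.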

Organic Product Labeling with LLM agents

Another issue that DoorDash needs to solve is properly labeling organic products. One of their goals was to create a “Fresh & Organic” section in the app for customers who prefer those types of products.

Here are the steps DoorDash uses to figure out if a product is organic:

  1. String Matching - Look for the keyword “organic” in the product name/description. However, the product names/descriptions aren’t perfect and organic could be misspelled or it could go under a different name (“natural”, “non-GMO”, “hormone-free”, “unprocessed”, etc.). This is where LLMs come into play.

  2. LLM Reasoning - DoorDash will use LLMs to read the available product information and determine whether it could be organic. This has massively improved coverage and addressed the challenges faced with only doing string matching.

  3. LLM Agent - LLMs will also conduct online searches of product information and send the search results to another LLM for reasoning. This process of having LLMs use external tools (web search) and make internal decisions is called “agent-powered LLMs”. I’d highly recommend checking out LlamaIndex to learn more about this.
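
Here’s a rough sketch of the first two steps: a cheap keyword check, with an LLM as the fallback. The `is_organic` helper is hypothetical and `fake_llm_judge` is a toy stand-in for a real LLM call; the web-search agent from step 3 is not shown.

```python
import re
from typing import Callable

ORGANIC_PATTERN = re.compile(r"\borganic\b", re.IGNORECASE)


def is_organic(text: str, llm_judge: Callable[[str], bool]) -> tuple[bool, str]:
    """Return (label, source), where source records which step decided."""
    # Step 1: cheap exact keyword match.
    if ORGANIC_PATTERN.search(text):
        return True, "string_match"
    # Step 2: defer to an LLM for misspellings or indirect wording.
    return llm_judge(text), "llm"


def fake_llm_judge(text: str) -> bool:
    # Stand-in for a real LLM call; here it only catches one misspelling.
    return "orgnic" in text.lower()


exact = is_organic("Organic Baby Spinach, 5 oz", fake_llm_judge)
misspelled = is_organic("Orgnic Gala Apples", fake_llm_judge)
neither = is_organic("Classic Potato Chips", fake_llm_judge)
```

Recording which step produced the label makes it easier to audit how much coverage the LLM adds on top of plain string matching.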

Solving the Entity Resolution Problem with Retrieval Augmented Generation

Entity Resolution is where you take the product name/description of two items and figure out whether they’re referring to the same thing.

For example, does “Corona Extra Mexican Lager (12 oz x 12 ct)” refer to the same product as “Corona Extra Mexican Lager Beer Bottles, 12 pk, 12 fl oz”?

In order to accomplish this, DoorDash uses LLMs and Retrieval Augmented Generation (RAG).

RAG is a commonly used way to use language models like GPT-4 on your own data.

With RAG, you first take your input prompt and use that to query an external data source (a popular choice is a vector database) for relevant context/documents. You then add that context to your input prompt and feed the combined prompt to the LLM.

Adding this context from your own dataset helps personalize the LLM’s results to your own use case.
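
As a minimal sketch of that retrieve-then-augment loop, the example below uses a toy word-overlap retriever in place of a vector database. The `retrieve` and `augment` helpers, the prompt layout and the sample documents are assumptions for illustration only.

```python
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by how many words they share with the query (toy retriever)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def augment(query: str, context: list[str]) -> str:
    """Prepend the retrieved context to the original prompt."""
    context_block = "\n".join(f"- {doc}" for doc in context)
    return f"Context:\n{context_block}\n\nQuestion: {query}"


docs = [
    "Corona Extra Mexican Lager Beer Bottles, 12 pk, 12 fl oz",
    "Dove Anti-Dandruff Shampoo, 500 ml",
    "Diet Cherry Coke, 12 fl oz can",
]
question = "Which catalog item is a Corona Extra lager?"
top_docs = retrieve("Corona Extra lager", docs, k=1)
prompt = augment(question, top_docs)
```

The string in `prompt` is what would be sent to the LLM; the model never sees the full document set, only the retrieved slice.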

Here’s how DoorDash does this for Entity Resolution.

They’ll take a product name/description and run it through this process:

  1. Generate Embeddings Vector - A common way to compare strings is to use an embedding model to turn each string into a vector (a collection of numbers). This vector encodes meaning and knowledge about the original text, so strings like “queen” and “Beyoncé” will map to vectors that are “similar” in certain dimensions. 3Blue1Brown has an amazing video delving into word embeddings in a visual way.
    DoorDash uses OpenAI Embeddings to do this with the product’s name.

  2. Query the Vector Database - Once they generate a vector from the product name, they’ll query a vector database that stores the embedding vectors from all the other product names in DoorDash’s app. Then, they use approximate nearest neighbor search to retrieve the most similar products.

  3. Pass Augmented Prompt to GPT-4 - They take the most similar product names and then feed them to GPT-4. GPT-4 is instructed to read the product names and figure out if they’re referring to the same underlying product.
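
The three steps above can be sketched end to end with toy data. The 3-dimensional vectors below are made up by hand (a real embedding model returns far longer vectors), the “Modelo” catalog entry and prompt wording are invented for the example, and the search is an exact brute-force scan rather than the approximate nearest neighbor search a vector database would run.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Exact top-k search; a vector database would approximate this at scale."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query_vec, index[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors standing in for stored product-name embeddings.
index = {
    "Corona Extra Mexican Lager Beer Bottles, 12 pk, 12 fl oz": [0.9, 0.1, 0.0],
    "Modelo Especial Lager, 12 pk": [0.7, 0.3, 0.1],
    "Dove Anti-Dandruff Shampoo, 500 ml": [0.0, 0.1, 0.9],
}
query_name = "Corona Extra Mexican Lager (12 oz x 12 ct)"
query_vec = [0.88, 0.12, 0.02]  # pretend output of an embedding model

# Step 2: retrieve the most similar catalog names.
candidates = nearest(query_vec, index, k=2)

# Step 3: build the augmented prompt that would be sent to the LLM.
prompt = (
    "Do any of these catalog items refer to the same product as "
    f"'{query_name}'? Answer with the matching name or 'none'.\n"
    + "\n".join(f"- {name}" for name in candidates)
)
```

Retrieval narrows millions of catalog names down to a few candidates, so the LLM only has to make a small, focused comparison.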

With this approach, DoorDash has been able to generate annotations in less than ten percent of the time it previously took them.

Tech Snippets