How Lyft Localizes Millions of Users

Hey Everyone!

Today we’ll be talking about

  • How Lyft used client side Localization to improve Real Time Positioning for drivers and riders

    • GPS signals can be very noisy and unreliable. Therefore, Lyft also uses map data to match a driver/rider to a precise location on the map.

    • Previously, this was done server side, but Lyft shifted to doing this client side.

  • How DoorDash Built a Faster Indexing System for their Search feature

    • DoorDash has to index millions of food items and restaurants in their Search system. They wrote a great blog post on the data pipeline they use to do this.

    • They used Kafka, Flink and Elasticsearch

  • Tech Snippets

    • Rebuilding the world's most popular Spellchecker

    • A detailed guide to CDNs

    • Programming language trends

    • How we ended up with Git

    • Notes on maintaining a React Component library.


Client side localization at Lyft

Lyft is a ride-sharing app that allows riders to connect with drivers (similar to DiDi, Uber, Grab, Ola Cabs, etc.). The company is the second-largest ridesharing app in the US with 29% market share and close to 20 million users.

In order to help the rider and driver connect, the Lyft app shows both users a map with the real time location of the other person.

Lyft relies on GPS data from both the rider's and driver's mobile devices for real-time positioning. However, GPS signals are notoriously noisy and unreliable. Relying on GPS signals alone would mean inaccurate real-time positioning of the rider and driver, resulting in a poor user experience.

The Mapping team at Lyft solves this issue by using map data to more accurately localize the driver/rider. This reduces the space of locations to just the roads and makes it much easier to run map matching algorithms (match the user to the correct position on the map). You can read about the map matching algorithms that Lyft uses here.
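Lyft's production matching models are far more sophisticated than this (the post linked above covers them), but the core idea of snapping a noisy GPS reading onto the road network can be sketched in a few lines of Python. Everything below - the segment list and the snap_to_road helper - is an illustrative assumption, not Lyft's code.

```python
# Simplified sketch of "map matching": snap a noisy GPS point to the
# nearest road segment. This only illustrates how map data shrinks the
# search space to roads; real matchers are much more involved.

def project_onto_segment(p, a, b):
    """Project point p onto segment a-b (treating coordinates as planar for the sketch)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    abx, aby = bx - ax, by - ay
    denom = abx * abx + aby * aby
    if denom == 0:
        return a
    t = max(0.0, min(1.0, ((px - ax) * abx + (py - ay) * aby) / denom))
    return (ax + t * abx, ay + t * aby)

def snap_to_road(gps_point, road_segments):
    """Return the closest on-road position for a noisy GPS reading."""
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    candidates = (project_onto_segment(gps_point, a, b) for a, b in road_segments)
    return min(candidates, key=lambda c: dist2(gps_point, c))

# Hypothetical road segments as (start, end) coordinate pairs
segments = [((0.0, 0.0), (0.0, 1.0)), ((0.0, 1.0), (1.0, 1.0))]
print(snap_to_road((0.1, 0.4), segments))  # -> (0.0, 0.4), snapped onto the first road
```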

Previously, these localization systems ran server-side: map data was stored on Lyft's servers and the map matching algorithms were executed there. However, in 2021 Lyft made the transition to client-side localization.

Karina Goot is an Engineering Manager at Lyft and she wrote a great blog post on this transition.

Here’s a summary

Benefits of Client-side Localization

There are quite a few benefits from shifting localization to the client rather than running it server-side.

  • Network Benefits - Cell network availability is not guaranteed. Putting map data on the client and running the localization algorithms on the user's phone means better localization when the user can't connect to Lyft's servers (in a tunnel or in an area with poor coverage).

  • Efficiency and Performance Benefits - Moving the high-cost computation from the server to the client will simplify the server-side architecture, reduce hosting costs and also lower server-to-client latency.

  • Driver Safety Features - Running localization client side means that map data has to be on the driver’s phone. Lyft can later use that map data to add in additional features to the UI like symbols for traffic lights, stop signs, speed limits, etc.

The drawback to putting localization client-side is that the client environment is tightly constrained in both memory and latency.

Lyft cannot put too much map data on the client or that will cause the Lyft app to take up too much user storage. They also can’t send all the map data through the network as downloading data while on cellular is expensive and slow.

The Lyft Engineering team had to navigate these technical limitations and design the client localization system around them.

Designing the System

They broke the project down into three main components.

  1. Generating Lightweight Map Data

  2. Client Platform Networking Layer

  3. C++ Core Library for localization

Generating Lightweight Map Data

Lyft has to send map data to the client devices without taking up too much user storage, so they can only include data about the user’s local area. This means changing up the map data format, how it’s generated and how it’s served.

To do this, they used the S2 Geometry library. The S2 library was developed at Google and made open source in 2017. It represents all data on a three-dimensional sphere, which models map data better than traditional geographic information systems that represent data in two dimensions.

The team divided the entire LyftMap into small chunks of S2 Cells. When the client tries to download map data from the server, it specifies the cell id of the S2 cell and the map version. The server will then return the map data serialized as S2 Cell Elements.

The client will download the necessary cells based on the user location and dynamically build the road network graph in memory.
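To make that concrete, here's a minimal sketch of mapping a location to an S2 cell id using s2sphere, an open source Python port of S2. Lyft's clients use the C++ S2 library, and the cell level chosen here is an arbitrary assumption.

```python
# Sketch: map a user's location to an S2 cell id that the client could
# request from the server. Uses the s2sphere Python port of S2 purely
# for illustration; the level is arbitrary.
import s2sphere

def cell_id_for_location(lat, lng, level=13):
    """Return the S2 cell id (at a fixed level) containing a lat/lng point."""
    point = s2sphere.LatLng.from_degrees(lat, lng)
    return s2sphere.CellId.from_lat_lng(point).parent(level)

cell = cell_id_for_location(37.7749, -122.4194)  # downtown San Francisco
print(cell.id())     # 64-bit integer id the client could send to the server
print(cell.level())  # 13 -- cells at one level are roughly uniform in size
```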

Client Platform Networking Layer

Lyft created a backend service called MapAttributes that reads map element data from DynamoDB based on a geospatial index. The S2 library uses a quadtree data structure for geospatial indexing. Here's a great blog post on how S2 does indexing if you'd like to learn more. These map elements are serialized and converted to S2 Cell Elements.

To reduce latency and increase reliability, Lyft also uses AWS CloudFront’s CDN to cache the MapAttributes responses. CloudFront will return the results from the CDN on a cache hit without calling the MapAttributes service.
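Since a response is fully determined by the cell id and map version, repeat requests can be served straight from the CDN cache. Here's a rough sketch of what such a client fetch could look like - the endpoint URL and parameter names are made up for illustration:

```python
# Sketch of a client fetching map data keyed by (cell_id, map_version).
# The URL and parameter names are hypothetical -- the point is that the
# same (cell_id, map_version) pair always returns the same bytes, so the
# CDN can serve repeat requests from its cache.
import requests

def fetch_cell(cell_id: int, map_version: str) -> bytes:
    response = requests.get(
        "https://maps.example.com/map-attributes",  # hypothetical CDN-fronted endpoint
        params={"cell_id": cell_id, "map_version": map_version},
        timeout=5,
    )
    response.raise_for_status()
    return response.content  # serialized S2 Cell Elements
```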

Core Library Functionality

Once the client fetches the desired map cells from the server, it passes this information through to a C++ localization library on the client.

This library uses Lyft’s Map-Based Models for localization.

Lyft drivers generally operate in a single service area so their locality is highly concentrated. This causes duplicate downloads of the same data, leading to unnecessarily high network data usage.

To solve this, engineers added an in-memory SQLite caching layer directly in C++. They used SQLite because of its simplicity and native support on client platforms.

With this cache, they can store the highest locality map data for each driver directly on-device. By persisting the map data on disk, they can store data across sessions and only have to refresh the cache when the underlying map data changes.

Based on data analysis of driving patterns, the Lyft team found that they can achieve a high cache hit rate for the vast majority of Lyft drivers with only 15 megabytes of on-device data.
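Here's a rough Python sketch of that caching pattern. Lyft's actual cache is implemented inside their C++ core library, so the schema and function below are illustrative assumptions.

```python
# Sketch of an on-device SQLite cache keyed by (cell_id, map_version).
# This just shows the lookup-then-fetch pattern and why persisting to
# disk lets the cache survive across app sessions.
import sqlite3

conn = sqlite3.connect("map_cache.db")  # on-disk, so it persists across sessions
conn.execute("""
    CREATE TABLE IF NOT EXISTS map_cells (
        cell_id     INTEGER NOT NULL,
        map_version TEXT    NOT NULL,
        data        BLOB    NOT NULL,
        PRIMARY KEY (cell_id, map_version)
    )
""")

def get_cell(cell_id: int, map_version: str, fetch_from_server) -> bytes:
    """Return cached cell data, falling back to the network on a miss."""
    row = conn.execute(
        "SELECT data FROM map_cells WHERE cell_id = ? AND map_version = ?",
        (cell_id, map_version),
    ).fetchone()
    if row:
        return row[0]  # cache hit: no network call needed
    data = fetch_from_server(cell_id, map_version)
    conn.execute(
        "INSERT OR REPLACE INTO map_cells VALUES (?, ?, ?)",
        (cell_id, map_version, data),
    )
    conn.commit()
    return data
```

Keying the cache on both the cell id and the map version means a stale cell is simply never looked up again once the map version changes, which matches the "refresh only when the underlying map data changes" behavior described above.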

Results

In order to track the success of the project, Lyft looked at how often mobile clients have map data and what the latency of the map matching system was.

They found > 99% on-device map data availability among drivers and sub 10 ms latency for 99% of map matching computations.

With this project, drivers and riders now have significantly better map localization in the Lyft app.

You can read more here.

Tech Snippets

  • Rebuilding the world’s most popular Spellchecker - Hunspell is an open source spell checker used by Google Chrome, Mozilla Firefox, LibreOffice, macOS, Adobe products and a variety of other tools. You can view the source code for Hunspell here. This is a great series of blog posts by Victor Shepelev where he rebuilds Hunspell and walks through some of the assumptions and technical choices the project makes. It's well worth reading if you want to learn more about the complexities behind building a spellchecker.

  • How we ended up with Git - This is a great article on the history of source control and the different types of centralized and distributed source control systems. If you haven't heard of tools like Perforce, Subversion or SourceSafe, you might find this article interesting. For many years, the community around the Linux kernel didn't use a version control system; they would just coordinate using the mailing list. They then adopted BitKeeper, a commercial product, until Linus Torvalds built Git. This is a great article to get some of the history around that.

  • Programming Language Trends - This is a great article on some trends that the author believes (and hopes!) will happen over the 2020s. In terms of Types, he anticipates statically-typed languages will continue to gain ground over dynamically-typed languages (at least for large, corporate software). For Concurrency, he predicts a continuation of some of the last decade's trends, with things like Channels, Promises and implicit continuations, Permissions and more. Read the full article for predictions on Memory Management, Tooling and more.

  • Notes on maintaining an internal React component library - Gabe Scholz is a senior software engineer at Digital Ocean where he's responsible for being the primary maintainer of the company's internal React component library, Walrus. He wrote a great document on some principles he recommends when maintaining a component library that's used by a large number of front end applications.

  • A detailed guide to CDNs - Web.dev is a great resource for web developers with a ton of in-depth guides on various topics in building scalable websites. They have a great guide on Content Delivery Networks with information on how to choose a CDN, improving the cache hit ratio and other performance tweaks.

Building Faster Indexing with Apache Kafka and Elasticsearch

DoorDash is the largest food delivery app in the United States with more than 20 million consumers and 450 thousand restaurants.

A critical part of the DoorDash app is the search function. You can search for Scallion Pancakes and the DoorDash app will give you restaurants near you that are open and currently serving that dish.

Solving this problem at scale is quite challenging, as restaurants are constantly changing their menus, store hours, locations, etc.

You need to quickly index all of the store data to provide a great restaurant discovery feature.

Satish, Danial, and Siddharth are software engineers on DoorDash’s Search Platform team, and they wrote a great blog post about how they built a faster indexing system with Apache Kafka, Apache Flink and Elasticsearch.

Here’s a summary

DoorDash’s Problem with Search Indexing

DoorDash’s legacy indexing system was very slow, unreliable and not extensible. It took a long time for changes in store and item descriptions to be reflected in the search index. It was also very difficult to assess the indexing quality.

There were frequent complaints about mismatches in store details between the search index and the source of truth. These had to be fixed manually.

The New System

Engineers solved these problems by building a new search indexing platform with the goals of providing fast and reliable indexing while also improving search performance.

The new platform is built on a data pipeline that uses Apache Kafka as a message queue, Apache Flink for data transformation and Elasticsearch as the search engine.

The components of the architecture are

  • Data sources - These are the sources of truth for the data. When CRUD operations take place on the data (changing store menu, updating store hours, etc.) then they are reflected here. DoorDash uses Postgres as the database and Snowflake as the data warehouse.

  • Data destination - DoorDash is using Elasticsearch here as the final data destination. It will serve as the data store and search engine.

  • Flink application - There are two custom Apache Flink applications in this pipeline: Assembler and ES Sink. Assembler is responsible for assembling all the data required in an Elasticsearch document. ES Sink is responsible for shaping the documents as per the schema and writing the data to the targeted Elasticsearch cluster.

  • Message queue - Two Kafka queues (Kafka 1 and Kafka 2) connect the stages: one carries change events from the data sources into the Assembler, and the other carries the assembled documents from the Assembler to ES Sink.

This data pipeline allows for fast, incremental changes to the search index when there are changes to the restaurant data.

The changes in data sources are propagated to Flink applications using Kafka. The Flink apps implement business logic to curate the search documents and then write them to Elasticsearch.
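Conceptually, the two Flink applications look something like the sketch below, with plain Python functions standing in for the actual Flink jobs and invented field names for the events and documents.

```python
# Conceptual sketch of the two pipeline stages. In production these are
# Apache Flink applications connected by Kafka topics; the event and
# document field names here are invented for illustration.

def assemble(change_event: dict, backend) -> dict:
    """Assembler: enrich a lightweight change event into a full search document."""
    store = backend.get_store(change_event["store_id"])  # hypothetical backend call
    return {
        "store_id": store["id"],
        "name": store["name"],
        "menu_items": store["menu_items"],
        "is_open": store["is_open"],
    }

def es_sink(document: dict, index_writer, schema_fields) -> None:
    """ES Sink: shape the document to the index schema and write it out."""
    shaped = {field: document[field] for field in schema_fields if field in document}
    index_writer(shaped)  # writes to the target Elasticsearch index
```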

Incremental Indexing

The indexing pipeline processes two main types of data changes.

The first type of data change is when human operators make ad hoc changes to stores or restaurant items. An example of a possible data change is a restaurant owner adding a new dish to her menu.

The second type of data change is ETL data changes that are generated from machine learning models. Things like restaurant ratings/scores or auto-generated tags are generated by machine learning models and then stored in a data warehouse.

Both of these changes need to be reflected in the search index for the best customer experience.

Here’s how DoorDash does it.

Indexing Human Operator Changes

Restaurant owners will frequently update their menus and store information. These changes need to be reflected in the search experience as quickly as possible.

The updates are saved in data stores like Postgres.

To keep track of these updates, DoorDash search engineers rely on Change Data Capture (CDC) events.

DoorDash engineers implemented save hooks in the application to propagate change events to Kafka whenever there is a change on the underlying data store.

After receiving the Kafka events, the Assembler app will make backend calls to gather more information about the change and to create an event which it pushes to Kafka for the ES Sink app to consume.

They tested other solutions, like the Debezium connector (a Red Hat-developed open source project for capturing row-level changes in Postgres), but they found that this strategy had too much overhead and was not performant.
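As a rough illustration of the save-hook approach, here's a sketch using the kafka-python client. The topic name and event shape are assumptions, not DoorDash's actual schema.

```python
# Sketch of a save hook emitting a change event to Kafka after a write.
# Uses the kafka-python client for illustration only.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def save_menu_item(db, item):
    db.save(item)  # hypothetical write to the source-of-truth datastore (Postgres)
    # Save hook: after the write succeeds, publish a lightweight change event.
    # The Assembler later calls back into the service for the full record.
    producer.send("store_change_events", {
        "entity": "menu_item",
        "store_id": item["store_id"],
        "item_id": item["id"],
        "operation": "upsert",
    })
```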

Indexing ETL data

Many properties that are used in the search index are generated by ML models. Things like restaurant scores, auto-generated tags, etc.

These properties are updated in bulk, once a day. The data gets populated into tables in DoorDash's data warehouse after a nightly run of the respective ETL jobs.

The CDC pattern described for human operator changes doesn't work here because there isn't a constant stream of changes throughout the day. Instead, there's one bulk update that happens once a day.

Using the CDC pattern described above would overwhelm the system when making the bulk update due to the size of the update.

Therefore, DoorDash engineers built a custom Flink source function which spreads out the ETL ingestion over a 24 hour interval so that the systems don’t get overwhelmed.

The Flink source function will periodically stream rows from an ETL table to Kafka in batches, where the batch size is chosen to ensure that the downstream systems do not get overwhelmed.
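Here's a simplified sketch of that pacing logic in plain Python. The real version is a custom Flink source function; the batch size, reader and sender below are illustrative assumptions.

```python
# Sketch of spreading a daily ETL table over 24 hours: fixed-size
# batches, with sleeps sized so the full table takes roughly a day.
import time

SECONDS_PER_DAY = 24 * 60 * 60

def stream_etl_table(read_batch, send_to_kafka, total_rows: int, batch_size: int = 500):
    """Stream total_rows rows in batches, paced across a 24 hour window."""
    num_batches = max(1, (total_rows + batch_size - 1) // batch_size)
    pause = SECONDS_PER_DAY / num_batches          # gap between batches
    for offset in range(0, total_rows, batch_size):
        rows = read_batch(offset, batch_size)      # hypothetical reader over the ETL table
        for row in rows:
            send_to_kafka(row)                     # downstream Assembler consumes these
        time.sleep(pause)                          # keep downstream systems from being overwhelmed
```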

Sending documents to Elasticsearch

Once the Assembler application publishes data to Kafka, the consumer (ES Sink) will read those messages, transform them according to the specific index schema, and then send them to their appropriate index in Elasticsearch.

ES Sink utilizes the Flink Elasticsearch Connector to write JSON documents to Elasticsearch.

It has rate limiting and throttling capabilities out of the box, which are essential for protecting Elasticsearch clusters when the system is under heavy write load.
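For a rough sense of the write path, here's a sketch using the official Elasticsearch Python client's bulk helper instead of the Flink connector DoorDash actually uses. The index name and document shape are assumptions.

```python
# Sketch of bulk-writing shaped documents to Elasticsearch, using the
# official Python client's bulk helper purely to show the write path.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def write_documents(documents, index_name="stores"):  # index name is hypothetical
    actions = (
        {"_index": index_name, "_id": doc["store_id"], "_source": doc}
        for doc in documents
    )
    # helpers.bulk batches the requests; chunk_size acts as a crude throttle.
    helpers.bulk(es, actions, chunk_size=500)
```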

Results

With the new search indexing platform, updates happen much faster. The time needed to reindex existing stores and items on the platform fell from 1 week to 2 hours.

Relying on open source tools for the index also means there's plenty of accessible documentation online, and it's easier to find engineers with this expertise who can join the DoorDash team in the future.

For information on how DoorDash backfilled the search index (and more!), read the full blog post here.

Interview Question

You are given an array filled with letters and numbers.

Find the longest subarray with an equal number of letters and numbers.

Return the longest subarray.

If there are multiple results, then return the subarray with the lowest starting index.

Previous Question

As a refresher, here’s the previous question

Write a function that adds two numbers.

You cannot use + or any arithmetic operators!

Solution

As you might’ve guessed, we can solve this question with Bit Manipulation.

First, let’s walk through addition in Base 10.

We want to add 234 + 983.

We can add these two numbers by breaking the operation down into “addition” and “carry-over”

For the addition part, we’ll add 234 + 983 but ignore any carry overs.

That will result in 117, since 2 + 9 gives 1 (we drop the carried 1), 3 + 8 gives 1, and 4 + 3 gives 7.

Now, we’ll do the carry overs.

The carry-over for 234 + 983 is 1100 (3 + 8 carries a 1 into the hundreds place, and 2 + 9 carries a 1 into the thousands place).

Now, we can add the addition part and the carry over part to get the final sum.

117 + 1100 = 1217, which is equal to 234 + 983.

So, for doing this operation with bit manipulation, we can just repeat the same steps.

We can simulate the addition part with the XOR operator. If we add the two binary numbers together without carrying over, the ith bit of the sum will be 0 when a and b have the same ith bit (both 0 or both 1) and 1 otherwise, which is exactly what XOR computes.

For the carry-over, the ith bit will be 1 only if the (i - 1)th bits of a and b are both 1. This can be simulated with an AND operation followed by a left shift by 1.

Then, we have to add the addition part and the carry over part. We can do this by recursively calling our add function.

The base case will be when one of the numbers is 0, then we can just return the other number.
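Here's a minimal Python version of this approach, shown for non-negative integers (Python's arbitrary-precision ints would need explicit bit masking to handle negative inputs):

```python
def add(a: int, b: int) -> int:
    """Add two non-negative integers without using + or other arithmetic operators."""
    if b == 0:                       # base case: nothing left to carry
        return a
    partial_sum = a ^ b              # addition ignoring carries
    carry = (a & b) << 1             # carries, shifted into the next bit position
    return add(partial_sum, carry)   # repeat until no carries remain

print(add(234, 983))  # 1217
```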