How McDonalds implements Event Driven Architectures

McDonalds created an Eventing Platform that internal teams could use for EDAs. We'll talk about the architecture of this platform.

Hey Everyone!

Today we’ll be talking about

  • Event Driven Architectures at McDonalds

    • McDonalds relies on EDAs for things like sending marketing promotions or tracking mobile order progress

    • They built the Eventing Platform to give internal teams an easy way to spin up producers and consumers for events

    • The Eventing Platform is built on AWS MSK and DynamoDB

  • How Mixpanel Fixed Their Load Balancing Problem

    • An overview of Mixpanel's in-house database

    • How an unequal workload was giving Mixpanel load balancing problems

    • Solving this using the Power of 2 Choices load balancing technique

  • Tech Snippets

    • Distributed Systems for Fun and Profit

    • Testing Infrastructure at Uber

    • The Joy of Cryptography

    • How Netflix uses Neural Networks to Improve Video Quality

Plus, we have a solution to our last coding interview question on linked lists.

10 Critical KPIs for Tracking How Well You’re Deploying, Releasing and Shipping Software

The best dev teams know better KPIs result in better products. But what should you track?

LaunchDarkly looked at high performing engineering organizations and identified 10 common KPIs that companies measure.

Some are

  • Feature Time-to-Production - how long from the time a developer accepts the work to when it appears live in production?

  • Test Flakiness - flaky tests undermine confidence in the entire test suite, so it’s very important to minimize this.

  • Number of Expected Deployments - How many software releases occur on a planned schedule vs. releases that are ad hoc hotfixes/patches? Having too many hotfixes and bugs could be a sign of too much technical debt.

You can view the rest of the KPIs at the link below.

sponsored

Event Driven Architectures at McDonalds

Over the last few years, McDonalds has been investing heavily in building and promoting their mobile app, which customers can use to place orders, get coupons, view nutrition information and more.

As a result, McDonalds has the most downloaded restaurant app in the United States, with over 24 million downloads in 2021. Starbucks came in second with 12 million downloads that year.

For their backend, McDonalds relies on event driven architectures (EDAs) for many of their services. For example, sending marketing communications (a deal or promotion) and tracking mobile-order progress are both handled by backend services built with an EDA.

A problem McDonalds was facing was the lack of a standardized approach for building event driven systems. They saw a variety of technologies and patterns used by their various teams, leading to inconsistency and unnecessary complexity.

To solve this, engineers built a unified eventing platform that the different teams at the company could adopt when building their own event based systems. This reduced the implementation and operational complexity of spinning up and maintaining these services while also ensuring consistency.

Vamshi Krishna Komuravalli is a Principal Architect at McDonalds and he wrote a great blog post on the architecture of McDonald’s unified eventing platform.

Here’s a summary

McDonalds wanted to build a unified eventing platform that could manage the communication between different producers and consumer services at the company.

Their goals were that the platform be scalable, highly available, performant, secure and reliable. They also wanted consistency around how things like error handling, monitoring and recovery were done.

Here’s the architecture of the unified eventing platform they built to handle this.

  • Event Broker - The event broker is the core of the platform and manages communication between the producers and consumers. McDonalds uses AWS, so they used Amazon Managed Streaming for Apache Kafka (MSK) as the event broker.

  • Schema Registry - Engineers wanted to enforce a uniform schema for events to ensure data quality for downstream consumers. The Schema Registry contains all the event schemas; producers and consumers check the registry to validate events when publishing/consuming.

  • Standby Event Store - If Kafka is unavailable for whatever reason, the producers will send their events to the Standby Event Store (DynamoDB). An AWS Lambda function will handle retries and try to publish events in the Standby Event Store to Kafka.

  • Custom SDKs - Engineers who want to use the Eventing Platform can do so with an SDK. The eventing platform team built language-specific libraries for both producers and consumers to write and read events. Schema validation, retries and error handling are abstracted away by the SDK.

Here’s the workflow for an event going through McDonald’s eventing platform

  1. Engineers define the event's schema and register it in the Schema Registry

  2. Applications that need to produce events use the producer SDK to publish events

  3. The SDK performs schema validation to ensure the event follows the schema

  4. If the validation checks out, the SDK publishes the event to the primary topic

  5. If the SDK encounters an error with publishing to the primary topic, then it routes the event to the dead-letter topic for that producer. An admin utility allows engineers to fix any errors with those events.

  6. If the SDK encounters an error with Kafka, then the event is written to the DynamoDB database. A Lambda function will later retry sending these events to Kafka.

  7. Consumers consume events from Kafka
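
To make the flow above concrete, here's a rough sketch of what a producer-side call could look like. This is not McDonald's actual SDK: the function, topic and table names are hypothetical, and it assumes the kafka-python, jsonschema and boto3 libraries.

    # Hypothetical producer-side sketch (not McDonald's SDK); names are made up.
    import json
    import boto3
    import jsonschema
    from kafka import KafkaProducer
    from kafka.errors import KafkaError, KafkaTimeoutError

    producer = KafkaProducer(bootstrap_servers="broker:9092")
    standby_table = boto3.resource("dynamodb").Table("standby-event-store")  # hypothetical table

    def publish_event(event, schema, topic, dead_letter_topic):
        # Step 3: validate the event against its registered schema
        jsonschema.validate(instance=event, schema=schema)  # raises on a malformed event

        payload = json.dumps(event).encode()
        try:
            # Step 4: publish to the primary topic and wait for the broker's ack
            producer.send(topic, payload).get(timeout=10)
        except KafkaTimeoutError:
            # Step 6: the broker looks unavailable, so park the event in the
            # standby store (DynamoDB); a Lambda function retries it later.
            standby_table.put_item(Item={"eventId": event["id"], "payload": payload.decode()})
        except KafkaError:
            # Step 5: the publish to the primary topic failed for another reason,
            # so route the event to this producer's dead-letter topic.
            producer.send(dead_letter_topic, payload)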

Schema Registry

In order to ensure data integrity, McDonalds relies on a Schema Registry to enforce data contracts on all the events published and consumed from the eventing platform.

When the producer publishes a message, the message has a byte at the beginning that contains versioning information.

Later, when the messages are consumed, the consumer can use the byte to determine which schema the message is supposed to follow.
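
The blog post doesn't show the exact wire format, but the idea is similar to the "magic byte" framing used by other schema registries. A rough illustration (the encoding below is made up for clarity, not McDonald's actual format):

    # Illustration of version-byte framing; the format here is hypothetical.
    import json

    def encode(event, schema_version):
        # The first byte carries the schema version; the rest is the serialized event.
        return bytes([schema_version]) + json.dumps(event).encode()

    def decode(message):
        schema_version = message[0]        # consumer reads the leading byte...
        event = json.loads(message[1:])    # ...then parses the payload against that schema
        return schema_version, event

    framed = encode({"orderId": 42}, schema_version=3)
    version, event = decode(framed)        # version == 3, event == {"orderId": 42}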

Domain-based Sharding and Autoscaling

Events are sharded into different MSK clusters based on their domain. This makes autoscaling easier as you can just scale the specific cluster for whatever domain is under heavy load.

To deal with an increase in read/write load, the eventing platform team added logic to watch the CPU metrics for MSK and trigger an autoscaler function to add additional broker machines to the cluster if load gets too high. For increasing disk space, MSK has built-in functionality for autoscaling storage.
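
The post doesn't include the autoscaler code, but as a sketch it could look something like the following: read a broker CPU metric from CloudWatch and call MSK's UpdateBrokerCount API when it crosses a threshold. The cluster name, ARN and threshold here are placeholders.

    # Hypothetical autoscaler sketch; thresholds and names are assumptions.
    from datetime import datetime, timedelta
    import boto3

    CLUSTER_ARN = "arn:aws:kafka:us-east-1:123456789012:cluster/loyalty-domain/..."  # placeholder
    CPU_THRESHOLD = 70.0  # assumed scale-out threshold (percent)

    cloudwatch = boto3.client("cloudwatch")
    kafka = boto3.client("kafka")

    def scale_if_needed():
        # Average CPU for one broker over the last 15 minutes
        # (in practice you'd aggregate across all brokers in the cluster).
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/Kafka",
            MetricName="CpuUser",
            Dimensions=[
                {"Name": "Cluster Name", "Value": "loyalty-domain"},
                {"Name": "Broker ID", "Value": "1"},
            ],
            StartTime=datetime.utcnow() - timedelta(minutes=15),
            EndTime=datetime.utcnow(),
            Period=300,
            Statistics=["Average"],
        )
        datapoints = stats["Datapoints"]
        if not datapoints:
            return
        avg_cpu = sum(p["Average"] for p in datapoints) / len(datapoints)

        if avg_cpu > CPU_THRESHOLD:
            # Add brokers to the overloaded domain's cluster (MSK requires the
            # total broker count to be a multiple of the number of AZs).
            info = kafka.describe_cluster(ClusterArn=CLUSTER_ARN)["ClusterInfo"]
            kafka.update_broker_count(
                ClusterArn=CLUSTER_ARN,
                CurrentVersion=info["CurrentVersion"],
                TargetNumberOfBrokerNodes=info["NumberOfBrokerNodes"] + 3,
            )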

For more details, you can read the full blog posts here and here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.

Tech Snippets

  • Distributed Systems for Fun and Profit - This is a great series of in-depth blog posts that give an introduction to working with distributed systems. Topics covered include a discussion around time (vector clocks, global vs. local vs. no-clock assumptions), replication (synchronous, asynchronous, partition tolerance), CRDTs and much more!

  • Testing Infrastructure at Uber - This is an interesting blog post by Uber Engineering that delves into SLATE - Short-Lived Application Testing Environments. This is a system that lets developers easily create testing environments and deploy services in them. These services will use production instances of dependencies by default and are very easy to set up, tear down, and configure.

  • The Joy of Cryptography - A fantastic (free) textbook that gives an undergraduate-level introduction to cryptography. The book goes through generating pseudo-random numbers, cryptographic hash functions, RSA and more.

  • How Netflix uses Neural Networks to Improve Video Quality - When preprocessing video for different devices, Netflix has to downscale the content. A 4k source video will be downscaled to 1080p, 720p, 540p, and more. To do this, Netflix previously used conventional resampling filters, like Lanczos. After much research, they've replaced this with neural networks. They published an interesting blog post delving into how they do this and the results they've gotten.

How Mixpanel Fixed their Load Balancing Problem

Mixpanel is an analytics product that you can embed into your website/app to get detailed data on how your users are behaving (similar to Google Analytics).

In order to best serve their users, Mixpanel needs to support real-time event ingestion while also supporting fast analytical queries over all of a user’s history.

When a user on a Mixpanel-tracked website clicks a button or navigates to a new page, that event needs to be ingested and stored in Mixpanel in under a minute (real-time event ingestion).

If a Mixpanel customer wants to see the number of sign up conversion events over the past 6 months, they should be able to query that data quickly (fast analytical queries).

Mixpanel accomplishes this with their in-house database, Arb. They leverage both row-oriented and column-oriented data formats: row-oriented works better for real-time event ingestion, while column-oriented works well for analytical queries. This is based on the classic Lambda Architecture, where you have a speed layer for real-time views and a batch layer for historical data.

If you're interested in learning more about Mixpanel's system architecture, you can read about it here.

In order to convert data from row format to a columnar format, Mixpanel has a service called Compacter.

Vijay Jayaram was the Principal Tech Lead Manager of the Performance team at Mixpanel, and he wrote a great blog post on technical challenges the company faced when scaling the Compacter service and how they overcame them.

Here’s a summary

End-user devices with a Mixpanel tracker send event data to Mixpanel’s API and these events are pushed onto queues.

From the queues, the data flows into Mixpanel’s storage system, where storage nodes write the events to disk in a row-oriented format.

Then, the Compacter service will convert the data from row format to columnar format, making it faster to query.

Given the nature of the work, the Compacter service is very computationally expensive. It runs in an autoscaling nodepool on Google Kubernetes Engine.

When a storage node has a row file of a certain size/age, it will send a request to a randomly selected compacter node to convert it. The compacter node will then return a handle to the resulting columnar file.

If a compacter node has too many requests, then it’ll shed load and return an error. The storage node will retry after a backoff period.

A Skew Problem

Mixpanel engineers were having a great deal of trouble scaling the compacter in time to absorb spikes in load. The service failed to autoscale quickly enough, which resulted in a spike in errors (storage node requests were getting shed, and so were the retries).

Engineers would have to manually set the autoscaler’s minimum number of nodes to a higher number to deal with the load. This resulted in a waste of engineer time and also inefficient provisioning.

When Mixpanel looked at the average utilization of nodes in the compacter service, they expected it to be at 80-90%. This would mean that the compute provisioned in the service was being used efficiently.

However, they found that average CPU utilization was ~40%. They checked the median utilization and the 90th percentile utilization to find that while median utilization was low, the 90th percentile utilization was near 80%.

This meant that half the compacter nodes provisioned were doing little work, while the top 10% of nodes were maxed out.

This is what broke the autoscaling: the autoscaling algorithm was making its scaling decisions based on average utilization, which stayed low even while some nodes were maxed out.
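
As a tiny numeric illustration of how the average hid the problem (the utilization numbers below are made up to match the shape described above):

    # 50 hypothetical compacter nodes: many lightly loaded, the busiest 10% near saturation.
    utilizations = [0.30] * 40 + [0.80] * 10
    avg = sum(utilizations) / len(utilizations)
    p90 = sorted(utilizations)[int(0.9 * len(utilizations))]
    print(round(avg, 2), p90)  # prints 0.4 and 0.8: a low average despite a hot tail of nodes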

Cause for Skew

Engineers were confused about why there was a skew since the storage nodes were randomly selecting compacter nodes based on a uniform random distribution (Randomized Static load balancing). Each compacter node was equally likely to be selected for a row-to-column conversion job.

However, the individual jobs had a very uneven distribution of computational load, which caused a large work skew between the compacter nodes. Mixpanel has a vast range of customers, from startups with thousands of events per day to large companies with billions of events per day.

This meant that the individual jobs were distributed based on a power law, where the largest jobs were significantly larger than the smallest jobs. Some compacter nodes were getting significantly more time-consuming jobs than other nodes and this is what caused the work skew between the nodes.

Having unequal load also presents problems for many other load balancing algorithms, like Round Robin.

The Power of 2-Choices

Mixpanel considered several solutions to solve this including inserting a queue between the storage nodes and compacters or inserting a more complex load balancing service. You can check out the full post to read about these options.

They went with a far simpler solution. They used a popular strategy called The Power of 2-Choices, which uses randomized load balancing.

Instead of the storage nodes randomly picking 1 compacter, they randomly pick 2 compacter nodes. Then, they ask each node for its current load and send the request to the less loaded of the two.

There have been quite a few papers on this strategy, and it’s been found to drastically reduce the maximum load compared to having just one choice. It’s used quite frequently with load balancers like Nginx. Mixpanel wrote some quick Python simulations to confirm their intuition about how The Power of 2-Choices worked.
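
Here’s a minimal sketch of that kind of simulation (the node count and the heavy-tailed job-size distribution below are made-up parameters, not Mixpanel’s):

    # Compare purely random assignment to the power of 2 choices under a
    # heavy-tailed (Pareto) job-size distribution that mimics customer skew.
    import random

    NODES = 50
    JOBS = 10_000

    def simulate(choices):
        load = [0.0] * NODES
        for _ in range(JOBS):
            job_size = random.paretovariate(1.5)               # a few jobs are huge
            candidates = random.sample(range(NODES), choices)  # pick 1 or 2 nodes at random
            target = min(candidates, key=lambda n: load[n])    # send to the less loaded one
            load[target] += job_size
        return max(load) / (sum(load) / NODES)                 # max load relative to the average

    random.seed(42)
    print("one random choice :", round(simulate(1), 2))  # hottest node sits far above average
    print("power of 2 choices:", round(simulate(2), 2))  # the gap shrinks dramatically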

Implementing this in their system was extremely easy and it ended up massively closing the gap between the median utilization and the 90th percentile utilization.

Average utilization increased to 90% and the error rate dropped to nearly 0 in steady state since the compacters rarely had to shed load.

For more details, you can read the full summary here.

Interview Question

You are given a doubly-linked list where each node also has a child pointer.

The child pointer may or may not point to a separate doubly linked list.

These child lists may have one or more children of their own, and so on, to produce a multilevel data structure.

Flatten the list so that all the nodes appear in a single-level, doubly linked list.

Previous Question

As a reminder, here’s our last question

You are given an array of k linked lists.

Each list is sorted in ascending order.

Merge all the linked lists into one sorted linked list and return it.

Solution

We'll start with k linked lists and we'll merge every pair of linked lists (using the combineLists function).

This leaves us with k/2 remaining linked lists.

We can then repeat this merging process on the remaining lists, where we combine every pair to result in k/4 linked lists, k/8 linked lists, and so on.

We continue this process until we reach our final linked list.

If we have an odd number of linked lists to combine, then the last list is left unpaired; we just carry it over to be combined in the next round.

The number of nodes we will be traversing for each combination step scales with k*N, where N is the average number of nodes in our linked lists and k is the number of linked lists. So k*N is the total number of nodes across all lists.

We'll have to repeat the merge process log(k) times, so this means our time complexity will be O(k*N*log(k)).
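
Here's one way to write this approach in Python (the ListNode class is a minimal stand-in, and combine_lists plays the role of the combineLists helper mentioned above):

    # Pairwise merging of k sorted linked lists, as described above.
    class ListNode:
        def __init__(self, val=0, next=None):
            self.val = val
            self.next = next

    def combine_lists(l1, l2):
        # Standard merge of two sorted linked lists.
        dummy = tail = ListNode()
        while l1 and l2:
            if l1.val <= l2.val:
                tail.next, l1 = l1, l1.next
            else:
                tail.next, l2 = l2, l2.next
            tail = tail.next
        tail.next = l1 or l2
        return dummy.next

    def merge_k_lists(lists):
        if not lists:
            return None
        while len(lists) > 1:
            merged = []
            # Combine every pair; an unpaired last list carries over to the next round.
            for i in range(0, len(lists) - 1, 2):
                merged.append(combine_lists(lists[i], lists[i + 1]))
            if len(lists) % 2 == 1:
                merged.append(lists[-1])
            lists = merged
        return lists[0]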