How Netflix Syncs Hundreds of Millions of Devices

The architecture of Netflix's Rapid Event Notification System. Plus GraphQL at LinkedIn and how Large Language Models are thinking.

Arpan KG
December 11, 2022

Hey Everyone!

Today we’ll be talking about

How Netflix keeps hundreds of millions of user devices synced
- Users will typically use Netflix with more than one device. They expect their watch history, maturity ratings, membership plan, profile changes, etc. to be synced between all their devices.
- Netflix built their Rapid Event Notification System to manage this task. It handles 150,000 events per second and is built on AWS SQS and Cassandra.
- We'll talk about some of the design decisions behind the system and the architecture.
Tech Snippets
- Building a BitTorrent client from the ground up in Go
- How Large Language Models are "Thinking"
- GraphQL at LinkedIn
- How to write better Git Commit Messages

Plus, we have a solution to our last coding interview question on flattening a multilevel doubly linked list.

Netflix's Rapid Event Notification System

Netflix is an online video streaming service that operates at insane scale. They have more than 220 million active users and account for more of the world's downstream internet traffic than YouTube (in 2018, Netflix accounted for ~15% of the world’s downstream traffic).

These 220 million active users are accessing their account from multiple devices, so Netflix engineers have to make sure that all the different clients that a user logs in from are synced.

You might start watching Breaking Bad on your iPhone and then switch over to your laptop. After you switch to your laptop, you expect Netflix to continue playback of the show exactly where you left off on your iPhone.

Syncing between all these devices for all of their users requires an immense amount of communication between Netflix’s backend and all the various clients (iOS, Android, smart TV, web browser, Roku, etc.). At peak, this amounts to roughly 150,000 events per second.

To handle this, Netflix built RENO, their Rapid Event Notification System.

Ankush Gulati and David Gevorkyan are two senior software engineers at Netflix, and they wrote a great blog post on the design decisions behind RENO.

Here's a Summary

Netflix users will be using their account with different devices. Therefore, engineers have to make sure that things like viewing activity, membership plan, movie recommendations, profile changes, etc. are synced between all these devices.

The company uses a microservices architecture for their backend, and built the RENO service to handle this task.

There were several key design decisions behind RENO.

Single Event Source - All the various events (viewing activity, recommendations, etc.) that RENO has to track come from different internal systems. To simplify this, engineers used an Event Management Engine that serves as a level of indirection. This Event Management Engine layer is the single source of events for RENO. All the events from the various backend services go to the Event Management Engine, from where they’re passed to RENO. This Engine was built using Netflix's internal distributed computation framework called Manhattan. You can read more about that here.
Event Prioritization - If a user changes their child’s profile maturity level, that event change should have a very high priority compared to other events. Therefore, each event-type that RENO handles has a priority assigned to it and RENO then shards by that event priority. This way, Netflix can tune system configuration and scaling policies differently for events based on their priority.
Hybrid Communication Model - RENO has to support mobile devices, smart TVs, browsers, etc. While a mobile device is almost always connected to the internet and reachable, a smart TV is only online when in use. Therefore, RENO has to rely on a hybrid push AND pull communication model, where the server tries to deliver all notifications to all devices immediately using push. Devices will also pull from the backend at various stages of the application lifecycle. Solely using pull doesn’t work because it makes the mobile apps too chatty (or else the updates won't be synced fast enough) and solely using push doesn’t work when a device is turned off.
Targeted Delivery - RENO has support for device specific notification delivery. If a certain notification only needs to go to mobile apps, RENO can solely deliver to those devices. This limits the outgoing traffic footprint significantly.
Managing High RPS - At peak times, RENO serves 150,000 events per second. This high load can put strain on the downstream services. Netflix handles this high load by adding various gate checks before sending an event. Some of the gate checks are

Staleness - Many events are time sensitive, so RENO will not send an event if it’s older than its staleness threshold.
Online Devices - RENO keeps track of which devices are currently online using Zuul. It will only push events to a device if it’s online.
Duplication - RENO checks incoming events for duplicates and de-duplicates them.
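
To make the gate checks a bit more concrete, here's a minimal sketch in Python. This is not Netflix's implementation: the `Event` fields, the `GateChecks` class and the in-memory sets are all assumptions made for illustration (the real system tracks online devices through Zuul, not a Python set).

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # All field names here are hypothetical, chosen for the example.
    event_id: str      # unique ID, used for de-duplication
    device_id: str     # device the event should be pushed to
    created_at: float  # Unix timestamp in seconds
    max_age_s: float   # staleness threshold for this event type


class GateChecks:
    """Decides whether an event should be pushed to a device."""

    def __init__(self, online_devices):
        # Stand-in for an online-device registry.
        self.online_devices = set(online_devices)
        # IDs of events that have already passed the gate.
        self.seen_ids = set()

    def should_send(self, event, now=None):
        now = time.time() if now is None else now
        # Staleness: skip events older than their threshold.
        if now - event.created_at > event.max_age_s:
            return False
        # Online devices: only push to devices that are reachable.
        if event.device_id not in self.online_devices:
            return False
        # Duplication: skip events we have already let through.
        if event.event_id in self.seen_ids:
            return False
        self.seen_ids.add(event.event_id)
        return True
```

In this sketch the checks only gate the push path. A device that was offline can still catch up later by pulling.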

Architecture

Here’s a diagram of RENO.

We’ll go through all the components below.

At the top, you have Event Triggers.

These are from the various backend services that handle things like movie recommendations, profile changes, watch activity, etc.

Whenever there are any changes, an event is created. These events go to the Event Management Engine.

The Event Management Engine serves as a layer of indirection so that RENO has a single source of events.

From there, the events get passed down to Amazon SQS queues. These queues are sharded based on event priority.
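
As a rough illustration of sharding by event priority, here's a small Python sketch. The priority levels, the event type names and the in-memory `deque` objects (standing in for one SQS queue per priority) are assumptions for the example, not details from the Netflix post. The only priority the post implies is that a maturity level change should rank very high.

```python
from collections import deque
from enum import Enum


class Priority(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Hypothetical mapping from event type to priority.
EVENT_PRIORITY = {
    "maturity_level_change": Priority.HIGH,
    "membership_plan_change": Priority.MEDIUM,
    "viewing_activity": Priority.LOW,
}


class PriorityRouter:
    """Routes each event to the queue for its priority shard."""

    def __init__(self):
        # One in-memory queue per priority, standing in for one queue per shard.
        self.queues = {p: deque() for p in Priority}

    def route(self, event_type, payload):
        # Unknown event types fall back to the lowest priority in this sketch.
        priority = EVENT_PRIORITY.get(event_type, Priority.LOW)
        self.queues[priority].append((event_type, payload))
        return priority
```

Because each priority has its own queue, the consumers for the high priority queue can be scaled and tuned separately from the rest.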

AWS Instance Clusters will subscribe to the various queues and then process the events off those queues. They will generate actionable notifications for all the devices.

These notifications then get sent to Netflix’s outbound messaging system. This system handles delivery to all the various devices.

The notifications will also get sent to a Cassandra database. When devices need to pull notifications, they read them from the Cassandra database (remember, it’s a hybrid communication model of push and pull).
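
Here's a minimal Python sketch of the hybrid push and pull idea. It's only an illustration under some assumptions: the `HybridNotifier` class is hypothetical, a plain dictionary stands in for the Cassandra store, and a list stands in for the outbound messaging system.

```python
class HybridNotifier:
    """Persists every notification and pushes only to online devices."""

    def __init__(self):
        self.online = set()  # device IDs that are currently reachable
        self.store = {}      # device_id -> list of notifications (store stand-in)
        self.pushed = []     # (device_id, notification) pairs sent via push

    def notify(self, device_id, notification):
        # Always persist, so the device can pull the notification later.
        self.store.setdefault(device_id, []).append(notification)
        # Push immediately only if the device is online.
        if device_id in self.online:
            self.pushed.append((device_id, notification))

    def pull(self, device_id, since=0):
        # Return notifications the device has not seen yet.
        # `since` is the number of notifications it has already processed.
        return self.store.get(device_id, [])[since:]
```

A smart TV that was switched off misses the push, but it can call `pull` when it starts up and still get everything it missed.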

The RENO system has served Netflix well as they’ve scaled. It scales horizontally: events are sharded by priority, and more machines can be added to the processing cluster layer as load grows.

For more details, you can read the full blog post here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.

Solving Edge Computing Pain Points

A massive trend that’s happening right now is the rise of Edge Computing.

Consumers are buying smart devices, automakers are adding additional car sensors and factories are putting in more automation. With these changes, the amount of data that’s being generated and processed outside of traditional data centers is exploding.

How should you balance the amount of data that’s processed at the edge versus being processed in the cloud? How can you replicate data that’s generated at the edge into your cloud database with low latency and high throughput?

InfluxDB is at the forefront of these challenges and they’ve built a suite of products to easily solve Edge Data Replication workloads at massive scale.

They published a great technical document that delves into the rapid rise of edge computing, challenges that companies face with adopting this new paradigm and case studies on how to build reliable, scalable systems.

sponsored

Tech Snippets

How Large Language Models are Thinking - Murray Shanahan is a senior scientist at Google DeepMind and he wrote a great article talking about what Large Language Models are doing under the hood and how you can reason about them. There's been a huge amount of chatter recently with OpenAI's ChatGPT and it's leading some people to attribute human-like qualities to these models that aren't really there. Murray discusses this in detail.

Building a BitTorrent client from the ground up in Go - Peer to peer file sharing is an extremely efficient way to share large files across many different machines. This is an interesting post by Jesse Li on implementing the BitTorrent protocol and he walks through building a project in Go that can download the Debian ISO given their torrent file.

GraphQL at LinkedIn - LinkedIn had to develop an API that strategic partners could use to query the company's data around workforce hiring and labor markets. A challenge they had with the traditional REST paradigm was that they'd have to create a new REST API for each case, as different partners had different needs. To address this (and other concerns), engineers shifted over to using GraphQL. They wrote a great blog post on why they picked GraphQL and how they adopted it.

How to write better Git Commit Messages - This is a good article with some practical tips on how to write better commit messages. Some of the suggestions are
- Use imperative mood in the subject line. So, you should say something like Add fix for dark mode toggle state.
- Specify the type of commit. Have a consistent set of words to describe your changes. Is it a bugfix? Update? Refactor? etc.
- Be direct. Eliminate filler words and phrases in your commits. Don’t use words like though, maybe, I think, kind of.

System Design Articles

If you'd like weekly articles on System Design concepts (in addition to the Quastor summaries), consider upgrading to Quastor Pro!

Past articles cover topics like

Load Balancers - L4 vs. L7 load balancers and load balancing strategies like Round Robin, Least Connections, Hashing, etc.
Database Storage Engines - Different types of Storage Engines and the tradeoffs made. Plus a brief intro to MyISAM, InnoDB, LevelDB and RocksDB.
Log Structured Merge Trees - Log Structured Storage Engines and how the LSM Tree data structure works and the tradeoffs it makes.
Chaos Engineering - The principles behind Chaos Engineering and how to implement it. Plus case studies from Facebook, LinkedIn, Amazon, and more.
Backend Caching Strategies - Ways of caching data and their tradeoffs plus methods of doing cache eviction.
Row vs. Column Oriented Databases - How OLTP and OLAP databases store data on disk and different file formats for storing and transferring data.
API Paradigms - Request-Response APIs with REST and GraphQL vs. the Event Driven paradigm with Webhooks and Websockets.

It's $12 a month or $10 per month if you pay yearly. Thanks a ton for your support.

I'd highly recommend using your Learning & Development budget for Quastor Pro!

Interview Question

You are given a binary tree.

Each node in the tree has a next pointer that points to the next right node but the next pointers are currently set to Null.

Write a function that sets each next pointer to the correct node.

Here’s a link to the question in LeetCode.

Previous Question

As a reminder, here’s our last question

You are given a doubly-linked list where each node also has a child pointer.

The child pointer may or may not point to a separate doubly linked list.

These child lists may have one or more children of their own, and so on, to produce a multilevel data structure.

Flatten the list so that all the nodes appear in a single-level, doubly linked list.

Here’s the question in LeetCode.

Solution

We can solve this question recursively.

We’ll have a recursive function called _flatten that takes in a linked list and flattens it.

_flatten will then return the head and the tail (last node) of the flattened linked list.

We’ll iterate through our linked list and check if any of the nodes have a child linked list.

If a node does have a child linked list, then we’ll call _flatten on the child linked list.

Then, we’ll insert the flattened linked list inside of our current linked list after the current node.

After inserting it, we can continue iterating through the rest of our linked list checking for child nodes.

Here’s the Python 3 code.
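
The sketch below follows the steps described above. The `Node` class and the small `flatten` wrapper are assumptions added so the example is self-contained; `_flatten` is the recursive helper that returns the head and tail of the flattened list.

```python
class Node:
    def __init__(self, val, prev=None, next=None, child=None):
        self.val = val
        self.prev = prev
        self.next = next
        self.child = child


def _flatten(head):
    """Flatten the list starting at head and return (head, tail)."""
    curr = head
    tail = head
    while curr:
        nxt = curr.next  # remember the rest of the current level
        if curr.child:
            # Recursively flatten the child list.
            child_head, child_tail = _flatten(curr.child)
            # Insert the flattened child list right after curr.
            curr.next = child_head
            child_head.prev = curr
            curr.child = None
            # Reconnect the end of the child list to the rest of this level.
            child_tail.next = nxt
            if nxt:
                nxt.prev = child_tail
            tail = child_tail
        else:
            tail = curr
        # Continue with the rest of the original level.
        curr = nxt
    return head, tail


def flatten(head):
    """Return the head of the flattened list (None for an empty list)."""
    if head is None:
        return None
    return _flatten(head)[0]
```

Each node is visited once, so the running time is linear in the total number of nodes. The recursion depth equals the number of nested child levels.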