How Netflix Syncs Hundreds of Millions of Devices

Hey Everyone,

Today we’ll be talking about

  • How Netflix keeps their users’ devices synced for all of their 220 million users

    • Netflix’s Rapid Event Notification System (RENO) and the problem it solves

    • The design decisions behind RENO

    • The architecture of RENO

  • Plus some tech snippets on

    • Write a simple operating system - from scratch

    • Understanding CSS layout algorithms

    • How to onboard engineers

Questions? Please contact me at [email protected].


Netflix’s Rapid Event Notification System

Netflix is an online video streaming service that operates at insane scale. They have more than 220 million active users and account for more of the world's downstream internet traffic than YouTube (in 2018, Netflix accounted for ~15% of the world’s downstream traffic).

These 220 million active users are accessing their Netflix account from multiple devices, so Netflix engineers have to make sure that all the different clients that a user logs in from are synced.

You might start watching Breaking Bad on your iPhone and then switch over to your laptop. After you switch to your laptop, you expect Netflix to continue playback of the show exactly where you left off on your iPhone.

Syncing between all these devices for all of their users requires an immense amount of communication between Netflix’s backend and all the various clients (iOS, Android, smart TV, web browser, Roku, etc.). At peak, this traffic reaches about 150,000 events per second.

To handle this, Netflix built RENO, their Rapid Event Notification System.

Ankush Gulati and David Gevorkyan are two senior software engineers at Netflix, and they wrote a great blog post on the design decisions behind RENO.

Here’s a Summary

Netflix users access their accounts from many different devices.

Netflix engineers have to make sure that things like viewing activity, membership plan, movie recommendations, profile changes, etc. are synced between all these devices.

The company uses a microservices architecture for their backend, and built the RENO service to handle this task.

There were several key design decisions behind RENO

  1. Single Event Source - All the various events (viewing activity, recommendations, etc.) that RENO has to track come from different internal systems. To simplify this, engineers used an Event Management Engine that serves as a level of indirection. This Event Management Engine layer is the single source of events for RENO. All the events from the various backend services go to the Event Management Engine, from where they’re passed to RENO.

  2. Event Prioritization - If a user changes their child’s profile maturity level, that change should have a very high priority compared to other events. Therefore, each event type that RENO handles has a priority assigned to it, and RENO then shards by that event priority. This way, Netflix can tune system configuration and scaling policies differently for events based on their priority.

  3. Hybrid Communication Model - RENO has to support mobile devices, smart TVs, browsers, etc. While a mobile device is almost always connected to the internet and reachable, a smart TV is only online when in use. Therefore, RENO has to rely on a hybrid push AND pull communication model. The server tries to deliver all notifications to all devices immediately using push, and devices also pull from the backend at various stages of the application lifecycle. Solely using pull doesn’t work because it makes the mobile apps too chatty, and solely using push doesn’t work when a device is turned off.

  4. Targeted Delivery - RENO supports device-specific notification delivery. If a certain notification only needs to go to mobile apps, RENO can deliver solely to those devices. This limits the outgoing traffic footprint significantly.

  5. Managing High RPS - At peak times, RENO serves 150,000 events per second. This high load can put strain on the downstream services. Netflix handles this high load by adding various gate checks before sending an event. Some of the gate checks are:

    1. Staleness - Many events are time sensitive, so RENO will not send an event if it’s older than its staleness threshold.

    2. Online Devices - RENO keeps track of which devices are currently online using Zuul. It will only push events to a device if it’s online.

    3. Duplication - RENO checks for duplicate incoming events and deduplicates them before sending.
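To make the three gate checks concrete, here’s a minimal Python sketch of how they could be chained before a push. This is not Netflix’s code: the `Event` fields, the `GateChecker` class and the in-memory sets are all hypothetical stand-ins (in particular, a plain set replaces the Zuul-based online-device tracking mentioned above).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # All field names are illustrative assumptions, not RENO's schema.
    event_id: str
    device_id: str
    created_at: float             # seconds since epoch
    staleness_threshold_s: float  # max age before the event is dropped


class GateChecker:
    """Hypothetical gate: staleness, online-device and duplicate checks."""

    def __init__(self, online_devices: set[str]) -> None:
        # Stand-in for an online-device registry.
        self.online_devices = online_devices
        self.seen_event_ids: set[str] = set()

    def should_push(self, event: Event, now: float) -> bool:
        # 1. Staleness: skip events older than their threshold.
        if now - event.created_at > event.staleness_threshold_s:
            return False
        # 2. Online devices: only push to devices that are reachable.
        if event.device_id not in self.online_devices:
            return False
        # 3. Duplication: skip events that were already let through.
        if event.event_id in self.seen_event_ids:
            return False
        self.seen_event_ids.add(event.event_id)
        return True


checker = GateChecker(online_devices={"living-room-tv"})
fresh = Event("e1", "living-room-tv", created_at=100.0, staleness_threshold_s=30.0)
first_attempt = checker.should_push(fresh, now=110.0)
second_attempt = checker.should_push(fresh, now=110.0)  # same event again
```

Each check is cheap and runs before any downstream call, which is the point of a gate: events that fail never add load to the services behind it.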

Architecture

Here’s a diagram of RENO.

We’ll go through all the components below.

At the top, you have Event Triggers.

These are from the various backend services that handle things like movie recommendations, profile changes, watch activity, etc.

Whenever there are any changes, an event is created. These events go to the Event Management Engine.

The Event Management Engine serves as a layer of indirection so that RENO has a single source of events.

From there, the events get passed down to Amazon SQS queues. These queues are sharded based on event priority.
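As a rough illustration of sharding by priority, the sketch below routes each event type to one of several in-memory queues. Everything here is assumed for the example: the three priority levels, the event-type names and the `PriorityShardedQueues` class are hypothetical, and `deque` objects stand in for the real SQS queues.

```python
from collections import deque
from enum import Enum


class Priority(Enum):
    # The number and names of priority levels are assumptions.
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Hypothetical mapping from event type to priority.
EVENT_PRIORITY = {
    "profile_maturity_change": Priority.HIGH,
    "viewing_activity": Priority.MEDIUM,
    "recommendations_refresh": Priority.LOW,
}


class PriorityShardedQueues:
    """One queue per priority, so each shard can be consumed and scaled separately."""

    def __init__(self):
        self.queues = {priority: deque() for priority in Priority}

    def enqueue(self, event_type, payload):
        # Unknown event types fall back to the lowest priority (an assumption).
        priority = EVENT_PRIORITY.get(event_type, Priority.LOW)
        self.queues[priority].append((event_type, payload))
        return priority

    def poll(self, priority):
        # Return the oldest event in the given shard, or None if it is empty.
        queue = self.queues[priority]
        return queue.popleft() if queue else None


shards = PriorityShardedQueues()
shards.enqueue("viewing_activity", {"profile": "p1", "position_s": 1260})
shards.enqueue("profile_maturity_change", {"profile": "kid", "level": "PG"})
urgent = shards.poll(Priority.HIGH)
```

Because each priority has its own queue, a backlog of low-priority events can’t delay a high-priority one, and each shard’s consumers can be tuned independently.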

AWS Instance Clusters will subscribe to the various queues and then process the events off those queues. They will generate actionable notifications for all the devices.

These notifications then get sent to Netflix’s outbound messaging system. This system handles delivery to all the various devices.

The notifications will also get stored in a Cassandra database. When devices need to pull notifications, they can read them from the Cassandra database (remember it’s a Hybrid Communication Model of push and pull).
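Here’s a small sketch of how the push and pull paths could fit together. It’s only a model under stated assumptions: `NotificationStore` is a hypothetical in-memory stand-in for the Cassandra database, the `pushed` list stands in for the outbound messaging system, and the cursor-based `pull` is an invented interface.

```python
from collections import defaultdict


class NotificationStore:
    """In-memory stand-in for the persistent notification store."""

    def __init__(self):
        self._by_device = defaultdict(list)

    def save(self, device_id, notification):
        self._by_device[device_id].append(notification)

    def fetch_since(self, device_id, cursor):
        # Return every notification saved at or after position `cursor`.
        return list(self._by_device[device_id][cursor:])


class HybridDelivery:
    """Always persist a notification; additionally push it if the device is online."""

    def __init__(self, store, online_devices):
        self.store = store
        self.online_devices = online_devices
        self.pushed = []  # stand-in for handing off to an outbound messaging system

    def deliver(self, device_id, notification):
        self.store.save(device_id, notification)  # pull path always works
        if device_id in self.online_devices:
            self.pushed.append((device_id, notification))  # push path
            return True
        return False

    def pull(self, device_id, cursor=0):
        # Called by a device, e.g. when the app starts up.
        return self.store.fetch_since(device_id, cursor)


delivery = HybridDelivery(NotificationStore(), online_devices={"phone"})
pushed_to_phone = delivery.deliver("phone", "resume Breaking Bad at 21:00")
pushed_to_tv = delivery.deliver("tv", "resume Breaking Bad at 21:00")  # TV is off
tv_catch_up = delivery.pull("tv")  # later, when the TV app launches
```

The phone gets the update right away through push, while the TV misses the push but still catches up the next time it pulls.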

The RENO system has served Netflix well as they’ve scaled. It is horizontally scalable because events are sharded by priority and more machines can be added to the processing cluster layer.

For more details, you can read the full blog post here.

Quastor is a free Software Engineering newsletter that sends out deep dives on interesting tech, summaries of technical blog posts, and FAANG interview questions and solutions.

Tech Snippets

  • Writing a Simple Operating System — from Scratch - This is an awesome (free) book that goes through how to write a basic operating system from scratch. It covers how a computer boots, how to write low-level programs when there is no OS, how to create basic OS services like device drivers, file systems, multi-tasking, etc.

  • Understanding Layout Algorithms - Josh Comeau is the creator of the course CSS for JS devs. He wrote a great blog post that goes through the various layout algorithms that CSS uses under the hood, like flexbox, grid, table, etc. It goes through the intricacies of these algorithms and explains why CSS can seem unpredictable.

  • The Ultimate Guide to Onboarding Software Engineers - Leadership Garden is a great blog that helps engineers become better leaders. This post gives a 13-step onboarding framework for helping new developers succeed at your company. The framework is divided into 30-day, 60-day, 90-day and 150-day milestones.

Interview Question

You are given the head of a linked list.

You are also given an integer n.

Remove the nth node from the end of the linked list and return its head.

Example

Input: head = [1, 2, 3, 4, 5], n = 2

Output: [1, 2, 3, 5]

Previous Question

As a refresher, here’s the previous question

Write a function that checks whether an integer is a palindrome.

For example, 191 is a palindrome, as well as 111.

123 is not a palindrome.

Do not convert the integer into a string.

Solution

We’ll first create a function that gets the ith digit of an integer given an integer and i.

The function does this by integer-dividing the number by 10^i, where i counts digit positions from the right, starting at 0 for the ones digit.

This removes all the digits to the right of i.

Then, we can remove the digits to the left of i by taking the result % 10.

In our main function, we’ll have two pointers i and j that start at opposite ends of the number and move toward each other, covering all the digits.

They will use the getIthDigit function to get the specific digit and will return False if there are any mismatching digits.

After iterating through the entire number, we can return True.
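Here’s one way the steps above could look in Python. Treat it as a sketch: `get_ith_digit` plays the role of getIthDigit, while the `count_digits` helper and the choice to return False for negative numbers are assumptions that the description above doesn’t spell out.

```python
def get_ith_digit(num: int, i: int) -> int:
    """Return the digit of a non-negative num at position i (0 = ones digit)."""
    # Integer division drops the digits to the right of position i;
    # % 10 then drops the digits to the left of it.
    return (num // 10 ** i) % 10


def count_digits(num: int) -> int:
    """Count the decimal digits of a non-negative integer without using strings."""
    count = 1
    while num >= 10:
        num //= 10
        count += 1
    return count


def is_palindrome(num: int) -> bool:
    # Assumption: negative numbers are not treated as palindromes.
    if num < 0:
        return False
    i, j = 0, count_digits(num) - 1  # rightmost and leftmost digit positions
    while i < j:
        if get_ith_digit(num, i) != get_ith_digit(num, j):
            return False
        i += 1
        j -= 1
    return True


results = [is_palindrome(n) for n in (191, 111, 123)]
```

For the examples from the question, 191 and 111 pass every comparison, while 123 fails on the first one (3 vs. 1).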

What’s the time and space complexity of our answer?