How LinkedIn uses Event-Driven Architectures to Scale
An introduction to EDAs and the Actor Model. Plus, how to ship projects at big tech companies, how Coinbase uses ML to predict traffic patterns, and more.
Hey Everyone!
Today we’ll be talking about
How LinkedIn uses Event-Driven Architectures to Scale their Infrastructure
An Introduction to the Actor Model and how it works
How LinkedIn collects and processes server metrics from their fleet with an Event-Driven Architecture
LinkedIn’s monitoring system for server consoles
Tech Snippets
How I Ship Projects at Big Tech Companies
How Binary Vector Embeddings work and why they’re so useful
How Coinbase uses ML to Predict Traffic and Auto-scale Databases
How LinkedIn uses Design Patterns to Scale their Infrastructure
LinkedIn is the largest professional social networking platform in the world with over 950 million users in 200+ countries.
To serve this user base, they maintain dozens of data centers around the world with hundreds of thousands of servers globally.
To manage these servers, LinkedIn makes use of many tried-and-tested design patterns.
One pattern is the Producer-Consumer pattern, commonly used in event-driven architectures (EDAs).
This pattern consists of three main components (a minimal sketch follows the list):
Producer - generates events/messages (server metrics, status updates, data from queries, etc.)
Queue - acts as a buffer to store messages until they’re ready to be processed. LinkedIn uses Redis, Kafka, or built-in queues for this.
Consumer - reads and processes messages from the queue
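To make this concrete, here’s a minimal in-process sketch in Python. It uses the standard library’s queue.Queue as a stand-in for Redis or Kafka, and the metric payloads are invented for illustration:

```python
import queue
import threading

# The queue buffers messages between producer and consumer --
# the role Redis or Kafka plays at LinkedIn's scale.
message_queue = queue.Queue()

def producer():
    # Producer: generates events (here, fake server metrics).
    for i in range(5):
        event = {"server": f"host-{i}", "cpu_percent": 40 + i}
        message_queue.put(event)
    message_queue.put(None)  # sentinel to signal "no more messages"

def consumer():
    # Consumer: reads and processes messages from the queue.
    while True:
        event = message_queue.get()
        if event is None:
            break
        print(f"processing metrics from {event['server']}: {event}")

threading.Thread(target=producer).start()
consumer()
```

The key benefit is decoupling: the producer and consumer don’t need to know about each other or run at the same speed, since the queue absorbs bursts.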
Saira Khanum, a Staff Software Engineer at LinkedIn, wrote a fantastic blog post delving into how the engineering team uses this pattern in three different systems:
To Collect and Maintain Data from Servers for Real-time and Analytical Queries
To Check Servers for Availability and Accessibility
To Detect and Fix any Access Policy Violations on the Servers
We’ll explore these and talk about how LinkedIn implemented them.
Actor Pattern
When building event-driven architectures, LinkedIn frequently uses the Actor Pattern. EDAs are loosely defined, and the Actor Pattern (also known as the Actor Model) is one specific way of implementing them.
With this model, everything is represented as an actor.
An actor is an independent entity that can (see the sketch after this list):
Send messages to other actors
Process messages/requests
Create new actors and designate their behavior
Have independent state
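Here’s a rough, framework-free sketch of what an actor could look like in Python. This is illustrative, not LinkedIn’s implementation:

```python
import queue
import threading

class Actor:
    """A bare-bones actor: private state, a mailbox, and a loop
    that processes one message at a time."""

    def __init__(self):
        self.state = {}               # independent state, never shared
        self.mailbox = queue.Queue()  # incoming messages
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            self.receive(self.mailbox.get())

    def send(self, other, message):
        # Actors interact only via message passing, never by
        # reading or writing each other's state directly.
        other.mailbox.put(message)

    def spawn(self, actor_class, *args):
        # Actors can create new actors and designate their behavior
        # by choosing which subclass to instantiate.
        return actor_class(*args)

    def receive(self, message):
        raise NotImplementedError  # subclasses define behavior
```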
To give you a better sense of how this might work, here’s a hypothetical example of an Actor model at Uber for handling ride requests.
When a user first requests a ride, a RequestActor is created specifically for their request. This actor maintains the state of the request (whether it’s active or canceled) and coordinates the entire matching process.
The RequestActor might first create a child PricingActor to figure out a reasonable price for the request based on the trip distance and time of day. The PricingActor will run internal logic based on the RequestActor’s message and return the ride price.
Once it has the pricing figured out, the RequestActor will communicate with nearby DriverActors (one actor per active driver on Uber) by sending them ride offer messages.
The DriverActor then handles notifying the Uber driver that someone is looking for a ride. If the driver accepts the ride, the DriverActor might create a new TripActor to handle the ongoing trip (tracking location updates, route changes, payment processing, etc.).
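Sticking with the hypothetical example, here’s how the happy path might look in code. The actor names come from the scenario above, the pricing and matching logic is made up, and message passing is collapsed into direct method calls to keep it short:

```python
class PricingActor:
    # Child actor: computes a price from trip distance and time of day.
    def receive(self, message):
        base_fare, per_km = 2.50, 1.20
        surge = 1.5 if message["hour"] in (8, 9, 17, 18) else 1.0
        return round((base_fare + per_km * message["distance_km"]) * surge, 2)

class DriverActor:
    # One actor per active driver; handles ride offer messages.
    def __init__(self, driver_id):
        self.driver_id = driver_id

    def receive(self, offer):
        print(f"notifying driver {self.driver_id}: ride for ${offer['price']}")
        return True  # pretend the driver accepts

class RequestActor:
    # Created per ride request; holds the request state and
    # coordinates the matching process.
    def __init__(self, distance_km, hour):
        self.state = {"status": "active"}
        self.distance_km, self.hour = distance_km, hour

    def run(self, nearby_drivers):
        pricing = PricingActor()  # create a child actor for pricing
        price = pricing.receive({"distance_km": self.distance_km,
                                 "hour": self.hour})
        for driver in nearby_drivers:  # send ride offer messages
            if driver.receive({"price": price}):
                self.state["status"] = "matched"
                return driver  # a TripActor could be created here
        self.state["status"] = "unmatched"
        return None

request = RequestActor(distance_km=8.0, hour=17)
matched_driver = request.run([DriverActor("driver-1"), DriverActor("driver-2")])
```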
If you’re looking for more details, here’s a fantastic article that delves deeper into the Actor model.
Back to LinkedIn…
Event-Driven Architectures at LinkedIn
LinkedIn talks about a few systems where they’ve found EDAs useful for managing infrastructure.
Distributed Server Queries at LinkedIn
The first system is LinkedIn’s distributed server query system. This is responsible for collecting system facts (CPU/memory usage, network connections, disk space usage, etc.) from across the server fleet and storing them so they can be queried and analyzed.
Some of the requirements are:
Scale - the system needs to process terabytes of data from hundreds of thousands of servers in near real-time
Data Refresh - the data needs to be collected several times every hour
Data Maintenance - the last known good snapshot of system facts needs to be maintained for a defined retention period (after the retention period is over, the system facts are marked as stale)
Here’s the high-level architecture of the system (a code sketch follows the list):
Agents (producers) are deployed across the server fleet to collect system facts
These facts are sent to worker processes (using the Actor Pattern) and stored in Redis
Different worker processes consume the data from Redis, process it and store it in different datastores
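As a rough sketch of the agent/worker split (not LinkedIn’s actual code), here’s what the producer and consumer sides could look like. It assumes a local Redis instance with the redis-py client and uses psutil as a hypothetical way to collect system facts; the key names and retention window are invented:

```python
import json
import socket
import time

import psutil  # hypothetical choice for gathering system facts
import redis

r = redis.Redis(host="localhost", port=6379)
RETENTION_SECONDS = 3600  # invented retention period

def agent_collect_and_push():
    """Producer: runs on each server and pushes system facts
    onto a Redis list acting as the queue."""
    facts = {
        "host": socket.gethostname(),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
        "collected_at": time.time(),
    }
    r.lpush("system_facts_queue", json.dumps(facts))

def worker_consume():
    """Consumer: pops facts off the queue and stores the last known
    good snapshot per host. The TTL stands in for the retention rule --
    once it lapses, the snapshot is treated as stale."""
    while True:
        _, payload = r.brpop("system_facts_queue")  # blocks until data arrives
        facts = json.loads(payload)
        r.set(f"facts:{facts['host']}", payload, ex=RETENTION_SECONDS)
```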
Some of the choices LinkedIn made were:
Redis - LinkedIn picked Redis as the queue since they were looking for low latency. The messages are short-lived, and a heavier tool like Kafka would add too much overhead.
Actor Pattern - Workers that collect and process server metrics use the Actor pattern. They’re implemented with Gunicorn.
Server Console Monitoring
The second system is LinkedIn’s distributed system to monitor the server console for their infrastructure. Server consoles (often called service processors) allow administrators to manage and monitor physical servers remotely (even when the server is powered off or unresponsive). They’re essential for troubleshooting, rebooting and maintaining servers.
LinkedIn’s monitoring system checks that these server management consoles are available and accessible.
Here’s the architecture for how they do that (sketched in code after the list).
Satellite servers run checks across the servers in the data center. Each check is handled by a separate actor.
Messages from each check are passed through RabbitMQ. The result of each check determines whether the next check should be run (i.e., whether the next actor should be created).
Final results are sent to Kafka. Consumer applications can read results for storage/analysis from the various Kafka topics.
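Here’s a small in-process sketch of that chaining idea. Plain functions stand in for the actors, and print statements stand in for RabbitMQ and Kafka; the check names are invented for illustration:

```python
def check_ping(console):
    # First check: is the console reachable at all?
    print(f"[check] pinging {console}")
    return True  # pretend it responded

def check_login(console):
    # Second check: can we authenticate against the console?
    print(f"[check] logging into {console}")
    return True

def check_health(console):
    # Final check: query the console's health status.
    print(f"[check] health status of {console}")
    return {"console": console, "status": "healthy"}

# Each check is an actor; a check's result decides whether the next
# actor in the chain gets created. In LinkedIn's system these results
# flow through RabbitMQ, and the final result is published to Kafka.
CHAIN = [check_ping, check_login, check_health]

def publish_to_kafka(result):
    # Stand-in for a Kafka producer; consumer applications would
    # read this from a topic for storage and analysis.
    print(f"[kafka] {result}")

def run_checks(console):
    result = None
    for check in CHAIN:
        result = check(console)  # "create" the next actor in the chain
        if not result:           # a failed check stops the chain early
            break
    publish_to_kafka(result)     # forward the final result

run_checks("console-01.dc1.example.com")
```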
Some of the tech choices LinkedIn made were:
Actor Pattern - Each check that LinkedIn runs is handled by its own actor. The checks are done sequentially, so the actors pass messages to each other to share results and status updates.
Kafka and RabbitMQ - RabbitMQ is used for communication between the actors, whereas Kafka is used for forwarding the final results to the consumer applications for further processing and storage.