How PayPal Scaled Kafka to 1.3 Trillion Messages a Day

Plus, how Ryan went from an L3 at Facebook to an L6 in under 3 years (tripling his compensation)

Hey Everyone!

Today we'll be talking about

  • How PayPal Scaled Kafka to 1.3 Trillion Messages a Day

    • PayPal relies heavily on Kafka for synchronizing databases, collecting metrics/logs, batch processing tasks and much more

    • A brief introduction to Kafka and its different components (producers, consumers, brokers)

    • How PayPal improved Kafka cluster management with a config service, custom libraries, access control lists and more

    • Collecting metrics on the Kafka cluster and creating standard operating procedures for when things go wrong

  • Speed-Running Big Tech Promotions

    • Ryan Peterman is a Staff Engineer at Meta (L6). He managed to go from an L3 to an L6 in just 3 years (and triple his compensation)

    • Ryan wrote a series of blog posts delving into how he did this and the process he followed.

    • Going from L3 to L4 by shipping features independently

    • Going from L4 to L5 by finding senior-engineer scope and mentoring other engineers

    • Going from L5 to L6 by having impact beyond your team and finding projects with enough scope

  • Tech Snippets

    • How Instagram Scaled to 14 Million Users with just 3 Engineers

    • How Figma overhauled their Performance Testing Framework

    • Getting Paid in Cash vs. Equity and how to unbundle them

    • Cache Eviction Strategies - FIFO vs. LRU

    • Getting a job as an engineering executive

How PayPal Scaled Kafka to 1.3 Trillion Messages a Day

Apache Kafka is an open source platform for moving messages between different components in your backend. These messages (also called events or records) typically signify some kind of action that happened.

A few benefits of Kafka include

  • Distributed - Kafka clusters can scale to hundreds of nodes and process millions of events per second.

  • Configurable - it’s highly configurable so you can tune it to meet your requirements. Maybe you want super low latency and don’t mind if some messages are dropped. Or perhaps you need messages to be delivered exactly-once and always acknowledged by the recipient (see the sketch after this list).

  • Fault Tolerant - Kafka can be configured to be extremely durable (store messages on disk and replicate them across multiple nodes) so you don’t have to worry about messages being lost.
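
To make the configurability point concrete, here’s a minimal sketch using the confluent-kafka Python client (the broker address is invented) of two producers at opposite ends of the latency vs. durability spectrum:

```python
from confluent_kafka import Producer  # pip install confluent-kafka

# Tuned for low latency: fire-and-forget. Messages can be lost if a
# broker dies before replicating them.
low_latency = Producer({
    "bootstrap.servers": "broker1:9092",  # hypothetical address
    "acks": 0,        # don't wait for any broker acknowledgement
    "linger.ms": 0,   # send immediately instead of batching
})

# Tuned for durability: every message must be acknowledged by all
# in-sync replicas, and the idempotent producer de-dupes retries
# (a building block for exactly-once delivery).
durable = Producer({
    "bootstrap.servers": "broker1:9092",
    "acks": "all",
    "enable.idempotence": True,
})
```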

PayPal adopted Kafka in 2015 and they’ve been using it for synchronizing databases, collecting metrics/logs, batch processing tasks and much more. Their Kafka fleet consists of over 85 clusters that process over 100 billion messages per day.

Their Kafka volumes have previously peaked at over 1.3 trillion messages in a single day (21 million messages per second during Black Friday).

Monish Koppa is a software engineer at PayPal and he published a fantastic blog post on the PayPal Engineering blog delving into how they were able to scale Kafka to this level.

We’ll start with a brief overview of Kafka (skip over this if you’re already familiar) and then we’ll talk about what PayPal engineers did.

Brief Overview of Kafka

In the late 2000s, LinkedIn needed a highly scalable, reliable way to transfer user-activity and monitoring data through their backend.

Their goals were

  • High Throughput - the system should allow for easy horizontal scaling so it can handle a high number of messages

  • Reliable - it should provide delivery and durability guarantees. If the system goes down, then data should not be lost

  • Format Agnostic - messages are just bytes to Kafka, so you should be able to transmit them in JSON, XML, Avro, etc.

  • Decoupling - Multiple backend servers should be able to push messages to the system. You should be able to have multiple consumers read a single message.

Jay Kreps and his team tackled this problem and they built out Kafka.

Fun fact - three of the original LinkedIn employees who worked on Kafka became billionaires from the Confluent IPO in 2021 (with some help from JPow).

How Kafka Works

A Kafka system consists of three main components

  • Producers

  • Consumers

  • Kafka Brokers

Producers

Producers are any backend services that publish events to Kafka.

Examples include

  • Payment service publishing an event to Kafka whenever a payment succeeds or fails

  • Fraud detection service publishing an event whenever suspicious activity is detected

  • Customer support service publishing a message whenever a user submits a support ticket

Messages have a timestamp, an optional key, a value and optional metadata headers. The value can be any format, so you can use XML, JSON, Avro, etc. Typically, you’ll want to keep message sizes small (1 MB is the default max size).

Messages in Kafka are stored in topics, where a topic is similar to a folder in a filesystem. If you’re using Kafka to store user activity logs, you might have different topics for clicks, purchases, add-to-carts, etc.

For scaling horizontally (across many machines), each topic can be split onto multiple partitions. The partitions for a topic can be stored across multiple Kafka broker nodes and you can also have replicas of each partition for redundancy.

When a producer sends a message to a Kafka broker, it also needs to specify the topic, and it can optionally include the specific topic partition (by default, the message’s key determines which partition it lands on).
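
As an illustration of the producer side, here’s a minimal sketch using the confluent-kafka Python client; the topic name, payload and broker address are invented for this example:

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker1:9092"})  # hypothetical address

def on_delivery(err, msg):
    # Runs once the broker acknowledges (or rejects) the message.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()}[{msg.partition()}] at offset {msg.offset()}")

# Messages with the same key hash to the same partition, so all
# events for one payment stay in order.
producer.produce(
    "payments",  # topic (invented name)
    key="payment-123",
    value=json.dumps({"status": "succeeded", "amount_usd": 42.50}),
    callback=on_delivery,
)
producer.flush()  # block until outstanding messages are delivered
```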

Consumers

Consumers subscribe to messages from Kafka topics. A consumer can subscribe to one or more topic partitions, but within a consumer group each partition can only send data to a single consumer. That consumer is considered the owner of the partition.

To have multiple consumers consuming from the same partition, they need to be in different consumer groups.

Consumers work on a pull model, where they poll the broker to check for new messages.

Kafka will maintain offsets so it can keep track of which messages each consumer has read. If a consumer wants to replay previous messages, it can just reset its offset to an earlier number.
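
Here’s the consumer side as a sketch (the group, topic and broker names are invented); note the pull-based poll() loop and the explicit offset commit:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker1:9092",  # hypothetical address
    "group.id": "fraud-detection",        # consumers in one group split the partitions
    "auto.offset.reset": "earliest",      # where a brand-new group starts reading
    "enable.auto.commit": False,          # we'll commit offsets ourselves
})
consumer.subscribe(["payments"])

try:
    while True:
        msg = consumer.poll(1.0)  # pull model: ask the broker for new messages
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(f"{msg.topic()}[{msg.partition()}] @ {msg.offset()}: {msg.value()}")
        consumer.commit(msg)  # record progress; replaying = seeking to an earlier offset
finally:
    consumer.close()
```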

How PayPal Scaled Kafka to 1.3 Trillion Messages a Day

PayPal has over 1,500 brokers that host over 20,000 topics. These Kafka clusters are spread across the world and they use MirrorMaker (a tool for mirroring data between Kafka clusters) to mirror the data across data centers.

Each Kafka cluster consists of

  • Producers/Consumers - PayPal uses Java, Python, Node, Golang and more. The engineering team built custom Kafka client libraries for all of these languages that the company can use to send/receive messages.

  • Brokers - Kafka servers that are taking messages from producers, storing them on disk and sending them to consumers.

  • ZooKeeper Servers - a distributed coordination service that’s used for storing configuration info, synchronizing the brokers, handling leader elections and more.

    Having to maintain ZooKeeper servers was actually a big criticism of Kafka (it’s too complicated to maintain), so KRaft is now the default method of maintaining consensus in your Kafka cluster (PayPal is still using ZK though).

To manage their growing fleet, PayPal invested in several key areas

  • Cluster Management

  • Monitoring and Alerting

  • Enhancements and Automations

We’ll delve into each.

Cluster Management

As mentioned, PayPal has over 85 Kafka clusters, each with many broker and ZooKeeper nodes. Producer and consumer services interact with these clusters to send/receive messages.

To improve cluster management, PayPal introduced several improvements

  • Kafka Config Service - Previously, clients would hardcode the broker IPs that they would connect to. If a broker needed to be replaced, then this would break the clients. As PayPal added more brokers and clusters, this became a maintenance nightmare.

    PayPal built a Kafka config service to keep track of all these details (and anything else a client might need to connect to the cluster). Whenever a new client connects to the cluster (or if there are any changes), the config service will push out the standard configuration details.

  • Access Control Lists - In the past, PayPal’s Kafka setup allowed any application to subscribe to any of the existing topics. PayPal is a financial services company, so this was an operational risk.

    To solve this, they introduced ACLs, so applications must authenticate and be authorized in order to gain access to a Kafka cluster and topic.

  • PayPal Kafka Libraries - In order to integrate the config service, ACLs and other improvements, PayPal built out their own client libraries for Kafka. Producer and consumer services can use these client libraries to connect to the clusters.

    The client libraries also contain monitoring and security features. They can publish critical metrics for client applications using Micrometer and also handle authentication by pulling required certificates and tokens during startup (a rough sketch of this pattern follows this list).

  • QA Environment - Previously, developers would randomly create topics on the production environment that they’d use for testing. This caused issues when rolling out changes to prod, as the new cluster/topic details could be different from those used for testing.

    To solve this, PayPal built a separate QA platform for Kafka that has a one-to-one mapping between production and QA clusters (along with the same security and ACL rules).
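
The blog post doesn’t show the internals of these client libraries, but the core pattern is a thin wrapper that fetches connection and security details from the config service at startup instead of hardcoding broker IPs. Here’s a rough sketch; the config service URL, endpoint and response fields are hypothetical, not PayPal’s actual API:

```python
import requests
from confluent_kafka import Producer

CONFIG_SERVICE_URL = "https://kafka-config.internal.example.com"  # hypothetical

def make_producer(cluster: str, client_id: str) -> Producer:
    """Build a producer from centrally managed config instead of hardcoded broker IPs."""
    resp = requests.get(f"{CONFIG_SERVICE_URL}/clusters/{cluster}")
    resp.raise_for_status()
    cfg = resp.json()  # assumed shape: {"bootstrap_servers": "...", "ssl_ca": "..."}

    return Producer({
        "bootstrap.servers": cfg["bootstrap_servers"],
        "client.id": client_id,
        # Certificates are pulled at startup so that ACL-protected
        # clusters reject unauthenticated clients.
        "security.protocol": "SSL",
        "ssl.ca.location": cfg["ssl_ca"],
    })

producer = make_producer("payments-cluster", "payment-service")
```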

Monitoring and Alerting

To improve reliability, PayPal collects a ton of metrics on how the Kafka clusters are performing. Their brokers, ZooKeeper instances, clients, MirrorMakers, etc. all send key metrics regarding the health of the application and the underlying machines.

Each of these metrics has an associated alert that gets triggered whenever abnormal thresholds are met. They also have clear standard operating procedures for every triggered alert so it’s easier for the SRE team to resolve the issue.
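
The post doesn’t enumerate the exact metrics, but consumer lag is a canonical example of the kind of health signal these alerts watch. A sketch, with the threshold and all names invented:

```python
from confluent_kafka import Consumer, TopicPartition

LAG_ALERT_THRESHOLD = 100_000  # invented threshold

consumer = Consumer({
    "bootstrap.servers": "broker1:9092",  # hypothetical address
    "group.id": "fraud-detection",
})
tp = TopicPartition("payments", 0)

# Lag = newest offset on the broker minus the group's committed offset.
_low, high_watermark = consumer.get_watermark_offsets(tp)
committed = consumer.committed([tp])[0].offset
lag = high_watermark - committed

if lag > LAG_ALERT_THRESHOLD:
    # In production this would page the on-call SRE, who would follow
    # the standard operating procedure for this alert.
    print(f"ALERT: consumer lag {lag} on {tp.topic}[{tp.partition}]")
```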

Topic Onboarding

In order to create new topics, application teams are required to submit a request via a dashboard. PayPal’s Kafka team reviews the request and makes sure the cluster has the capacity for it.

Then, they handle things like authentication rules and Access Control Lists for that topic to ensure proper security procedures.
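
Once a request is approved, the topic itself can be created with Kafka’s admin API. A minimal sketch with invented names and sizing:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker1:9092"})  # hypothetical address

# Partition and replica counts would come out of the capacity review.
topic = NewTopic("payments", num_partitions=12, replication_factor=3)

# create_topics returns {topic_name: future}; wait on each to confirm.
for name, future in admin.create_topics([topic]).items():
    try:
        future.result()  # raises on failure (e.g. topic already exists)
        print(f"Created topic {name}")
    except Exception as e:
        print(f"Failed to create {name}: {e}")
```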

This is an overview of some of the changes PayPal engineers made. For the full list of changes and details, check out the article here.

Tech Snippets

Speedrunning Guide for Big Tech Promotions

Ryan Peterman is a Staff Software Engineer at Instagram. He progressed extremely quickly: he joined the company as a new grad (L3) and became a Staff Engineer (L6) in just three years (according to levels.fyi, the average compensation for an L3 at Facebook is $194,000 while an L6 makes $614,000).

He posts a ton of career advice on LinkedIn, so I’d highly recommend checking him out.

He published a fantastic guide on how to speed-run promotions on his Substack, so we’ll be summarizing the guide and delving into his advice.

Speedrunning Guide: Junior (L3) -> Mid-Level (L4) dev

Big Tech companies want to see at least 6 months of consistent performance at the next level before they promote you. Your manager is the one who builds the case and advocates for your promotion, so make sure they know you’re interested in getting promoted. They can help you find L4 projects and understand what they’re expecting.

The main difference between L3 and L4 is the size of the scope an engineer can handle independently.

  • L3 - Can handle individual tasks (< 2 weeks of work) with minimal guidance

  • L4 - Can handle medium-to-large features (< 2 months of work) with minimal guidance

Ideally, you want to spend time on work that is impactful to get promoted to L4 quickly. However, this can be hard to judge when you’re a junior engineer.

Instead, adopt the mindset that the more code you ship, the more likely it is that some part of your work is extremely impactful.

Optimize for dev velocity, where you’re shipping code quickly so you can get feedback ASAP. It’s very important to avoid wasting time being blocked.

Speedrunning Guide: Mid-Level (L4) -> Senior (L5) dev

The L4 → L5 gap is bigger than the L3 → L4 gap. Becoming a senior engineer requires significant behavioral changes: you shift from just writing code to leading and influencing teams.

Ryan has a great list where he describes L4 duties vs. L5 duties to show the changes.

The key to getting promoted quickly is to find L5 scope ASAP. Your manager can help you out with this if they know your career goals.

You should also focus on team leadership and influence, as these are important for reaching L5. Find opportunities to lead workstreams, and look for chances to uplift and mentor others.

Ryan mentored several employees and led another team to revamp the Instagram video ads pipeline.

This project had a massive impact and Ryan’s leadership and mentorship helped him get promoted.

Speedrunning Guide: Senior (L5) -> Staff (L6) dev

Staff Engineers are at the same level as Engineering Managers. The biggest mindset shift is that you need to work through delegation and influence across teams instead of writing code yourself.

Senior Engineers will build roadmaps of several medium-to-large features where the problem and business impact are clear. Staff Engineers handle more ambiguity and need to create scope by finding impactful opportunities and problems.

If you want to get promoted to L6, you’ll need to find Staff scope. Ryan wrote a great blog on this here. One way is to ask your manager (or skip manager) about their top of mind problems and opportunities. Another method is to come up with an impactful project yourself and convince your manager (and those around them) that this is valuable to the company’s goals.

You also need to look for opportunities to uplift and mentor others. It’s especially helpful if you can mentor L5 peers and help them grow.