Tech Dive on Apache Kafka
A long-form tech dive into Kafka: its design goals, components, messaging patterns and more. Plus, how to avoid pitfalls when choosing your tech stack, how to scale a large codebase and more!
Hey Everyone!
Today we’ll be talking about
Apache Kafka Tech Dive - We’ve talked about Kafka a ton in past Quastor summaries, so this article will be a long-form tech dive into Kafka and its features
Message Queues vs. Publish/Subscribe
Tooling in the Message Queue and Pub/Sub space
Origins of Kafka and original design goals at LinkedIn
Purpose of Producers, Brokers and Consumers
Kafka Messaging Patterns
The Kafka Ecosystem
Criticisms of Kafka and their solutions
Scaling Kafka at Big Tech companies
How Faire Scaled Their Engineering Team to Hundreds of Developers
Hiring developers who are customer focused
Implementing code review and testing early on
Tracking KPIs around developer velocity
Factors that make a good KPI
Tech Snippets
The New Edition of Eloquent JavaScript (awesome, free book to learn JS)
How to Avoid Pitfalls when choosing your Tech Stack
The Basics of AI-Powered Vector Search
How to Scale a Large Codebase
How (not) to Apply for a Software Job
How Engineering Leaders Stay Up to Date
Tech Dive - Apache Kafka
With the rise of Web 2.0 (websites like Facebook, Twitter, YouTube and LinkedIn with user-generated content), engineers needed to build highly scalable backends that could process petabytes of data.
To handle this, asynchronous communication has become an essential design pattern. You have backend services (producers) that create data objects and other backend services (consumers) that use those objects. With an asynchronous system, you add a buffer in between that stores the objects generated by the producers until the downstream consumers can process them.
Tools that help you with this are message queues like RabbitMQ and event streaming platforms like Apache Kafka. The latter has grown extremely popular over the last decade.
In this article, we’ll be delving into Kafka. We’ll talk about its architecture, usage patterns, criticisms, ecosystem and more.
Message Queue vs. Publish/Subscribe
With asynchronous communication, you need some kind of buffer to sit in-between the producers and the consumers. Two commonly used types of buffers are message queues and publish-subscribe systems. We’ll delve into both and talk about their differences.
A traditional message queue sits between two components: the producer and the consumer.
The producer produces events at a certain rate and the consumer processes these events at a different rate. The mismatch in production/processing rates means we need a buffer in-between these two to prevent the consumer from being overwhelmed. The message queue serves this purpose.
With a message queue, each message is consumed by a single consumer that listens to the queue.
With Publish/Subscribe systems, each event can be processed by multiple consumers listening to the topic. So Pub/Sub is the right choice when you want multiple consumers to all receive the same messages from the producer.
Pub/Sub systems introduce the concept of a topic, which is analogous to a folder in a filesystem. You’ll have multiple topics (to categorize your messages) and you can have multiple producers writing to each topic and multiple consumers reading from a topic.
Pub/Sub is an extremely popular pattern in the real world. If you have a video sharing website, you might publish a message to Kafka whenever a new video gets submitted. Then, you’ll have multiple consumers subscribing to those messages: a consumer for the subtitles-generator service, another consumer for the piracy-checking service, another for the transcoding service, etc.
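To make the fan-out concrete, here’s a minimal sketch of the subtitles consumer using Kafka’s official Java client. The topic name (video-uploads), group IDs and broker address are all made up for illustration. Each downstream service runs its own consumer with its own group.id, which is what gives every service its own copy of the stream.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SubtitlesService {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Each service uses its own consumer group ("subtitles-service",
        // "transcoding-service", ...), so all of them receive every message.
        props.put("group.id", "subtitles-service");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("video-uploads"));
            while (true) {
                // Pull new "video uploaded" events from the topic
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("Generating subtitles for video %s%n", record.value());
                }
            }
        }
    }
}
```

The transcoding and piracy-checking services would run the same loop with their own group.id values. If they all shared one group.id, Kafka would treat them as a single consumer group and deliver each message to only one of them, which is the message-queue behavior described above.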
Possible Tools
In the Message Queue space, there’s a plethora of different tools like
AWS SQS - AWS Simple Queue Service is a fully managed queueing service that is extremely scalable.
RabbitMQ - open source message broker that can scale horizontally. It has an extensive plugin ecosystem so you can easily support more message protocols, add monitoring/management tooling and more.
ActiveMQ - an open source message broker written in Java that’s highly scalable and commonly used in Java environments.
In the Publish/Subscribe space, you have tools like
Apache Kafka - Kafka is commonly referred to as an event-streaming platform (an implementation of Pub/Sub). It’s designed to be horizontally scalable and to move events between hundreds of producers and consumers.
Redis Pub/Sub - Part of Redis (an in-memory database). It’s extremely simple and very low latency, so it’s great if you need speed and high throughput and don’t need messages to be persisted.
AWS SNS - AWS Simple Notification Service is a fully managed Pub/Sub service from AWS. Publishers send messages to a topic, and subscribers can receive those messages over HTTP, email or through other AWS services like SQS and Lambda.
Now, we’ll delve into Kafka.
Origins of Kafka
In the late 2000s, LinkedIn needed a highly scalable, reliable data pipeline for ingesting data from their user-activity tracking system and their monitoring system. They looked at ActiveMQ, a popular message queue, but it couldn’t handle their scale.
In 2011, Jay Kreps and his team open-sourced Kafka, a messaging platform built to solve this problem. Their design goals were
Decoupling - Multiple producers and consumers should be able to push/pull messages from a topic. The system needed to follow a Pub/Sub model.
High Throughput - The system should allow for easy horizontal scaling of topics so it could handle a high number of messages.
Reliability - The system should provide message deliverability guarantees. Data should also be written to disk and have multiple replicas to provide durability.
Language Agnostic - Messages are just byte arrays, so producers and consumers can be written in any language and use any data format, such as JSON or Avro.
How Kafka Works
A Kafka system consists of several components
Producer - Any service that pushes messages to Kafka is a producer.
Consumer - Any service that consumes (pulls) messages from Kafka is a consumer.
Kafka Brokers - the Kafka cluster is composed of a network of nodes called brokers. Each machine/instance/container running a Kafka process is called a broker.
Producer
A Producer publishes messages (also known as events) to Kafka. Messages are immutable and have a timestamp, a value and an optional key and headers. The value can be in any format, so you can use XML, JSON, Avro, etc. Typically, you’ll want to keep message sizes small (1 MB is the default maximum).
Messages in Kafka are stored in topics, which are the core structure for organizing and storing your messages. You can think of a topic as a log file that stores all the messages and maintains their order. New messages are appended to the end of the topic.
If you’re using Kafka to store user activity logs, you might have different topics for clicks, purchases, add-to-carts, etc. Multiple producers can write to a single Kafka topic (and multiple consumers can read from a single topic).
When a producer needs to send messages to Kafka, it sends them to a Kafka broker.
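Here’s a rough sketch of that flow with the official Java client. The topic name and payload are hypothetical, and the broker address assumes a local test cluster.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of one (or more) of the brokers in the cluster
        props.put("bootstrap.servers", "localhost:9092");
        // Messages are just byte arrays; here we send JSON as UTF-8 strings,
        // but Avro or Protobuf serializers are also commonly used
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String payload = "{\"userId\": \"u-42\", \"page\": \"/checkout\"}";
            // "clicks" is a hypothetical topic; the key ("u-42") is optional
            producer.send(new ProducerRecord<>("clicks", "u-42", payload));
        } // closing the producer flushes any pending messages to the broker
    }
}
```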
Kafka Broker
In your Kafka cluster, the individual servers are referred to as brokers.
Brokers will handle
Receiving Messages - accepting messages from producers
Storing Messages - storing them on disk, organized by topic. Each message gets a sequential ID called an offset that identifies it.
Serving Messages - sending messages to consumers when they request them
If you have a large backend with thousands of producers/consumers and billions of messages per day, having a single Kafka broker is obviously not going to be sufficient.
Kafka is designed to easily scale horizontally with hundreds of brokers in a single Kafka cluster.
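As a small illustration, any broker can tell you about the rest of the cluster through the admin API. This sketch (using the official Java admin client against a hypothetical local cluster) just lists the brokers:

```java
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.Node;

public class ListBrokers {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Every broker knows the full cluster metadata,
            // so any of them can answer this request
            for (Node broker : admin.describeCluster().nodes().get()) {
                System.out.printf("Broker %d at %s:%d%n",
                    broker.id(), broker.host(), broker.port());
            }
        }
    }
}
```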
Kafka Scalability
As we mentioned, your messages in Kafka are organized into topics. Topics are the core structure for organizing and storing your messages.
In order to be horizontally scalable, each topic can be split up into partitions. Each of these partitions can be put on different Kafka brokers so you can split a specific topic across multiple machines.
You can also create replicas of each partition for fault tolerance (in case a machine fails). Replication in Kafka works off a leader-follower model: one replica of each partition is the leader and handles all reads/writes, while the follower replicas passively copy data from the leader for redundancy. Kafka also has a feature that lets follower replicas serve reads.
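As a sketch of what setting this up looks like (again with the Java admin client, and the hypothetical clicks topic), you create the topic with a partition count and a replication factor, and Kafka spreads the partitions and their replicas across the brokers:

```java
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateClicksTopic {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // 6 partitions so the topic can be spread across brokers,
            // 3 replicas of each partition for fault tolerance
            NewTopic clicks = new NewTopic("clicks", 6, (short) 3);
            admin.createTopics(List.of(clicks)).all().get();
        }
    }
}
```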
Important Note - when you split up a Kafka topic into partitions, your messages in the topic are ordered within each partition but not across partitions. Kafka guarantees the order of messages at the partition-level.
Choosing the right partitioning strategy is important if you want to set up an efficient Kafka cluster. You can assign messages to partitions randomly, use round-robin, key your messages (so the partition is derived from the message key) or specify the partition explicitly when sending the message.
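Keying is the most common approach: Kafka’s default partitioner hashes the key, so all messages with the same key (say, the same user ID) land on the same partition and stay in order relative to each other. Here’s a minimal sketch, reusing the hypothetical clicks topic:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The default partitioner hashes the key, so every click from
            // user "u-42" lands on the same partition and stays in order.
            producer.send(new ProducerRecord<>("clicks", "u-42", "{\"page\": \"/checkout\"}"));
            producer.send(new ProducerRecord<>("clicks", "u-42", "{\"page\": \"/confirm\"}"));

            // You can also pin a message to an explicit partition
            // (partition 3 here) instead of relying on the key.
            producer.send(new ProducerRecord<>("clicks", 3, "u-99", "{\"page\": \"/home\"}"));
        }
    }
}
```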
Data Storage and Retention
Kafka brokers are responsible for writing messages to disk. Each broker divides its data into segment files and stores them on disk. This structure helps brokers ensure high throughput for reads and writes. The exact mechanics are out of scope for this article, but here’s a fantastic dive on Kafka’s storage internals.
For data retention, Kafka provides several options
Time-based Retention - Data is retained for a specific period of time; after that retention period, it’s eligible for deletion. This is Kafka’s default policy, with a 7-day retention period.
Size-based Retention - Data is kept up to a specified size limit (applied per partition). Once the stored data exceeds this limit, the oldest data is deleted to make room for new messages.
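Both are per-topic settings (retention.ms and retention.bytes) that can be changed at runtime. Here’s a minimal sketch of adjusting them with the Java admin client, again on the hypothetical clicks topic:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "clicks");
            List<AlterConfigOp> ops = List.of(
                // keep messages for 3 days (259,200,000 ms)...
                new AlterConfigOp(new ConfigEntry("retention.ms", "259200000"),
                    AlterConfigOp.OpType.SET),
                // ...or until a partition grows past ~1 GB, whichever comes first
                new AlterConfigOp(new ConfigEntry("retention.bytes", "1073741824"),
                    AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```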
This is the first part of our Tech Dive on Apache Kafka.
The entire article is 10+ pages and 3000+ words.
In the rest of the article, we’ll delve into Kafka Consumers, Messaging Patterns, the Kafka Ecosystem, Criticisms, Scaling Kafka at Big Tech companies and more.
You can get the full article (20+ other tech dives) by subscribing to Quastor Pro here.
Tech Snippets
Subscribe to Quastor Pro for long-form articles on concepts in system design and backend engineering.
Past article content includes System Design Concepts and Tech Dives.
When you subscribe, you’ll also get Spaced Repetition Flashcards for reviewing all the main concepts discussed in prior Quastor articles.
How Faire Maintained Engineering Velocity as They Scaled
Faire is an online marketplace that connects small businesses with makers and brands. A wide variety of products are sold on Faire, ranging from home decor to fashion accessories and much more. Businesses can use Faire to order products they see demand for and then sell them to customers.
Faire was started in 2017 and became a unicorn (valued at over $1 billion) in under 2 years. Now, over 600,000 businesses are purchasing wholesale products from Faire’s online marketplace and they’re valued at over $12 billion.
Marcelo Cortes is the co-founder and CTO of Faire and he wrote a great blog post for YCombinator on how they scaled their engineering team to meet this growth.
Here’s a summary
Faire’s engineering team grew from five engineers to hundreds in just a few years. They were able to sustain their pace of engineering execution by adhering to four important elements.
Hiring the right engineers
Building solid long-term foundations
Tracking metrics for decision-making
Keeping teams small and independent
Hiring the Right Engineers
When you have a huge backlog of features, bug fixes and other work that needs to be shipped, it can be tempting to hire as quickly as possible.
Instead, the engineering team at Faire resisted this urge and made sure new hires met a high bar.
Specific things they looked for were…
Expertise in Faire’s core technology
They wanted to move extremely fast, so they needed engineers with significant prior experience in what Faire was building. The team had to build a complex payments infrastructure in a couple of weeks, which involved integrating with multiple payment processors, handling asynchronous retries, processing partial refunds and a number of other features. People on the team had previously built the same kind of infrastructure for Cash App at Square, which helped tremendously.
Focused on Delivering Value to Customers
When hiring engineers, Faire looked for people who were amazing technically but were also curious about Faire’s business and were passionate about entrepreneurship.
The CTO would ask interviewees questions like “Give me examples of how you or your team impacted the business”. Their answers showed how well they understood their past company’s business and how their work impacted customers.
A positive signal is when engineering candidates proactively ask about Faire’s business/market.
Having customer-focused engineers made it much easier to shut down projects and move people around. The team was more focused on delivering value for the customer and not wedded to the specific products they were building.
Build Solid Long-Term Foundations
From day one, Faire documented their culture in their engineering handbook. They decided to embrace practices like writing tests and code reviews (contrary to other startups that might solely focus on pushing features as quickly as possible).
Faire found that they operated better with these processes and it made onboarding new developers significantly easier.
Here are four foundational elements Faire focused on.
Being Data Driven
Faire started investing in their data engineering/analytics when they were at just 10 customers. They wanted to ensure that data was a part of product decision-making.
From the start, they set up data pipelines to collect and transfer data into Redshift (their data warehouse).
They trained their team on how to use A/B testing and how to transform product ideas into statistically testable experiments. They also set principles around when to run experiments (and when not to) and when to stop experiments early.
Choosing Technology
For picking technologies, they had two criteria
The team should be familiar with the tech
There should be evidence from other companies that the tech is scalable long term
They went with Java for their backend language and MySQL as their database.
Writing Tests
Many startups think they can move faster by not writing tests, but Faire found that the effort spent writing tests had a positive ROI.
They used testing to enforce, validate and document specifications. Within months of code being written, engineers will have forgotten the exact requirements, edge cases, constraints, etc.
Having tests that check these areas lets developers change code without fearing that they’ll unintentionally break something.
Faire didn’t focus on 100% code coverage, but they made sure that critical paths, edge cases, important logic, etc. were all well tested.
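As a purely hypothetical illustration of tests acting as documentation, a short unit test can pin down an edge case (here, how a partial refund should round) far more precisely than a comment that goes stale. Nothing below comes from Faire’s codebase; the helper is a made-up stand-in for real business logic.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.math.BigDecimal;
import java.math.RoundingMode;
import org.junit.jupiter.api.Test;

class PartialRefundTest {
    // Hypothetical stand-in for real business logic: split an order total
    // into n equal parts, rounding each part down to the nearest cent.
    static BigDecimal partialRefund(BigDecimal orderTotal, int parts) {
        return orderTotal.divide(BigDecimal.valueOf(parts), 2, RoundingMode.DOWN);
    }

    @Test
    void partialRefundRoundsDownToTheNearestCent() {
        // The test documents the edge case: a third of $10.00 refunds $3.33, never $3.34
        assertEquals(new BigDecimal("3.33"), partialRefund(new BigDecimal("10.00"), 3));
    }
}
```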
Code Reviews
Faire started doing code reviews after hiring their first engineer. These helped ensure quality, prevented mistakes and spread knowledge.
Some best practices they implemented for code reviews are
Be Kind. Use positive phrasing where possible. It can be easy to unintentionally come across as critical.
Don’t block changes from being merged if the issues are minor. Instead, just ask for the change verbally.
Ensure code adheres to your internal style guide. Faire switched from Java to Kotlin in 2019 and they use JetBrains’ coding conventions.
Track Engineering Metrics
In order to maintain engineering velocity as they grew, Faire started measuring it with metrics when they had just 20 engineers.
Some metrics they started monitoring are
CI wait time
Open Defects
Defects Resolution Time
Flaky tests
New Tickets
And more. They built dashboards to monitor these metrics.
As Faire grew to 100+ engineers, it no longer made sense to track specific engineering velocity metrics across the company.
They moved to a model where each team maintains a set of key performance indicators (KPIs) that are published as a scorecard. These show how successful the team is at maintaining its product areas and parts of the tech stack it owns.
As they rolled this process out, they also identified what makes a good KPI.
Here are some factors they found for identifying good KPIs to track.
Clearly Ladders Up to a Top-Level Business Metric
In order to convince other stakeholders to care about a KPI, you need a clear connection to a top-level business metric (revenue, reduction in expenses, increase in customer LTV, etc.). For pager volume, Faire drew the connection like this: high pager volume leads to tired, distracted engineers, which leads to lower code output and fewer features delivered.
Is Independent of other KPIs
You want to express the maximum amount of information with the fewest number of KPIs. Using KPIs that are correlated with each other (or measuring the same underlying thing) means that you’re taking attention away from another KPI that’s measuring some other area.
Is Normalized In a Meaningful Way
If you’re in a high-growth environment, a KPI’s raw value can be misleading depending on the denominator. You want to normalize the values so that they’re easier to compare over time.
Solely looking at the infrastructure cost can be misleading if your product is growing rapidly. It might be alarming to see that infrastructure costs doubled over the past year, but if the number of users tripled then that would be less concerning.
Instead, you might want to normalize infrastructure costs by dividing the total amount spent by the number of users.
This is a short summary of some of the advice Marcelo offered. For more lessons learned (and additional details), check out the full blog post here.