How Slack Processes 33,000 Jobs per Second

Plus, how Google writes Maintainable Code, scaling an Engineering Team and more.

October 09, 2023

Hey Everyone!

Today we'll be talking about

How Slack Redesigned Their Job Queue
- Slack needs to process billions of jobs per day. They built a highly scalable job queue to help manage this.
- The initial architecture relied on Redis, but they faced scaling pains
- To solve their issues, they incorporated Kafka to add durability and scalability
How Faire Scaled Their Engineering Team to Hundreds of Developers
- Hiring developers who are customer focused
- Implementing code review and testing early on
- Tracking KPIs around developer velocity
- Factors that make a good KPI
Tech Snippets
- How Google writes Clean, Maintainable Code
- Decomposing Language Models Into Understandable Components
- How to get Buy-In with Push and Pull
- Role Of Algorithms

How Slack Redesigned Their Job Queue

Slack is a workplace-chat tool that makes it easier for employees to communicate with each other. You can use it to send messages/files or have video calls. (Email wasn’t bad enough at killing our ability to stay in deep-focus so now we have Slack)

Many of the tasks in the Slack app are handled asynchronously. Things like

Sending push notifications
Scheduling calendar reminders
Billing customers

And more.

When the app needs to execute one of these tasks, they’re added to a job queue that Slack built. Backend workers will then dequeue the job off the queue and complete it.

This job queue needs to be highly scalable. Slack processes billions of jobs per day with a peak rate of 33,000 jobs per second.

Previously, Slack relied on Redis for the task queue but they faced scaling pains and needed to re-architect. The team integrated Apache Kafka and used that to reduce the strain on Redis, increase durability and more.

Saroj Yadav was previously a Staff Engineer at Slack and she wrote a fantastic blog post delving into the old architecture they had for the job queue, scaling issues they faced and how they redesigned the system.

Note - this post was last updated in 2020, so aspects of it are out of date. However, it provides a really good overview of how to build a job queue and also talks about scaling issues you might face in other contexts.

Summary

Here’s the previous architecture that Slack used.

At the top (www1, www2, www3), you have the different web apps that Slack has. Users will access Slack through these web/mobile apps and these apps will send backend requests to create jobs.

The backend previously stored these jobs with a Redis task queue.

Redis is an open-source key-value database that can easily be distributed horizontally across millions of nodes. It’s in-memory (to minimize latency) and provides a large number of pre-built data structures.

As stated, Redis uses a key/value model, so you can use this key to shard your data and store it across different nodes. Redis provides functionality to repartition (move your data between shards) for rebalancing as you grow.

So, when enqueueing a job to the Redis task queue, the Slack app will create a unique identifier for the job based on the job’s type and properties. It will hash this identifier and select a Redis host based on the hash.

The system will first perform some deduplication and make sure that a job with the same identifier doesn’t already exist in the database shard. If there is no duplicate, then the new job is added to the Redis instance.

Pools of workers will then poll the Redis cluster and look for new jobs. When a worker finds a job, it will move it from the queue and add it to a list of in-flight jobs. The worker will also spawn an asynchronous task to execute the job.

Once the task completes, the worker will remove the job from the list of in-flight jobs. If it fails, then the worker will move it to a special queue so it can be retried. Jobs will be retried for a pre-configured number of times until they are moved to a separate list for permanently failed jobs.

Issues with the Architecture

As Slack scaled in size, they ran into a number of issues with this set up.

Here’s some of the constraints the Slack team identified

Too Little Head-Room - The Redis servers were configured to have a limit on the amount of memory available (based on the machine’s available RAM) and the Slack team was frequently running into this limit. When Redis had no free memory, they could no longer enqueue new jobs. Even worse, dequeuing a job required a bit of free memory, so they couldn’t remove jobs either. Instead, the queue was locked up and required an engineer to step in.
Too Many Connections - Every job queue client must connect to every single Redis instance. This meant that every client needed to have up-to-date information on every single instance.
Can’t Scale Job Workers - The job workers were overly-coupled with the Redis instances. Adding additional job workers meant more polling load on the Redis shards. This created complexity where scaling the job workers could overwhelm overloaded Redis instances.
Poor Time Complexity - Previous decisions on which Redis data structure to use meant that dequeuing a job required work proportional to the length of the queue (O(n) time). Turns out leetcoding can be useful!
Unclear Guarantees - Delivery guarantees like exactly-once and at-least-once processing were unclear and hard to define. Changes to relevant features (like task deduplication) were high risk and engineers were hesitant to make them.

To solve these issues, the Slack team identified three aspects of the architecture that they needed to fix

Replace/Augment Redis - They wanted to add a store with durable storage (like Kafka) so it could provide a buffer against memory exhaustion and job loss (in case a Redis instance goes down)
Add New Scheduler - A new job scheduler could manage deliverability guarantees like exactly-once-processing. They could also add features like rate-limiting and prioritizing of certain job types.
Decouple Job Execution - They should be able to scale up job execution and add more workers without overloading the Redis instances.

To solve the issues, the Slack team decided to incorporate Apache Kafka. Kafka is an event streaming platform that is highly scalable and durable. In a past Quastor article, we did a tech dive on Kafka that you can check out here.

In order to alleviate the bottleneck without having to do a major rewrite, the Slack team decided to add Kafka in front of Redis rather than replace Redis outright.

The new components of the system are Kafkagate and JQRelay (the relay that sends jobs from Kafka to Redis).

Kafkagate

This is the new scheduler that takes job requests from the clients (Slack web/mobile apps) and publishes jobs to Kafka.

It’s written in Go and exposes a simple HTTP post interface that the job queue clients can use to create jobs. Having a single entry-point solves the issue of too many connections (where each job request client had to keep track of every single Redis instance).

Slack optimized Kafkagate for latency over durability.

With Kafka, you scale it by splitting each topic up into multiple partitions. You can replicate these partitions so there’s a leader and replica nodes.

When writing jobs to Kafka, Kafkagate only waits for the leader to acknowledge the job and not for the replication to other partition replicas. This minimizes latency but creates a small risk of job loss if the leader node dies before replicating.

Relaying Jobs from Kafka to Redis

To send jobs out of Kafka to Redis, Slack built JQRelay, a stateless service written in Go.

Each Kafka topic maps to a specific Redis instance and Slack uses Consul to make sure that only a single Relay process is consuming messages from a topic at any given time.

Consul is an open source service mesh tool that makes it easy for different services in your backend to communicate with each other. It also provides a distributed, highly-consistent key-value store that you can use to keep track of configuration for your system (similar to Etcd or ZooKeeper)

One feature is Consul Lock, which lets you create a lock so only one service is allowed to perform a specific operation when multiple services are competing for it.

With Slack, they use a Consul lock to make sure that only one Relay process is reading from a certain Kafka topic at a time.

JQRelay also does rate limits (to avoid overwhelming the Redis instance) and retries jobs from Kafka if they fail.

For details on the data migration process, load testing and more, please check out the full article here.

How did you like this summary?

Your feedback really helps me improve curation for future emails.

Tech Snippets

How Google writes clean, maintainable code

Google is well known for their “Readability“ ratings, where a developer gets approved to sign off on pull requests in a certain language.

This is a great read that delves into the pros and cons of Google’s Readability process. The ratings help increase knowledge sharing and enforce best practices but it hurts developer velocity and can be subject to human bias.

engineercodex.substack.com/p/how-google-writes-clean-maintainable

How to get Buy-In with Push and Pull

Pull is where there’s a strong sense that the work being done is valuable and fulfilling a specific need. Push is where developers are instructed to perform a task in a specific manner (pushed to do it in a certain way).

Submitting a PR has lots of pull, since you need to submit it in order to get your code merged. Submitting a daily status report, on the other hand, doesn’t have much pull. It might seem like a meaningless task, so you might need push (everyone must submit a status report).

In order to get buy-in with engineering processes, you need a combination of pull and push. This is an interesting article on how to think about the trade off.

kellanem.com/notes/push-and-pull

Decomposing Language Models Into Understandable Components

Large Language Models have billions of parameters; making it very difficult to understand exactly what’s going on under the hood.

Anthropic AI recently published a paper delving into how they built machinery that lets them break down LLMs into simpler components that they can get intuition around.

www.anthropic.com/index/decomposing-language-models-into-understandable-components

Role Of Algorithms

You probably don’t use Djikstra or BFS on a daily (or weekly/monthly) basis, but understanding these algorithms can help you become a better developer.

Algorithms are a great way to learn about properties/invariants and writing them in code can also help improve your debugging skills.

This is a fantastic read that delves into the benefits of learning algorithms and also enumerates some famous algorithms and their use cases.

matklad.github.io/2023/08/13/role-of-algorithms.html

How Faire Maintained Engineering Velocity as They Scaled

Faire is an online marketplace that connects small businesses with makers and brands. A wide variety of products are sold on Faire, ranging from home decor to fashion accessories and much more. Businesses can use Faire to order products they see demand for and then sell them to customers.

Faire was started in 2017 and became a unicorn (valued at over $1 billion) in under 2 years. Now, over 600,000 businesses are purchasing wholesale products from Faire’s online marketplace and they’re valued at over $12 billion.

Marcelo Cortes is the co-founder and CTO of Faire and he wrote a great blog post for YCombinator on how they scaled their engineering team to meet this growth.

Here’s a summary

Faire’s engineering team grew from five engineers to hundreds in just a few years. They were able to sustain their pace of engineering execution by adhering to four important elements.

Hiring the right engineers
Building solid long-term foundations
Tracking metrics for decision-making
Keeping teams small and independent

Hiring the Right Engineers

When you have a huge backlog of features/bug fixes/etc that need to be pushed, it can be tempting to hire as quickly as possible.

Instead, the engineering team at Faire resisted this urge and made sure new hires met a high bar.

Specific things they looked for were…

Expertise in Faire’s core technology

They wanted to move extremely fast so they needed engineers who had significant previous experience with what Faire was building. The team had to build a complex payments infrastructure in a couple of weeks which involved integrating with multiple payment processors, asynchronous retries, processing partial refunds and a number of other features. People on the team had previous experience building the same infrastructure for Cash App at Square, so that helped tremendously.

Focused on Delivering Value to Customers

When hiring engineers, Faire looked for people who were amazing technically but were also curious about Faire’s business and were passionate about entrepreneurship.

The CTO would ask interviewees questions like “Give me examples of how you or your team impacted the business”. Their answers showed how well they understood their past company’s business and how their work impacted customers.

A positive signal is when engineering candidates proactively ask about Faire’s business/market.

Having customer-focused engineers made it much easier to shut down projects and move people around. The team was more focused on delivering value for the customer and not wedded to the specific products they were building.

Build Solid Long-Term Foundations

From day one, Faire documented their culture in their engineering handbook. They decided to embrace practices like writing tests and code reviews (contrary to other startups that might solely focus on pushing features as quickly as possible).

Faire found that they operated better with these processes and it made onboarding new developers significantly easier.

Here’s four foundational elements Faire focused on.

Being Data Driven

Faire started investing in their data engineering/analytics when they were at just 10 customers. They wanted to ensure that data was a part of product decision-making.

From the start, they set up data pipelines to collect and transfer data into Redshift (their data warehouse).

They trained their team on how to use A/B testing and how to transform product ideas into statistically testable experiments. They also set principles around when to run experiments (and when not to) and when to stop experiments early.

Choosing Technology

For picking technologies, they had two criteria

The team should be familiar with the tech
There should be evidence from other companies that the tech is scalable long term

They went with Java for their backend language and MySQL as their database.

Writing Tests

Many startups think they can move faster by not writing tests, but Faire found that the effort spent writing tests had a positive ROI.

They used testing to enforce, validate and document specifications. Within months of code being written, engineers will forget the exact requirements, edge cases, constraints, etc.

Having tests to check these areas helps developers to not fear changing code and unintentionally breaking things.

Faire didn’t focus on 100% code coverage, but they made sure that critical paths, edge cases, important logic, etc. were all well tested.

Code Reviews

Faire started doing code reviews after hiring their first engineer. These helped ensure quality, prevented mistakes and spread knowledge.

Some best practices they implemented for code reviews are

Be Kind. Use positive phrasing where possible. It can be easy to unintentionally come across as critical.
Don’t block changes from being merged if the issues are minor. Instead, just ask for the change verbally.
Ensure code adheres to your internal style guide. Faire switched from Java to Kotlin in 2019 and they use JetBrains’ coding conventions.

Track Engineering Metrics

In order to maintain engineering velocity, Faire started measuring this with metrics at just 20 engineers.

Some metrics they started monitoring are

CI wait time
Open Defects
Defects Resolution Time
Flaky tests
New Tickets

And more. They built dashboards to monitor these metrics.

As Faire grew to 100+ engineers, it no longer made sense to track specific engineering velocity metrics across the company.

They moved to a model where each team maintains a set of key performance indicators (KPIs) that are published as a scorecard. These show how successful the team is at maintaining its product areas and parts of the tech stack it owns.

As they rolled this process out, they also identified what makes a good KPI.

Here are some factors they found for identifying good KPIs to track.

Clearly Ladders Up to a Top-Level Business Metric

In order to convince other stakeholders to care about a KPI, you need a clear connection to a top-level business metric (revenue, reduction in expenses, increase in customer LTV, etc.). For tracking pager volume, Faire saw the connection as high pager volume leads to tired and distracted engineers which leads to lower code output and fewer features delivered.

Is Independent of other KPIs

You want to express the maximum amount of information with the fewest number of KPIs. Using KPIs that are correlated with each other (or measuring the same underlying thing) means that you’re taking attention away from another KPI that’s measuring some other area.

Is Normalized In a Meaningful Way

If you’re in a high growth environment, looking at the KPI can be misleading depending on the denominator. You want to adjust the values so that they’re easier to compare over time.

Solely looking at the infrastructure cost can be misleading if your product is growing rapidly. It might be alarming to see that infrastructure costs doubled over the past year, but if the number of users tripled then that would be less concerning.

Instead, you might want to normalize infrastructure costs by dividing the total amount spent by the number of users.

This is a short summary of some of the advice Marcelo offered. For more lessons learnt (and additional details) you should check out the full blog post here.

How did you like this summary?

Your feedback really helps me improve curation for future emails. Thanks!