How Slack Sends Millions of Messages in Real Time
Plus, how Jane Street writes tests, challenges with caching, why tail latencies matter and more.
Today we’ll be talking about
How Slack Sends Millions of Messages in Real Time
Using Channel, Gateway and Presence Servers
Consistent Hashing to map Channel IDs to Channel Servers
Using Envoy (an open source service proxy) for communication with clients
Lessons Learned Building Fraud Detection at Stripe
Starting with a Simpler ML Architecture
Constantly Iterating and Testing Out New Features
Explainability Matters As Much As Detection
How Jane Street Writes Tests
The Tradeoff Between Efficiency and Resiliency
Tail Latency Might Matter More Than You Think
Challenges With Caching (AWS Builder’s Library)
Database Trends in 2023
Databases are often the biggest performance bottleneck when scaling a web app. It’s also quite expensive to switch databases later on, so making the right choice is critical.
Over the past few years, there’s been a ton of research on what tradeoffs databases should be making to optimize for different use cases.
Design choices include factors like
On-Disk Storage Format
Compression vs. Query Performance
You should be aware of what tradeoffs databases like InfluxDB, Cassandra, MongoDB, Postgres, etc. make when you’re picking your solution.
A relational database will make completely different optimizations compared to a time-series or document database. This will affect your access patterns, latencies, data model and much more.
Here’s a great article that delves into the latest trends in the ecosystem, what factors you should be considering, why you might want to use a specialized database and more.
How Slack sends Millions of Messages in Real Time
Slack is a chat tool that helps teams communicate and work together easily. You can use it to send messages and files, as well as schedule meetings or hold video calls.
Messages in Slack are sent inside of channels (think of a channel as a chat room or group chat you can set up within your company’s Slack for a specific team/initiative). Every day, Slack has to send millions of messages across millions of channels in real time.
They need to accomplish this while handling highly variable traffic patterns. Most of Slack’s users are in North America, so they’re mostly online between 9 am and 5 pm with peaks at 11 am and 2 pm.
Sameera Thangudu is a Senior Software Engineer at Slack and she wrote a great blog post going through their architecture.
Slack uses a variety of different services to run their messaging system. Some of the important ones are
Channel Servers - These servers are responsible for holding the messages of channels (group chats within your team’s Slack workspace). Each channel has an ID that is hashed and mapped to a unique channel server. Slack uses consistent hashing to spread the load across the channel servers, so that servers can easily be added/removed while minimizing the amount of resharding needed to keep partitions balanced.
Gateway Servers - These servers sit in between Slack web/desktop/mobile clients and the Channel Servers. They hold information on which channels a user is subscribed to. Users connect to Gateway Servers to subscribe to a channel’s messages, so these servers are deployed across multiple geographic regions and each user connects to whichever Gateway Server is closest to them.
Presence Servers - These servers store user information; they keep track of which users are online (they’re responsible for powering the green presence dots). Slack clients can make queries to the Presence Servers through the Gateway Servers. This way, they can get presence status/changes.
Now, we’ll talk about how these different services interact with the Slack mobile/desktop app.
Slack Client Boot Up
When you first open the Slack app, it sends a request to Slack’s backend to get a user token and websocket connection setup information.
This information tells the Slack app which Gateway Server they should connect to (these servers are deployed on the edge to be close to clients).
For communicating with the Slack clients, Slack uses Envoy, an open source service proxy originally built at Lyft. Envoy is quite popular for powering communication between backend services; it has built in functionality to take care of things like protocol conversion, observability, service discovery, retries, load balancing and more. In a previous article, we talked about how Snapchat uses Envoy for their service mesh.
In addition, Envoy can be used for communication with clients, which is how Slack is using it here. The user sends a Websocket connection request to Envoy which forwards it to a Gateway server.
Once a user connects to the Gateway Server, it will fetch information on all of that user’s channel subscriptions from another service in Slack’s backend. After getting the data, the Gateway Server will subscribe to all the channel servers that hold those channels.
Now, the Slack client is ready to send and receive real time messages.
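The boot flow above can be sketched from the Gateway Server’s side. This is a minimal, illustrative model (the class and service names are hypothetical, not Slack’s actual code): on connect, the gateway looks up the user’s channel subscriptions and registers itself for each of those channels.

```python
from collections import defaultdict


class SubscriptionService:
    """Stand-in for the backend service that knows each user's channels."""

    def __init__(self, subs):
        self._subs = subs

    def channels_for(self, user_id):
        return self._subs.get(user_id, [])


class GatewayServer:
    def __init__(self, subscription_service):
        self.subscription_service = subscription_service
        self.channels_by_client = {}      # client connection -> channel IDs
        self.subscribed_channels = set()  # channels this gateway relays

    def on_client_connect(self, user_id, client):
        # On connect, fetch the user's subscriptions from the backend and make
        # sure this gateway is subscribed to all the matching channels.
        channels = self.subscription_service.channels_for(user_id)
        self.channels_by_client[client] = channels
        self.subscribed_channels.update(channels)


svc = SubscriptionService({"U1": ["C-general", "C-eng"]})
gateway = GatewayServer(svc)
gateway.on_client_connect("U1", "conn-1")
```

In the real system the subscription step reaches out to separate channel servers over the network; here it’s collapsed into an in-memory set to show the shape of the flow.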
Sending a Message
Once you send a message from your app to a channel, Slack needs to make sure the message is broadcast to all the clients that are online and in the channel.
The message is first sent to Slack’s backend, which will use another microservice to figure out which Channel Servers are responsible for holding the state around that channel.
The message gets delivered to the correct Channel Server, which will then send the message to every Gateway Server across the world that is subscribed to that channel.
Each Gateway Server receives that message and sends it to every connected client subscribed to that channel ID.
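The two-level fan-out described above (Channel Server → subscribed Gateway Servers → connected clients) can be sketched like this. All names are illustrative, and client connections are modeled as simple inbox lists rather than websockets:

```python
from collections import defaultdict


class Gateway:
    """Holds connections for nearby clients (modeled as inbox lists)."""

    def __init__(self):
        self.clients_by_channel = defaultdict(list)

    def attach_client(self, channel_id, inbox):
        self.clients_by_channel[channel_id].append(inbox)

    def deliver(self, channel_id, message):
        # Second hop: fan out to every connected client in the channel.
        for inbox in self.clients_by_channel[channel_id]:
            inbox.append(message)


class ChannelServer:
    """Owns channel state and fans messages out to subscribed gateways."""

    def __init__(self):
        self.gateways_by_channel = defaultdict(list)

    def subscribe(self, channel_id, gateway):
        self.gateways_by_channel[channel_id].append(gateway)

    def send_message(self, channel_id, message):
        # First hop: fan out to every gateway subscribed to this channel.
        for gateway in self.gateways_by_channel[channel_id]:
            gateway.deliver(channel_id, message)


cs = ChannelServer()
us_gateway, eu_gateway = Gateway(), Gateway()
cs.subscribe("C1", us_gateway)
cs.subscribe("C1", eu_gateway)

alice, bob = [], []
us_gateway.attach_client("C1", alice)
eu_gateway.attach_client("C1", bob)

cs.send_message("C1", "hello team")
# both clients receive the message, regardless of which region they're in
```

The point of the two hops is that the Channel Server only needs to know about a handful of gateways per channel, not every individual client connection in the world.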
Scaling Channel Servers
As mentioned, Slack uses Consistent Hashing to map Channel IDs to Channel Servers. Here’s a great video on Consistent Hashing if you’re unfamiliar with it (it’s best explained visually).
A TL;DW is that consistent hashing lets you easily add/remove channel servers from your cluster while minimizing the amount of resharding (shifting Channel IDs between channel servers to keep the load balanced across all of them).
Slack uses Consul to manage which Channel Servers are storing which Channels and they have Consistent Hash Ring Managers (CHARMs) to run the consistent hashing algorithms.
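To make the idea concrete, here’s a minimal consistent hash ring in Python. This is a generic sketch of the technique, not Slack’s actual CHARMs implementation; it uses virtual nodes so the load spreads evenly across servers:

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Map any string onto a large integer ring position.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Minimal consistent hash ring with virtual nodes."""

    def __init__(self, servers, vnodes=100):
        # Each server gets `vnodes` positions on the ring for smoother balance.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def server_for(self, channel_id: str) -> str:
        # Walk clockwise to the first server position at or after the
        # channel's hash, wrapping around at the end of the ring.
        idx = bisect.bisect(self._keys, _hash(channel_id)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["channel-server-1", "channel-server-2", "channel-server-3"])
owner = ring.server_for("C024BE91L")  # same channel ID always maps to the same server
```

The key property: if a server is removed, only the channels that lived on it get remapped; every other channel keeps its existing server, which is exactly the "minimal resharding" behavior described above.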
For more details, read the full blog post here.
How did you like this summary?
Your feedback really helps me improve curation for future emails.
The Best Way to Handle Large Sets of Time-Stamped Data
Working with large sets of time-stamped data has its challenges.
Fortunately, InfluxDB is a time series database purpose-built to handle the unique workloads of time series data.
Using InfluxDB, developers can ingest billions of data points in real-time with unbounded cardinality, and store, analyze, and act on that data – all in a single database.
No matter what kind of time series data you’re working with – metrics, events, traces, or logs – InfluxDB Cloud provides a performant, elastic, serverless time series platform with the tools and features developers need. Native SQL compatibility makes it easy to get started with InfluxDB and to scale your solutions.
Companies like IBM, Cisco, and Robinhood all rely heavily on InfluxDB to build and manage responsive backend applications, to power predictive intelligence, and to monitor their systems for insights that they would otherwise miss.
See for yourself by quickly spinning up InfluxDB Cloud and testing it out for free.
Lessons Learned Building Stripe Radar
Stripe is one of the largest payment processors in the world, with an expected 1 trillion dollars of payment volume processed in 2023. They’ve built a huge amount of tooling, dashboards, APIs and more to help businesses easily collect payments from customers.
Radar is Stripe’s fraud prevention solution. It assesses over a thousand characteristics of a transaction to determine the probability that it’s fraudulent. This is used to flag any risky transactions so they can be blocked/filtered for extra security checks.
Radar is able to make this decision in less than 100 milliseconds with very high accuracy.
Ryan Drapeau is a Staff Software Engineer on Stripe Radar and he wrote a great blog post on key learnings the team had when building the solution.
Lesson 1: Changing the ML Architecture
Radar started with a relatively simple ML architecture and shifted to more complex models over time.
At first, they started with logistic regression models, then used gradient boosted decision trees, later added deep neural networks, and in mid-2022 switched to a pure deep neural network model.
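For intuition, the starting point of that progression, logistic regression, is simple enough to sketch in a few lines. The weights and features below are toy values for illustration, not anything from Radar:

```python
import math


def fraud_probability(features, weights, bias):
    # Logistic regression: a weighted sum of transaction features squashed
    # through a sigmoid into a 0-1 fraud probability.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))


# Toy weights for two made-up features:
# (transaction amount in $1000s, country-mismatch flag)
weights, bias = [0.8, 2.1], -3.0

low_risk = fraud_probability([0.05, 0], weights, bias)   # small, local purchase
high_risk = fraud_probability([5.0, 1], weights, bias)   # large, mismatched country
```

A model this simple is easy to train and explain, which is part of why it makes a good first architecture; the later jumps to trees and deep networks trade that simplicity for accuracy.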
With each architectural jump, they observed a large improvement in model performance and scaling.
Their current architecture is a deep neural network inspired by ResNeXt. Switching to the DNN gave the team an improvement in performance but also reduced training time by over 85% (to under two hours).
Now, the DNN can be trained multiple times a day, allowing Radar to factor in new trends more quickly.
This year, the team is looking into ML techniques like transfer learning, embeddings and multi-task learning.
Lesson 2: Never Stop Searching for ML Features
One of the biggest levers you have for improving the model is with feature engineering. Features are the data points used to make predictions. If you’re predicting housing prices, one feature might be the house’s zip code or number of bedrooms.
Stripe created several processes to enable ML engineers to iterate on and evaluate new features.
They review past fraud attacks in excruciating detail and build investigation reports that get into the minds of the fraudsters. They look for any signals/indications in the data, like patterns in the email addresses (that could suggest this is a throwaway email) or correlations in timing of the activity.
All of this information is gathered and used to come up with potential features that target the specifics of each attack. They prioritize the most promising features and quickly implement and test them to understand the impact on the ML model’s performance.
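A feature-extraction step of that kind might look like the sketch below. The feature names here are illustrative guesses based on the signals mentioned above (email patterns, timing correlations), not Stripe’s actual features:

```python
def extract_features(txn, recent_txns):
    """Turn a raw transaction into candidate model features.
    Feature names are hypothetical examples, not Stripe's real signals."""
    email = txn["email"]
    domain = email.split("@")[-1]
    return {
        "amount_cents": txn["amount_cents"],
        # Throwaway-email signals: odd patterns in the address itself.
        "digits_in_email": sum(ch.isdigit() for ch in email),
        "is_disposable_domain": domain in {"mailinator.com", "sharklasers.com"},
        # Timing correlation: burst of activity from the same IP address.
        "same_ip_last_hour": sum(
            1 for t in recent_txns
            if t["ip"] == txn["ip"] and txn["ts"] - t["ts"] <= 3600
        ),
    }


txn = {
    "email": "jd19842@mailinator.com",
    "amount_cents": 9900,
    "ip": "1.2.3.4",
    "ts": 10_000,
}
history = [{"ip": "1.2.3.4", "ts": 9_500}, {"ip": "5.6.7.8", "ts": 9_900}]
features = extract_features(txn, history)
```

Once a candidate feature like this is implemented, it can be backtested against labeled historical transactions to measure whether it actually improves the model before shipping it.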
They’re also working on increasing the size of their training data. The tradeoff is that training time increases linearly with the size of the dataset; however, this is mitigated by improvements in training speed.
The Stripe team found that a 10x increase in training transaction data resulted in significant model improvements and they’re currently working on a 100x version.
Lesson 3: Explanation Matters as Much as Detection
Explainability is extremely important for Radar. If Stripe flags a transaction, they need to explain which features of the transaction contributed to it being declined. This can be a challenge with deep neural networks, so it’s another tradeoff that the engineering team has to consider.
Stripe built their Risk Insights feature, which lets businesses see exactly what features of the transaction led to it being declined. For example, an unusually large number of cards might be associated with the customer’s IP address.
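One simple way to produce per-feature explanations like that is occlusion-style attribution: neutralize each feature in turn and measure how much the risk score drops. This is a generic sketch of the idea (not necessarily how Radar computes its explanations), shown on a toy logistic model with made-up feature names:

```python
import math


def risk_score(features, weights, bias):
    # Toy risk model: weighted sum of named features through a sigmoid.
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))


def explain(features, weights, bias, baseline=0.0):
    # Occlusion-style attribution: how much does the score drop when a single
    # feature is replaced with a neutral baseline value? Bigger drop means
    # that feature contributed more to the decision.
    full = risk_score(features, weights, bias)
    return {
        name: full - risk_score({**features, name: baseline}, weights, bias)
        for name in features
    }


weights = {"cards_on_ip": 1.5, "amount": 0.2}
features = {"cards_on_ip": 4.0, "amount": 1.0}
contributions = explain(features, weights, bias=-3.0)
# here the dominant contributor is the number of cards seen on the IP address
```

For deep networks this gets harder, since feature interactions mean no single feature’s effect is independent of the others, which is why explainability is a real engineering tradeoff rather than a free add-on.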
For more details on explainable deep learning, you can check out this paper from September of 2021.
Here’s the full post from the Stripe Engineering blog.