How Slack Sends Millions of Messages in Real Time
Plus, how Jane Street writes tests, challenges with caching, why tail latencies matter and more.
Today we’ll be talking about
How Slack Sends Millions of Messages in Real Time
Using Channel, Gateway and Presence Servers
Consistent Hashing to map Channel IDs to Channel Servers
Using Envoy (an open source service proxy) for communication with clients
Lessons Learned Building Fraud Detection at Stripe
Starting with a Simpler ML Architecture
Constantly Iterating and Testing Out New Features
Explainability Matters As Much As Detection
How Jane Street Writes Tests
The Tradeoff Between Efficiency and Resiliency
Tail Latency Might Matter More Than You Think
Challenges With Caching (AWS Builder’s Library)
Database Trends in 2023
Databases are often the biggest performance bottleneck when scaling a web app. It’s also quite expensive to switch databases later on, so making the right choice is critical.
Over the past few years, there’s been a ton of research on what tradeoffs databases should be making to optimize for different use cases.
Design choices include factors like
On-Disk Storage Format
Compression vs. Query Performance
You should be aware of what tradeoffs databases like InfluxDB, Cassandra, MongoDB, Postgres, etc. make when you’re picking your solution.
A relational database will make completely different optimizations compared to a time-series or document database. This will affect your access patterns, latencies, data model and much more.
Here’s a great article that delves into the latest trends in the ecosystem, what factors you should be considering, why you might want to use a specialized database and more.
How Slack sends Millions of Messages in Real Time
Slack is a chat tool that helps teams communicate and work together easily. You can use it to send messages and files, as well as schedule meetings or hold video calls.
Messages in Slack are sent inside of channels (think of a channel as a chat room or group chat you can set up within your company’s Slack for a specific team/initiative). Every day, Slack has to send millions of messages across millions of channels in real time.
They need to accomplish this while handling highly variable traffic patterns. Most of Slack’s users are in North America, so they’re mostly online between 9 am and 5 pm with peaks at 11 am and 2 pm.
Sameera Thangudu is a Senior Software Engineer at Slack and she wrote a great blog post going through their architecture.
Slack uses a variety of different services to run their messaging system. Some of the important ones are
Channel Servers - These servers are responsible for holding the messages of channels (group chats within your team’s Slack workspace). Each channel has an ID that is hashed and mapped to a unique channel server. Slack uses consistent hashing to spread the load across the channel servers, so that servers can easily be added/removed while minimizing the amount of resharding needed to keep partitions balanced.
Gateway Servers - These servers sit in between Slack web/desktop/mobile clients and the Channel Servers. They hold information on which channels a user is subscribed to. Users connect to Gateway Servers to subscribe to a channel’s messages, so these servers are deployed across multiple geographic regions and each user connects to whichever Gateway Server is closest to them.
Presence Servers - These servers store user information; they keep track of which users are online (they’re responsible for powering the green presence dots). Slack clients can make queries to the Presence Servers through the Gateway Servers. This way, they can get presence status/changes.
Now, we’ll talk about how these different services interact with the Slack mobile/desktop app.
Slack Client Boot Up
When you first open the Slack app, it sends a request to Slack’s backend to get a user token and websocket connection setup information.
This information tells the Slack app which Gateway Server they should connect to (these servers are deployed on the edge to be close to clients).
For communicating with the Slack clients, Slack uses Envoy, an open source service proxy originally built at Lyft. Envoy is quite popular for powering communication between backend services; it has built in functionality to take care of things like protocol conversion, observability, service discovery, retries, load balancing and more. In a previous article, we talked about how Snapchat uses Envoy for their service mesh.
In addition, Envoy can be used for communication with clients, which is how Slack is using it here. The user sends a Websocket connection request to Envoy which forwards it to a Gateway server.
Once a user connects to the Gateway Server, it will fetch information on all of that user’s channel subscriptions from another service in Slack’s backend. After getting the data, the Gateway Server will subscribe to all the channel servers that hold those channels.
Now, the Slack client is ready to send and receive real time messages.
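The boot flow above can be sketched from the Gateway Server’s side. This is a minimal, illustrative model (the class and service names are hypothetical, not Slack’s actual code): on connect, the gateway looks up the user’s channel subscriptions and registers itself for each of those channels.

```python
from collections import defaultdict


class SubscriptionService:
    """Stand-in for the backend service that knows each user's channels."""

    def __init__(self, subs):
        self._subs = subs

    def channels_for(self, user_id):
        return self._subs.get(user_id, [])


class GatewayServer:
    def __init__(self, subscription_service):
        self.subscription_service = subscription_service
        self.channels_by_client = {}      # client connection -> channel IDs
        self.subscribed_channels = set()  # channels this gateway relays

    def on_client_connect(self, user_id, client):
        # On connect, fetch the user's subscriptions from the backend and make
        # sure this gateway is subscribed to all the matching channels.
        channels = self.subscription_service.channels_for(user_id)
        self.channels_by_client[client] = channels
        self.subscribed_channels.update(channels)


svc = SubscriptionService({"U1": ["C-general", "C-eng"]})
gateway = GatewayServer(svc)
gateway.on_client_connect("U1", "conn-1")
```

In the real system the subscription step reaches out to separate channel servers over the network; here it’s collapsed into an in-memory set to show the shape of the flow.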
Sending a Message
Once you send a message from your app to a channel, Slack needs to make sure the message is broadcast to all the clients that are online and in the channel.
The message is first sent to Slack’s backend, which will use another microservice to figure out which Channel Servers are responsible for holding the state around that channel.
The message gets delivered to the correct Channel Server, which will then send the message to every Gateway Server across the world that is subscribed to that channel.
Each Gateway Server receives that message and sends it to every connected client subscribed to that channel ID.
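The two-level fan-out described above (Channel Server → subscribed Gateway Servers → connected clients) can be sketched like this. All names are illustrative, and client connections are modeled as simple inbox lists rather than websockets:

```python
from collections import defaultdict


class Gateway:
    """Holds connections for nearby clients (modeled as inbox lists)."""

    def __init__(self):
        self.clients_by_channel = defaultdict(list)

    def attach_client(self, channel_id, inbox):
        self.clients_by_channel[channel_id].append(inbox)

    def deliver(self, channel_id, message):
        # Second hop: fan out to every connected client in the channel.
        for inbox in self.clients_by_channel[channel_id]:
            inbox.append(message)


class ChannelServer:
    """Owns channel state and fans messages out to subscribed gateways."""

    def __init__(self):
        self.gateways_by_channel = defaultdict(list)

    def subscribe(self, channel_id, gateway):
        self.gateways_by_channel[channel_id].append(gateway)

    def send_message(self, channel_id, message):
        # First hop: fan out to every gateway subscribed to this channel.
        for gateway in self.gateways_by_channel[channel_id]:
            gateway.deliver(channel_id, message)


cs = ChannelServer()
us_gateway, eu_gateway = Gateway(), Gateway()
cs.subscribe("C1", us_gateway)
cs.subscribe("C1", eu_gateway)

alice, bob = [], []
us_gateway.attach_client("C1", alice)
eu_gateway.attach_client("C1", bob)

cs.send_message("C1", "hello team")
# both clients receive the message, regardless of which region they're in
```

The point of the two hops is that the Channel Server only needs to know about a handful of gateways per channel, not every individual client connection in the world.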
Scaling Channel Servers
As mentioned, Slack uses Consistent Hashing to map Channel IDs to Channel Servers. Here’s a great video on Consistent Hashing if you’re unfamiliar with it (it’s best explained visually).
A TL;DW is that consistent hashing lets you easily add/remove channel servers from your cluster while minimizing the amount of resharding (shifting Channel IDs between channel servers to keep the load balanced across all of them).
Slack uses Consul to manage which Channel Servers are storing which Channels and they have Consistent Hash Ring Managers (CHARMs) to run the consistent hashing algorithms.
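To make the idea concrete, here’s a minimal consistent hash ring in Python. This is a generic sketch of the technique, not Slack’s actual CHARMs implementation; it uses virtual nodes so the load spreads evenly across servers:

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Map any string onto a large integer ring position.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Minimal consistent hash ring with virtual nodes."""

    def __init__(self, servers, vnodes=100):
        # Each server gets `vnodes` positions on the ring for smoother balance.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def server_for(self, channel_id: str) -> str:
        # Walk clockwise to the first server position at or after the
        # channel's hash, wrapping around at the end of the ring.
        idx = bisect.bisect(self._keys, _hash(channel_id)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["channel-server-1", "channel-server-2", "channel-server-3"])
owner = ring.server_for("C024BE91L")  # same channel ID always maps to the same server
```

The key property: if a server is removed, only the channels that lived on it get remapped; every other channel keeps its existing server, which is exactly the "minimal resharding" behavior described above.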
For more details, read the full blog post here.
How did you like this summary?
Your feedback really helps me improve curation for future emails.
The Best Way to Handle Large Sets of Time-Stamped Data
Working with large sets of time-stamped data has its challenges.
Fortunately, InfluxDB is a time series database purpose-built to handle the unique workloads of time series data.
Using InfluxDB, developers can ingest billions of data points in real-time with unbounded cardinality, and store, analyze, and act on that data – all in a single database.
No matter what kind of time series data you’re working with – metrics, events, traces, or logs – InfluxDB Cloud provides a performant, elastic, serverless time series platform with the tools and features developers need. Native SQL compatibility makes it easy to get started with InfluxDB and to scale your solutions.
Companies like IBM, Cisco, and Robinhood all rely heavily on InfluxDB to build and manage responsive backend applications, to power predictive intelligence, and to monitor their systems for insights that they would otherwise miss.
See for yourself by quickly spinning up InfluxDB Cloud and testing it out for free.
Lessons Learned Building Stripe Radar
Stripe is one of the largest payment processors in the world, with an expected 1 trillion dollars of payment volume processed in 2023. They’ve built a huge amount of tooling, dashboards, APIs and more to help businesses easily collect payments from customers.
Radar is Stripe’s fraud prevention solution. It assesses over a thousand characteristics of a transaction to determine the probability that it’s fraudulent. This is used to flag any risky transactions so they can be blocked/filtered for extra security checks.
Radar is able to make this decision in less than 100 milliseconds with very high accuracy.
Ryan Drapeau is a Staff Software Engineer on Stripe Radar and he wrote a great blog post on key learnings the team had when building the solution.
Lesson 1: Changing the ML Architecture
Radar started with a relatively simple ML architecture and shifted to more complex models over time.
At first, they started with logistic regression models, then used gradient boosted decision trees, later added deep neural networks, and in mid-2022 switched to a pure deep neural network model.
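For intuition, the starting point of that progression, logistic regression, is simple enough to sketch in a few lines. The weights and features below are toy values for illustration, not anything from Radar:

```python
import math


def fraud_probability(features, weights, bias):
    # Logistic regression: a weighted sum of transaction features squashed
    # through a sigmoid into a 0-1 fraud probability.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))


# Toy weights for two made-up features:
# (transaction amount in $1000s, country-mismatch flag)
weights, bias = [0.8, 2.1], -3.0

low_risk = fraud_probability([0.05, 0], weights, bias)   # small, local purchase
high_risk = fraud_probability([5.0, 1], weights, bias)   # large, mismatched country
```

A model this simple is easy to train and explain, which is part of why it makes a good first architecture; the later jumps to trees and deep networks trade that simplicity for accuracy.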
With each architectural jump, they observed a large improvement in model performance and scaling.
Their current architecture is a deep neural network inspired by ResNeXt. Switching to the DNN gave the team an improvement in performance but also reduced training time by over 85% (to under two hours).
Now, the DNN can be trained multiple times a day, allowing Radar to factor in new trends more quickly.
This year, the team is looking into ML techniques like transfer learning, embeddings and multi-task learning.
Lesson 2: Never Stop Searching for ML Features
One of the biggest levers you have for improving the model is with feature engineering. Features are the data points used to make predictions. If you’re predicting housing prices, one feature might be the house’s zip code or number of bedrooms.
Stripe created several processes to enable ML engineers to iterate on and evaluate new features.
They review past fraud attacks in excruciating detail and build investigation reports that get into the minds of the fraudsters. They look for any signals/indications in the data, like patterns in the email addresses (that could suggest this is a throwaway email) or correlations in timing of the activity.
All of this information is gathered and used to come up with potential features that target the specifics of each attack. They prioritize the most promising features and quickly implement and test them to understand the impact on the ML model’s performance.
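A feature-extraction step of that kind might look like the sketch below. The feature names here are illustrative guesses based on the signals mentioned above (email patterns, timing correlations), not Stripe’s actual features:

```python
def extract_features(txn, recent_txns):
    """Turn a raw transaction into candidate model features.
    Feature names are hypothetical examples, not Stripe's real signals."""
    email = txn["email"]
    domain = email.split("@")[-1]
    return {
        "amount_cents": txn["amount_cents"],
        # Throwaway-email signals: odd patterns in the address itself.
        "digits_in_email": sum(ch.isdigit() for ch in email),
        "is_disposable_domain": domain in {"mailinator.com", "sharklasers.com"},
        # Timing correlation: burst of activity from the same IP address.
        "same_ip_last_hour": sum(
            1 for t in recent_txns
            if t["ip"] == txn["ip"] and txn["ts"] - t["ts"] <= 3600
        ),
    }


txn = {
    "email": "jd19842@mailinator.com",
    "amount_cents": 9900,
    "ip": "1.2.3.4",
    "ts": 10_000,
}
history = [{"ip": "1.2.3.4", "ts": 9_500}, {"ip": "5.6.7.8", "ts": 9_900}]
features = extract_features(txn, history)
```

Once a candidate feature like this is implemented, it can be backtested against labeled historical transactions to measure whether it actually improves the model before shipping it.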
They’re also working on increasing the size of their training data. The tradeoff is that training time increases linearly with the size of the dataset; however, this is mitigated by improvements in training speed.
The Stripe team found that a 10x increase in training transaction data resulted in significant model improvements and they’re currently working on a 100x version.
Lesson 3: Explanation Matters as Much as Detection
Explainability is extremely important for Radar. If Stripe flags a transaction, they need to explain which features of the transaction contributed to it being declined. This can be a challenge with deep neural networks, so it’s another tradeoff that the engineering team has to consider.
Stripe built their Risk Insights feature, which lets businesses see exactly what features of the transaction led to it being declined. For example, an unusually large number of cards might be associated with the customer’s IP address.
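One simple way to produce per-feature explanations like that is occlusion-style attribution: neutralize each feature in turn and measure how much the risk score drops. This is a generic sketch of the idea (not necessarily how Radar computes its explanations), shown on a toy logistic model with made-up feature names:

```python
import math


def risk_score(features, weights, bias):
    # Toy risk model: weighted sum of named features through a sigmoid.
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))


def explain(features, weights, bias, baseline=0.0):
    # Occlusion-style attribution: how much does the score drop when a single
    # feature is replaced with a neutral baseline value? Bigger drop means
    # that feature contributed more to the decision.
    full = risk_score(features, weights, bias)
    return {
        name: full - risk_score({**features, name: baseline}, weights, bias)
        for name in features
    }


weights = {"cards_on_ip": 1.5, "amount": 0.2}
features = {"cards_on_ip": 4.0, "amount": 1.0}
contributions = explain(features, weights, bias=-3.0)
# here the dominant contributor is the number of cards seen on the IP address
```

For deep networks this gets harder, since feature interactions mean no single feature’s effect is independent of the others, which is why explainability is a real engineering tradeoff rather than a free add-on.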
For more details on explainable deep learning, you can check out this paper from September of 2021.
Here’s the full post from the Stripe Engineering blog.