How Slack Automates Deploys

Plus, lessons on tech leadership from a Staff Engineer, open source projects with great documentation and more.

Arpan KG
April 01, 2024

Hey Everyone!

Today we’ll be talking about

How Slack Automates their Deploys
- Slack mostly runs on a massive monolith with hundreds of developers pushing hundreds of changes every week
- For deployment, they previously had engineers take shifts where they’d manually deploy changes and monitor for any regressions/issues
- They’ve shifted to automating this with ReleaseBot
- We’ll talk about how they did this and how ReleaseBot monitors for regressions
Things I have learned about Tech Leadership
- Polly McEldowney is a senior staff engineer at Mozilla. She wrote a great blog post about her experience leading teams at the BBC and what she learned from being in engineering leadership
- Leadership does not mean telling people what to do. Instead, prefer collective decision making
- Be a servant leader and know the importance of building interpersonal relationships
Tech Snippets
- Inside the failed attempt to backdoor SSH globally (xz Utils Backdoor)
- Stop Parroting YouTube solutions in System Design Interviews
- Open Source Projects with Exceptional Documentation
- Implementing Distributed Tracing

How Slack Automates their Deploys

Slack is a workplace chat tool that helps employees communicate easily. You can send messages, share files, make voice/video calls and more. Slack is used by hundreds of thousands of companies and they have tens of millions of users.

Most of Slack runs on a monolith called The Webapp. It's very large, with hundreds of developers pushing hundreds of changes every week.

Their engineering team uses the Continuous Deployment approach, where they deploy small code changes frequently rather than batching them into larger, less frequent releases.

To handle this, Slack previously had engineers take turns acting as Deployment Commanders (DCs). These DCs would take 2 hour shifts where they’d walk The Webapp through the deployment steps and monitor it in case anything went wrong.

Recently, Slack fully automated the job of Deployment Commanders with “ReleaseBot”. ReleaseBot is responsible for monitoring deployments and detecting any anomalies. If something goes wrong, then it can pause/rollback and send an alert to the relevant team.

Sean McIlroy is a senior software engineer at Slack and he wrote a fantastic blog post about how ReleaseBot works.

Why do Continuous Deployment

As we said earlier, Continuous Deployment (CD) is where you deploy small code changes frequently rather than batching them into larger, less frequent releases.

This has several benefits:

Risk Management - With smaller changes, CD reduces the risk associated with each release. If there’s an issue, it’s much easier to isolate the fault since there are fewer lines of code to look through.
Ship Faster - With CD, engineers can ship features to customers much faster. This allows them to quickly see what features are getting traction and also helps the business beat competitors (since customers can get an updated app faster).

Slack deploys 30-40 times a day to their production servers. Each deploy has a median size of 3 PRs.

Slack’s Old Process of Deploys

The big challenge with Continuous Deployment is the implementation. You need to make it easy for engineers to deploy their changes frequently without accidentally causing an outage.

There should be a clear process for initiating a deployment pause/rollback. On the flip side, you also don’t want to block deployments for small errors (that won’t affect user experience) as that hurts developer velocity.

Previously, Slack had engineers take turns working as the Deployment Commander (DC). They’d work a 2-hour shift where they’d walk The Webapp through the deployment steps. As they did this, they’d watch dashboards and manually run tests.

However, many DCs had difficulty monitoring the system and didn’t feel confident when making decisions. They weren’t sure what specific errors/anomalies meant they should stop the deployment.

To solve this, Slack built ReleaseBot. It handles deployments and also does anomaly detection and monitoring. If anything goes wrong with the deployment, ReleaseBot can cancel the deploy and ping a Slack engineer.

However, Slack had to determine exactly what kind of issues were considered an anomaly. If there’s a 10% spike in latency, should that be enough to cancel the deploy?

To program ReleaseBot to properly identify anomalies (without canceling deploys for unimportant reasons), Slack uses two strategies:

Z scores
Dynamic Thresholds

Z scores

A z-score is a statistical measurement that indicates how many standard deviations a data point is from the mean of a dataset. It tells you how unusual an observation is, which the raw value alone can’t do.

If you said that a request’s latency was 270 ms, that number is meaningless without context. If you instead said the latency had a z-score of 3.5, that gives you more context (the request took 3.5 standard deviations longer than the average request) and you might look into what went wrong with it.

The formula for calculating a z-score involves subtracting the mean from the data point and then dividing the result by the standard deviation of the dataset.

A z-score greater than zero means the data point is above the mean, while a z-score less than zero means the data point is below the mean. The absolute value of the z-score tells you the number of standard deviations. A z-score of 2.5 means the metric is 2.5 standard deviations above the mean.
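
To make the formula concrete, here is a minimal Python sketch of the calculation. The function name `z_score` and the sample latencies are made up for illustration and are not taken from Slack’s code. It also uses the population standard deviation, which is one reasonable choice among several.

```python
from statistics import mean, pstdev


def z_score(value: float, history: list[float]) -> float:
    """Return how many standard deviations `value` is from the mean of `history`."""
    mu = mean(history)
    sigma = pstdev(history)  # population standard deviation (an assumption)
    if sigma == 0:
        # Every historical value is identical, so the z-score is undefined.
        raise ValueError("standard deviation is zero; z-score is undefined")
    return (value - mu) / sigma


# Hypothetical request latencies in milliseconds.
latencies_ms = [100, 110, 90, 105, 95]

# 120 ms is above the mean of 100 ms, so the z-score is positive.
print(round(z_score(120, latencies_ms), 2))  # prints 2.83
```

A value equal to the mean gives a z-score of 0, and a value below the mean gives a negative z-score.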

Some learnings the Slack team had related to z scores were

Three Standard Deviations - Slack found that three standard deviations was a good starting point for many of their metrics. So if the z score is above 3 or below -3, then you might want to pause and investigate.
Ignore Negatives/Positives - You might want to ignore either positive or negative z scores for some metrics. Do you care if latency decreased by 3 standard deviations? Maybe or maybe not.
Monitor Non-traditional Areas - In addition to the usual suspects (latency, error rate, server load), Slack also monitors other things like total log volume. This helps them catch errors that the usual metrics might miss.
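
The first two learnings can be sketched as a small check in Python. This is an illustration only: the `Direction` enum and the `is_anomalous` function are hypothetical names, and the default threshold of 3 is just the starting point mentioned above, not a value ReleaseBot necessarily uses for every metric.

```python
from enum import Enum


class Direction(Enum):
    """Which side of the mean we care about for a given metric (hypothetical)."""
    BOTH = "both"
    HIGH_ONLY = "high_only"  # e.g. latency: only alert when it goes up
    LOW_ONLY = "low_only"    # e.g. a throughput metric: only alert when it drops


def is_anomalous(z: float, threshold: float = 3.0,
                 direction: Direction = Direction.BOTH) -> bool:
    """Return True if the z-score crosses the threshold in a direction we care about."""
    if direction is Direction.HIGH_ONLY:
        return z > threshold
    if direction is Direction.LOW_ONLY:
        return z < -threshold
    return abs(z) > threshold


# A drop of 3.5 standard deviations is flagged by default...
print(is_anomalous(-3.5))                                 # prints True
# ...but ignored for a metric where only increases matter.
print(is_anomalous(-3.5, direction=Direction.HIGH_ONLY))  # prints False
```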

Dynamic Thresholds

In addition to z scores, Slack also uses static and dynamic thresholds. These measure things like database load, error rate, latency, etc.

Static thresholds are… static. They are a pre-configured value and they don’t factor in time-of-day or day-of-week.

On the other hand, dynamic thresholds sample historical data from similar time periods for whatever metric is being monitored. They help filter out threshold breaches that are “normal” for the time period.

For example, Slack’s fleet CPU usage might always be lower at 9 pm on a Friday night compared to 8 am on a Monday morning. The Dynamic Threshold for Server CPU usage will take this into account and avoid raising the alarm unnecessarily.

For key metrics, Slack will set static thresholds and have ReleaseBot calculate its own dynamic threshold. Then, ReleaseBot will use the higher of the two.
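
Here is a rough Python sketch of the “higher of the two” rule. The source does not say how ReleaseBot derives its dynamic threshold, so the mean-plus-three-standard-deviations formula below is purely an assumption, as are the function names and the sample CPU numbers.

```python
from statistics import mean, pstdev


def dynamic_threshold(similar_period_samples: list[float],
                      num_stddevs: float = 3.0) -> float:
    """Derive a threshold from historical samples taken at similar times.

    Assumption: mean + N standard deviations. The real calculation may differ.
    """
    return mean(similar_period_samples) + num_stddevs * pstdev(similar_period_samples)


def effective_threshold(static_threshold: float,
                        similar_period_samples: list[float]) -> float:
    """Use whichever is higher: the static or the dynamic threshold."""
    return max(static_threshold, dynamic_threshold(similar_period_samples))


def breaches(current_value: float, static_threshold: float,
             similar_period_samples: list[float]) -> bool:
    """Return True if the current value exceeds the effective threshold."""
    return current_value > effective_threshold(static_threshold, similar_period_samples)


# Hypothetical fleet CPU usage (%) sampled on past Monday mornings.
monday_8am_cpu = [70, 72, 68, 74, 66]
static_cpu_threshold = 75.0

# 77% is above the static threshold but normal for a Monday morning,
# so it does not count as a breach.
print(breaches(77, static_cpu_threshold, monday_8am_cpu))  # prints False
```

Taking the higher value means a busy-but-normal period won’t trip the static threshold, while the static threshold still acts as a floor during quiet periods.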

Results

With these strategies, Slack has been able to automate the task of Deployment Commanders and deploy changes in a safer, more efficient way.

Tech Snippets

What we know about the xz Utils backdoor that almost infected the world

On Friday, a Microsoft developer revealed a backdoor that had been intentionally planted in xz Utils, an open source data compression utility.

The attacker spent years executing it, first building trust within the community before planting the backdoor. They made their first open source commit in 2021.

In early 2023, they started making commits to xz Utils and became increasingly involved in the project. A year later, they issued several commits for xz Utils with the backdoor. The backdoor allows someone with the right private key to execute malicious commands.

Ars Technica wrote a great article explaining all the details.

arstechnica.com/security/2024/04/what-we-know-about-the-xz-utils-backdoor-that-almost-infected-the-world

Stop Parroting YouTube solutions in System Design Interviews

Raviraj is a Staff Engineer at Meta and he wrote a great blog post on his experience conducting hundreds of system design interviews. He talks about what engineers do wrong and how they should change their approach.

Developers should spend more time determining what the important aspects of the problem are and focusing in on those. Spend more time with the interviewer delving into the nuances of the requirements and exploring scenarios to uncover the core priorities and SLAs.

newsletter.techleadmentor.com/p/stop-parroting-youtube-solutions

Open Source Projects with Exceptional Documentation

Reading great open source projects is a fantastic way to learn best practices and improve your own skills as a developer.

If you want to write better documentation, two codebases that serve as great examples are esbuild and Redis. These projects use READMEs, changelogs, architecture documents and code comments to explain their design in a way that helps newcomers understand the structure, practices and reasoning behind design choices.

johnjago.com/great-docs

A simple way to get more value from tracing

Many engineers believe that distributed tracing isn’t worth the effort unless you’re at Big Tech-level scale. However, many developers may overestimate the amount of work required to gain value from tracing.

This is a fantastic post that delves into tracing infrastructure built at Twitter, how much work it took and the insights that the engineers were able to gain.

Tracing and metrics are complementary. Only having one or the other can leave you blind to many problems and result in higher costs from unnecessary incidents, inefficient infra usage and more.

danluu.com/tracing-analytics

Premium Content

Subscribe to Quastor Pro for long-form articles on concepts in system design and backend engineering.

Past article content includes

System Design Concepts

Measuring Availability
API Gateways
Database Replication
Load Balancing
API Paradigms
Database Sharding
Caching Strategies
Event Driven Systems
Database Consistency
Chaos Engineering
Distributed Consensus

Tech Dives

Redis
Postgres
Kafka
DynamoDB
gRPC
Apache Spark
HTTP
DNS
B Trees & LSM Trees
OLAP Databases
Database Engines

When you subscribe, you’ll also get Spaced Repetition (Anki) Flashcards for reviewing all the main concepts discussed in past Quastor articles

Lessons about Tech Leadership from the BBC

Polly McEldowney was a team lead at the BBC where she worked on the BBC Sounds mobile app. Now, she’s a senior staff engineer at Mozilla.

She wrote a terrific blog post for the BBC engineering blog where she delved into the lessons she learned.

Leadership Does Not Mean Telling People What to Do

Polly was initially hesitant to become a leader because she didn’t want to “tell people what to do”.

However, she realized that isn’t what tech leadership entails. Instead, you should make sure that the team understands what tasks need to be done and why (so priorities are clear).

People will add the tasks into their own work streams and pick them up when they have capacity to do so.

Prefer Collective Decision Making

Deciding what needs to be done collectively can be preferable to the leader coming up with the vision herself.

Polly liked to work with her team to outline the high level goals and then break them down into tickets. This can be far more motivating for the team and also tends to generate better ideas.

Pick Your Battles

Influence in the workplace is like a currency that you can choose to spend in a particular argument. Building influence is a slow, laborious process where you gradually gain trust by listening to people’s problems and helping them fix issues.

If you try to wield influence without earning it, people will generally (politely) ignore you. It also gets depleted the more you use it unnecessarily.

Instead, you should use it sparingly on something that is genuinely important.

Be a Servant Leader

A question Polly likes to ask herself is “what is the most helpful thing I can do for the team right now?”

This helps her understand what to focus on. The “servant leader” is someone who keeps themselves fully aware of the challenges the team is facing and constantly works to remove blockers and resolve problems.

This could be mediating a difficult conversation, handling some boring administrative work or putting in a PR for something that the team needs but doesn’t have time for.

Interpersonal Relationships

To Polly, interpersonal relationships are the most important part of the job. She invested a great deal of time into building good relationships with her colleagues.

Having good relationships makes everything easier. It also makes the company a much happier place to work.

Small talk may feel like an indulgence, but it makes the “big talk” much easier to approach when necessary.

These are a couple of the tips Polly mentions in her blog post. Read the full blog post here for the rest of her advice.