How Slack Automates Deploys
Plus, lessons on tech leadership from a Staff Engineer. Open source projects with great documentation and more.
Hey Everyone!
Today we’ll be talking about
How Slack Automates their Deploys
Slack mostly runs on a massive monolith with hundreds of developers pushing hundreds of changes every week
For deployment, they previously had engineers take shifts where they’d manually deploy changes and monitor for any regressions/issues
They’ve shifted to automating this with ReleaseBot
We’ll talk about how they did this and how ReleaseBot monitors for regressions
Things I have learned about Tech Leadership
Polly McEldowney is a senior staff engineer at Mozilla. She wrote a great blog post about her experience leading teams at the BBC and what she learned from being in engineering leadership
Leadership does not mean telling people what to do. Instead, prefer collective decision making
Be a servant leader and know the importance of building interpersonal relationships
Tech Snippets
Inside the failed attempt to backdoor SSH globally (xz Utils Backdoor)
Stop Parroting YouTube solutions in System Design Interviews
Open Source Projects with Exceptional Documentation
Implementing Distributed Tracing
How Slack Automates their Deploys
Slack is a workplace chat tool that helps employees communicate easily. You can send messages, share files, make voice/video calls and more. Slack is used by hundreds of thousands of companies and they have tens of millions of users.
Most of Slack runs on a monolith called The Webapp. It's very large, with hundreds of developers pushing hundreds of changes every week.
Their engineering team uses the Continuous Deployment approach, where they deploy small code changes frequently rather than batching them into larger, less frequent releases.
To handle this, Slack previously had engineers take turns acting as Deployment Commanders (DCs). These DCs would take 2 hour shifts where they’d walk The Webapp through the deployment steps and monitor it in case anything went wrong.
Recently, Slack fully automated the job of Deployment Commanders with “ReleaseBot”. ReleaseBot is responsible for monitoring deployments and detecting any anomalies. If something goes wrong, then it can pause/rollback and send an alert to the relevant team.
Sean McIlroy is a senior software engineer at Slack and he wrote a fantastic blog post about how ReleaseBot works.
Why do Continuous Deployment
As we said earlier, Continuous Deployment (CD) is where you deploy small code changes frequently rather than batching them into larger, less frequent releases.
This has several benefits:
Risk Management - With smaller changes, CD reduces the risk associated with each release. If there’s an issue, it’s much easier to isolate the fault since there are fewer lines of code to examine.
Ship Faster - With CD, engineers can ship features to customers much faster. This allows them to quickly see what features are getting traction and also helps the business beat competitors (since customers can get an updated app faster).
Slack deploys 30-40 times a day to their production servers. Each deploy has a median size of 3 PRs.
Slack’s Old Process of Deploys
The big challenge with Continuous Deployment is the implementation. You need to make it easy for engineers to deploy their changes frequently without accidentally causing an outage.
There should be a clear process for initiating a deployment pause/rollback. On the flip side, you also don’t want to block deployments for small errors (that won’t affect user experience) as that hurts developer velocity.
Previously, Slack had engineers take turns working as the Deployment Commander (DC). They’d work a 2 hour shift where they’d walk The Webapp through the deployment steps. As they did this, they’d watch dashboards and manually run tests.
However, many DCs had difficulty monitoring the system and didn’t feel confident when making decisions. They weren’t sure what specific errors/anomalies meant they should stop the deployment.
To solve this, Slack built ReleaseBot. It handles deployments and also does anomaly detection and monitoring. If anything goes wrong with a deployment, ReleaseBot can cancel the deploy and ping a Slack engineer.
However, Slack had to determine exactly what kind of issues were considered an anomaly. If there’s a 10% spike in latency, should that be enough to cancel the deploy?
In order to program ReleaseBot to properly identify anomalies (without canceling deploys for unimportant reasons), Slack uses two strategies
Z scores
Dynamic Thresholds
Z scores
A z-score is a statistical measurement that indicates how many standard deviations a data point is from the mean of a dataset. It tells you how rare an observation is, which the raw value alone cannot.
If you said that a request’s latency was 270 ms, that number is meaningless without context. If you instead said the latency had a z-score of 3.5, that gives you more context (it took 3.5 standard deviations longer than the average request) and you might look into what went wrong with it.
The formula for calculating a z-score involves subtracting the mean from the data point and then dividing the result by the standard deviation of the dataset.
A z-score greater than zero means the data point is above the mean, while a z-score less than zero means the data point is below the mean. The absolute value of the z-score tells you the number of standard deviations. A z-score of 2.5 means the metric is 2.5 standard deviations above the mean.
Some learnings the Slack team had related to z scores were
Three Standard Deviations - Slack found that three standard deviations was a good starting point for many of their metrics. So if the z score is above 3 or below -3, then you might want to pause and investigate.
Ignore Negatives/Positives - You might want to ignore either positive or negative z scores for some metrics. Do you care if latency decreased by 3 standard deviations? Maybe or maybe not.
Monitor Non-traditional Areas - In addition to the usual suspects (latency, error rate, server load) Slack also monitors other things like total log volume. This helps them catch errors that the usual metrics might miss.
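As a rough illustration of the z score ideas above, here is a minimal Python sketch. It is not ReleaseBot's actual code. The function names, the direction parameter and the sample latency numbers are all hypothetical, and it assumes the sample standard deviation is an acceptable estimate.

```python
import math
from statistics import mean, stdev

def z_score(value, history):
    """How many standard deviations `value` is from the mean of `history`.

    `history` needs at least two data points (statistics.stdev requires it).
    """
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    if sigma == 0:
        # No variation in the history: any different value is infinitely unusual.
        return 0.0 if value == mu else math.copysign(math.inf, value - mu)
    return (value - mu) / sigma

def is_anomalous(value, history, threshold=3.0, direction="both"):
    """Flag `value` when its z score passes the threshold.

    direction="above" ignores negative z scores, direction="below" ignores
    positive ones, and direction="both" checks the absolute value.
    """
    z = z_score(value, history)
    if direction == "above":
        return z > threshold
    if direction == "below":
        return z < -threshold
    return abs(z) > threshold

# Hypothetical recent request latencies in milliseconds (mean is 200 ms).
latency_history_ms = [200, 210, 190, 205, 195, 200, 208, 192]

print(is_anomalous(270, latency_history_ms))                     # large spike
print(is_anomalous(170, latency_history_ms, direction="above"))  # a drop, ignored
```

With this made-up history, a 270 ms reading is far more than three standard deviations above the mean and gets flagged. A 170 ms reading is more than three standard deviations below the mean, so it is only flagged if you choose to care about drops.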
Dynamic Thresholds
In addition to z scores, Slack also uses static and dynamic thresholds. These measure things like database load, error rate, latency, etc.
Static thresholds are… static. They are a pre-configured value and they don’t factor in time-of-day or day-of-week.
Dynamic thresholds, on the other hand, sample historical data from similar time periods for whatever metric is being monitored. They help filter out threshold breaches that are “normal” for the time period.
For example, Slack’s fleet CPU usage might always be lower at 9 pm on a Friday night compared to 8 am on a Monday morning. The Dynamic Threshold for Server CPU usage will take this into account and avoid raising the alarm unnecessarily.
For key metrics, Slack will set static thresholds and have ReleaseBot calculate its own dynamic threshold. Then, ReleaseBot will use the higher of the two.
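Here is a small Python sketch of how combining a static and a dynamic threshold might look. The article does not say how ReleaseBot derives its dynamic threshold, so the formula used here (mean plus three standard deviations of past samples from similar time periods) is purely an assumption, as are the function names and the CPU numbers.

```python
from statistics import mean, stdev

def dynamic_threshold(samples, num_stdevs=3.0):
    """Threshold derived from past samples taken at similar times.

    Assumption: mean + num_stdevs * sample standard deviation.
    """
    return mean(samples) + num_stdevs * stdev(samples)

def effective_threshold(static_threshold, samples):
    """Use the higher of the static and dynamic thresholds."""
    return max(static_threshold, dynamic_threshold(samples))

def breaches(value, static_threshold, samples):
    """True when `value` exceeds the effective threshold."""
    return value > effective_threshold(static_threshold, samples)

# Hypothetical fleet CPU usage (%) from the same time slot in past weeks.
monday_8am = [70, 72, 68, 71, 69]
friday_9pm = [30, 32, 28, 31, 29]
STATIC_CPU_THRESHOLD = 60  # hypothetical pre-configured value

print(breaches(73, STATIC_CPU_THRESHOLD, monday_8am))  # normal for Monday morning
print(breaches(80, STATIC_CPU_THRESHOLD, monday_8am))  # unusually high even for Monday
print(effective_threshold(STATIC_CPU_THRESHOLD, friday_9pm))  # static value wins
```

In this sketch, 73% CPU on a Monday morning is above the static threshold but does not raise the alarm, because the dynamic threshold for that time slot is higher. On a quiet Friday night the dynamic threshold falls below the static one, so the static value is used.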
Results
With these strategies, Slack has been able to automate the task of Deployment Commanders and deploy changes in a safer, more efficient way.
Lessons about Tech Leadership from the BBC
Polly McEldowney was a team lead at the BBC where she worked on the BBC Sounds mobile app. Now, she’s a senior staff engineer at Mozilla.
She wrote a terrific blog post for the BBC engineering blog where she delved into the lessons she learned.
Leadership Does Not Mean Telling People What to Do
Polly was initially hesitant to become a leader because she didn’t want to “tell people what to do”.
However, she’s realized that isn’t what tech leadership entails. Instead, you should make sure that the team understands what tasks need to be done and why (priorities are clear).
People will add the tasks into their own work streams and pick them up when they have capacity to do so.
Prefer Collective Decision Making
Deciding what needs to be done collectively can be preferable to the leader coming up with the vision herself.
Polly liked to work with her team to outline the high level goals and then break them down into tickets. This can be far more motivating for the team and also tends to generate better ideas.
Pick Your Battles
Influence in the workplace is like a currency that you can choose to spend in a particular argument. Building influence is a slow, laborious process where you gradually gain trust by listening to people’s problems and helping them fix issues.
If you try to wield influence without earning it, people will generally (politely) ignore you. It also gets depleted the more you use it unnecessarily.
Instead, you should use it sparingly on something that is genuinely important.
Be a Servant Leader
A question Polly likes to ask herself is “what is the most helpful thing I can do for the team right now?”
This helps her understand what to focus on. The “servant leader” is someone who keeps themselves fully aware of the challenges the team is facing and constantly works to remove blockers and resolve problems.
This could be mediating a difficult conversation, handling some boring administrative work or putting in a PR for something that the team needs but doesn’t have time for.
Interpersonal Relationships
To Polly, interpersonal relationships are the most important part of the job. She invested a great deal of time into building good relationships with her colleagues.
Having good relationships makes everything easier. It also makes the company a much happier place to work.
Small talk may feel like an indulgence, but it makes the “big talk” much easier to approach when necessary.
These are a few of the tips Polly mentions in her blog post. Read the full post for the rest of her advice.