How Slack Automates Deploys

Plus, lessons on tech leadership from a Staff Engineer, open source projects with great documentation and more.

Hey Everyone!

Today we’ll be talking about

  • How Slack Automates their Deploys

    • Slack mostly runs on a massive monolith with hundreds of developers pushing hundreds of changes every week

    • For deployment, they previously had engineers take shifts where they’d manually deploy changes and monitor for any regressions/issues

    • They’ve shifted to automating this with ReleaseBot

    • We’ll talk about how they did this and how ReleaseBot monitors for regressions

  • Things I have learned about Tech Leadership

    • Polly McEldowney is a senior staff engineer at Mozilla. She wrote a great blog post about her experience leading teams at the BBC and what she learned from being in engineering leadership

    • Leadership does not mean telling people what to do. Instead, prefer collective decision making

    • Be a servant leader and know the importance of building interpersonal relationships

  • Tech Snippets

    • Inside the failed attempt to backdoor SSH globally (xz Utils Backdoor)

    • Stop Parroting YouTube solutions in System Design Interviews

    • Open Source Projects with Exceptional Documentation

    • Implementing Distributed Tracing

How Slack Automates their Deploys

Slack is a workplace chat tool that helps employees communicate easily. You can send messages, share files, make voice/video calls and more. Slack is used by hundreds of thousands of companies and has tens of millions of users.

Most of Slack runs on a monolith called The Webapp. It's very large, with hundreds of developers pushing hundreds of changes every week.

Their engineering team uses the Continuous Deployment approach, where they deploy small code changes frequently rather than batching them into larger, less frequent releases.

To handle this, Slack previously had engineers take turns acting as Deployment Commanders (DCs). These DCs would take 2-hour shifts in which they’d walk The Webapp through the deployment steps and monitor it in case anything went wrong.

Recently, Slack fully automated the job of Deployment Commanders with “ReleaseBot”. ReleaseBot is responsible for monitoring deployments and detecting any anomalies. If something goes wrong, it can pause or roll back the deploy and send an alert to the relevant team.

Sean McIlroy is a senior software engineer at Slack and he wrote a fantastic blog post about how ReleaseBot works.

Why do Continuous Deployment

As we said earlier, Continuous Deployment (CD) is where you deploy small code changes frequently rather than batching them into larger, less frequent releases.

This has several benefits:

  • Risk Management - With smaller changes, CD reduces the risk associated with each release. If there’s an issue, it’s much easier to isolate the fault since there are fewer lines of code to check.

  • Ship Faster - With CD, engineers can ship features to customers much faster. This lets them quickly see which features are getting traction and also helps the business beat competitors (since customers get an updated app sooner).

Slack deploys 30-40 times a day to their production servers. Each deploy has a median size of 3 PRs.

Slack’s Old Process of Deploys

The big challenge with Continuous Deployment is the implementation. You need to make it easy for engineers to deploy their changes frequently without accidentally causing an outage.

There should be a clear process for initiating a deployment pause/rollback. On the flip side, you also don’t want to block deployments for small errors (that won’t affect user experience) as that hurts developer velocity.

Previously, Slack had engineers take turns working as the Deployment Commander (DC). They’d work a 2-hour shift in which they’d walk The Webapp through the deployment steps. As they did this, they’d watch dashboards and manually run tests.

However, many DCs had difficulty monitoring the system and didn’t feel confident when making decisions. They weren’t sure which specific errors or anomalies meant they should stop the deployment.

To solve this, Slack built ReleaseBot. It handles deployments and also does anomaly detection and monitoring. If anything goes wrong with a deployment, ReleaseBot can cancel the deploy and ping a Slack engineer.

However, Slack had to determine exactly what kind of issues were considered an anomaly. If there’s a 10% spike in latency, should that be enough to cancel the deploy?

To program ReleaseBot to properly identify anomalies (without canceling deploys for unimportant reasons), Slack uses two strategies:

  • Z scores

  • Dynamic Thresholds

Z scores

A z-score is a statistical measurement that indicates how many standard deviations a data point is from the mean of a dataset. It gives you context for how unusual an observation is, which the raw value alone does not.

If you said that a request’s latency was 270 ms, that number would be meaningless without context. If you instead said the latency had a z-score of 3.5, that gives you more context (the request took 3.5 standard deviations longer than the average request) and you might look into what went wrong with it.

The formula for calculating a z-score involves subtracting the mean from the data point and then dividing the result by the standard deviation of the dataset.

A z-score greater than zero means the data point is above the mean, while a z-score less than zero means the data point is below the mean. The absolute value of the z-score tells you the number of standard deviations. A z-score of 2.5 means the metric is 2.5 standard deviations above the mean.
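
To make the formula concrete, here is a minimal Python sketch of the calculation described above. The function name z_score and the sample latencies are made up for illustration; they are not taken from Slack’s post.

```python
from statistics import mean, pstdev


def z_score(value, history):
    """Return how many standard deviations `value` is from the mean of `history`."""
    mu = mean(history)
    sigma = pstdev(history)  # population standard deviation of the dataset
    if sigma == 0:
        # Every historical value is identical, so the z-score is undefined.
        raise ValueError("standard deviation is zero; z-score is undefined")
    return (value - mu) / sigma


# Hypothetical request latencies in ms: mean is 100, standard deviation is 10.
latencies_ms = [90, 110, 90, 110]
print(z_score(125, latencies_ms))  # 2.5, i.e. 2.5 standard deviations above the mean
```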

Some learnings the Slack team had related to z scores were:

  1. Three Standard Deviations - Slack found that three standard deviations was a good starting point for many of their metrics. So if a metric’s z score is above 3 or below -3, then you might want to pause and investigate.

  2. Ignore Negatives/Positives - You might want to ignore either positive or negative z scores for some metrics. Do you care if latency decreased by 3 standard deviations? Maybe or maybe not.

  3. Monitor Non-traditional Areas - In addition to the usual suspects (latency, error rate, server load) Slack also monitors other things like total log volume. This helps them catch errors that the usual metrics might miss.
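
The first two learnings could be combined into a check along these lines. This is only a sketch under stated assumptions: the function is_anomalous, its direction parameter and the sample data are hypothetical, and the post does not describe how ReleaseBot actually implements this.

```python
from statistics import mean, pstdev


def is_anomalous(value, history, threshold=3.0, direction="both"):
    """Flag `value` if its z-score against `history` exceeds the threshold.

    direction="above" ignores negative z-scores, direction="below" ignores
    positive ones, and direction="both" checks either side.
    """
    if direction not in ("both", "above", "below"):
        raise ValueError(f"unknown direction: {direction}")
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Assumption for this sketch: a perfectly flat baseline is never flagged.
        return False
    z = (value - mu) / sigma
    if direction == "above":
        return z > threshold
    if direction == "below":
        return z < -threshold
    return abs(z) > threshold


latencies_ms = [90, 110, 90, 110]  # hypothetical baseline: mean 100, std dev 10
print(is_anomalous(135, latencies_ms))                    # True  (z = 3.5)
print(is_anomalous(65, latencies_ms))                     # True  (z = -3.5)
print(is_anomalous(65, latencies_ms, direction="above"))  # False (drops are ignored)
```

For a metric like latency you would probably pick direction="above", since a sudden drop is rarely a reason to stop a deploy.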

Dynamic Thresholds

In addition to z scores, Slack also uses static and dynamic thresholds. These measure things like database load, error rate, latency, etc.

Static thresholds are… static. They are a pre-configured value and they don’t factor in time-of-day or day-of-week.

Dynamic thresholds, on the other hand, sample historical data from similar time periods for whatever metric is being monitored. They help filter out threshold breaches that are “normal” for the time period.

For example, Slack’s fleet CPU usage might always be lower at 9 pm on a Friday night compared to 8 am on a Monday morning. The Dynamic Threshold for Server CPU usage will take this into account and avoid raising the alarm unnecessarily.

For key metrics, Slack will set a static threshold and have ReleaseBot calculate its own dynamic threshold. Then, ReleaseBot will use the higher of the two.
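
Here is a rough Python sketch of the “use the higher of the two” rule. The post does not say how ReleaseBot derives its dynamic threshold, so this example assumes it is the mean plus three standard deviations of samples from similar past time periods. All function names and CPU numbers are hypothetical.

```python
from statistics import mean, pstdev


def dynamic_threshold(similar_period_samples, num_stdevs=3.0):
    """Assumed formula: mean + N standard deviations of comparable past samples."""
    return mean(similar_period_samples) + num_stdevs * pstdev(similar_period_samples)


def effective_threshold(static_threshold, similar_period_samples):
    """Use whichever is higher: the pre-configured value or the dynamic one."""
    return max(static_threshold, dynamic_threshold(similar_period_samples))


def breaches(current_value, static_threshold, similar_period_samples):
    """Return True if the current value is above the effective threshold."""
    return current_value > effective_threshold(static_threshold, similar_period_samples)


STATIC_CPU_THRESHOLD = 85  # percent, hypothetical pre-configured value

# Hypothetical fleet CPU % sampled at the same hour on previous weeks.
monday_8am = [70, 80, 70, 80]  # dynamic threshold works out to 90
friday_9pm = [30, 40, 30, 40]  # dynamic threshold works out to 50

print(breaches(88, STATIC_CPU_THRESHOLD, monday_8am))  # False: 88 is normal for Monday 8 am
print(breaches(88, STATIC_CPU_THRESHOLD, friday_9pm))  # True: the static 85 applies here
```

Taking the higher of the two means a busy but normal period does not trip the static limit, while a quiet period still falls back to the static value rather than alerting on small bumps.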

Results

With these strategies, Slack has been able to automate the task of Deployment Commanders and deploy changes in a safer, more efficient way.

Lessons about Tech Leadership from the BBC

Polly McEldowney was a team lead at the BBC where she worked on the BBC Sounds mobile app. Now, she’s a senior staff engineer at Mozilla.

She wrote a terrific blog post for the BBC engineering blog where she delved into the lessons she learned.

Leadership Does Not Mean Telling People What to Do

Polly was initially hesitant to become a leader because she didn’t want to “tell people what to do”.

However, she’s realized that isn’t what tech leadership entails. Instead, you should make sure that the team understands what tasks need to be done and why, so that priorities are clear.

People will add the tasks into their own work streams and pick them up when they have capacity to do so.

Prefer Collective Decision Making

Deciding what needs to be done collectively can be preferable to the leader coming up with the vision herself.

Polly liked to work with her team to outline the high level goals and then break them down into tickets. This can be far more motivating for the team and also tends to generate better ideas.

Pick Your Battles

Influence in the workplace is like a currency that you can choose to spend in a particular argument. Building influence is a slow, laborious process where you gradually gain trust by listening to people’s problems and helping them fix issues.

If you try to wield influence without earning it, people will generally (politely) ignore you. It also gets depleted the more you use it unnecessarily.

Instead, you should use it sparingly on something that is genuinely important.

Be a Servant Leader

A question Polly likes to ask herself is “what is the most helpful thing I can do for the team right now?”

This helps her understand what to focus on. The “servant leader” is someone who keeps themselves fully aware of the challenges the team is facing and constantly works to remove blockers and resolve problems.

This could be mediating a difficult conversation, handling some boring administrative work or putting in a PR for something that the team needs but doesn’t have time for.

Interpersonal Relationships

To Polly, interpersonal relationships are the most important part of the job. She invested a great deal of time into building good relationships with her colleagues.

Having good relationships makes everything easier. It also makes the company a much happier place to work.

Small talk may feel like an indulgence, but it makes the “big talk” much easier to approach when necessary.

These are a few of the tips Polly mentions in her blog post. Read the full post for the rest of her advice.