How Stack Overflow Handles DDoS Attacks
Hey Everyone!
Today we’ll be talking about
How Stack Overflow Handles DDoS Attacks
An overview of DDoS Attacks and the different types
How attackers targeted Stack Overflow by spamming actions that trigger expensive database queries and by flooding the API with HTTP POST requests
The steps Stack Exchange took to mitigate these attacks and make them ineffective in the future
Tech Snippets
How Slack uses Infrastructure as Code (Terraform)
A collection of Post-Mortems published by companies on why their systems went down.
A Curated List of Software and Architecture Related Design Patterns
Plus, we have a solution to our last coding interview question.
How Stack Overflow handles DDoS Attacks
In early 2022, Stack Overflow was the target of ongoing Distributed Denial of Service (DDoS) attacks. These attacks went on for weeks and progressively grew in size and scope. In each incident, the attacker would change their methodology in response to Stack Overflow’s countermeasures.
The DDoS attempts targeted the Stack Exchange API and their website. Hackers mimicked users browsing the Stack Overflow website using a botnet and also sent a flood of HTTP requests to various routes on the API.
Stack Overflow is still getting hit by DDoS attempts, but they've been able to minimize the impact thanks to work by their Site Reliability Engineering teams and changes to their codebase.
Josh Zhang is a Staff SRE at Stack Exchange and he wrote a great blog post on how they handled these attacks and what policies they implemented to mitigate future attacks.
We’ll summarize the post below and add some context on DDoS attacks.
Overview of DDoS Attacks
With a Distributed Denial of Service Attack, a hacker will use many geographically distributed machines to send traffic to a website. These machines usually belong to an unsuspecting user and have been infected with malware to make them part of the attacker’s botnet.
The goal is to overwhelm the target’s backend with traffic so that the site can no longer serve legitimate users. The attacker might then demand a ransom from the company, offering to stop the DDoS attack if the company pays up.
DDoS Attacks can roughly be split into 3 main types: Volumetric, Application Layer and Protocol Attacks.
Volumetric Attacks
These attacks are based on brute force techniques where the target server is flooded with data packets to consume bandwidth and server resources.
Volumetric attacks will frequently rely on amplification and reflection.
Amplification is where a request in a certain protocol results in a much larger response (in terms of the number of bits); the ratio of the response size to the request size is called the Amplification Factor.
Reflection is where the attacker will forge the source of request packets to be the target victim’s IP address. Servers will be unable to distinguish legitimate from spoofed requests so they’ll send the (much larger) response payload to the targeted victim’s servers and unintentionally flood them.
Network Time Protocol (NTP) DDoS attacks are an example of volumetric attacks, where an attacker can send a 234-byte spoofed request to an NTP server, which will then send a 48,000-byte response to the target victim. Attackers will repeat this on many different open NTP servers simultaneously to DDoS the victim.
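To make the amplification factor concrete, here is a minimal sketch that plugs the request and response sizes quoted above into the response/request ratio. The function name amplification_factor is our own, chosen for illustration.

```python
def amplification_factor(request_bytes: int, response_bytes: int) -> float:
    """Response size divided by request size."""
    if request_bytes <= 0:
        raise ValueError("request_bytes must be positive")
    return response_bytes / request_bytes


# Sizes taken from the NTP example in the text above.
ntp_factor = amplification_factor(request_bytes=234, response_bytes=48_000)
print(f"NTP amplification factor: roughly {ntp_factor:.0f}x")
```

In other words, each byte the attacker sends turns into a couple of hundred bytes arriving at the victim, which is why open servers for protocols like this are attractive to attackers.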
Application Layer Attacks
These DDoS attacks target Layer 7 in the OSI Model (the Application Layer) through layer 7 protocols or by attacking the applications themselves. Examples include HTTP flood attacks, attacks that target web server software and more.
Database DDoS attacks are quite common, where a hacker will look for requests that are particularly database-intensive and then spam these in an attempt to exhaust the database resources. Scaling a stateless web server is a lot faster than adding read replicas to your database, so this can be quite successful.
HTTP Floods are some of the most widely seen layer 7 DDoS attacks, where hackers will spam a web server with HTTP GET/POST requests. The intelligent ones will specifically design these to request resources with low usage in order to maximize the number of cache misses the web server has and increase the load on the database.
Protocol Layer Attacks
Protocol attacks will rely on weaknesses in how particular protocols are designed. Examples of these kinds of exploits are SYN floods, BGP hijacking, Smurf attacks and more.
A SYN flood attack exploits how TCP is designed, specifically the handshake process. The three-way handshake consists of SYN-SYN/ACK-ACK, where the client sends a synchronize (SYN) message to initiate, the server responds with a synchronize-acknowledge (SYN-ACK) message and the client then responds back with an acknowledgement (ACK) message.
In a SYN flood attack, a malicious client sends large volumes of SYN messages to the server, which responds to each one with a SYN-ACK. The client ignores these and never sends the final ACK. The server wastes resources (memory and slots in its queue of half-open connections) waiting for ACKs that never arrive. Repeat this on a large enough scale and the server can no longer accept new connections, since it can’t tell which SYNs are legitimate.
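The effect can be illustrated with a toy model. The sketch below is not how any real TCP stack is implemented; it assumes a fixed-size table of half-open connections with a timeout, and the class name SynBacklog, the capacity of 128 and the 30 second timeout are all made up for the example.

```python
from collections import OrderedDict


class SynBacklog:
    """Toy model of a server's table of half-open TCP connections."""

    def __init__(self, capacity: int, timeout: float):
        self.capacity = capacity
        self.timeout = timeout
        self.half_open = OrderedDict()  # client -> time its SYN arrived

    def expire(self, now: float) -> None:
        # Drop the oldest entries whose final ACK never arrived in time.
        while self.half_open:
            client, started = next(iter(self.half_open.items()))
            if now - started < self.timeout:
                break
            del self.half_open[client]

    def on_syn(self, client, now: float) -> bool:
        """Return True if the server can track the handshake and send SYN-ACK."""
        self.expire(now)
        if client in self.half_open:
            return True  # retransmitted SYN, already tracked
        if len(self.half_open) >= self.capacity:
            return False  # table full: the SYN is dropped
        self.half_open[client] = now
        return True

    def on_ack(self, client) -> bool:
        """The final ACK completes the handshake and frees the slot."""
        return self.half_open.pop(client, None) is not None


backlog = SynBacklog(capacity=128, timeout=30.0)

# Attacker: 200 SYNs from different (spoofed) sources, none ever ACKed.
accepted = sum(
    backlog.on_syn((f"10.0.{i // 256}.{i % 256}", 40000), now=0.0)
    for i in range(200)
)

# A legitimate client is turned away while the table is full...
legit_during = backlog.on_syn(("203.0.113.7", 51000), now=1.0)
# ...and only gets in once the attacker's entries have timed out.
legit_after = backlog.on_syn(("203.0.113.7", 51000), now=31.0)
print(accepted, legit_during, legit_after)
```

The attacker never needs to complete a handshake; keeping the table full is enough to lock legitimate clients out until entries expire.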
DDoS Attacks at Stack Overflow
Stack Overflow was hit with two main types of attacks:
Mimicked user browsing, where machines in the botnet browsed the Stack Overflow website and attempted actions that triggered expensive database queries.
HTTP flood attacks on the Stack Overflow API, where the botnet did things like send a large number of POST requests.
The attacks were distributed over a huge pool of IP addresses, where some IPs only sent a couple of requests. This made Stack Overflow's standard policy of rate limiting by IP ineffective.
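To see why per-IP rate limiting struggles here, consider a minimal sliding-window limiter. This is a sketch under our own assumptions (the class name PerIpRateLimiter and the limit of 10 requests per 60 seconds are invented) and not Stack Overflow's actual implementation.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allow at most `limit` requests per IP in any `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str, now: float) -> bool:
        recent = self.hits[ip]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


limiter = PerIpRateLimiter(limit=10, window=60.0)

# One noisy IP sending 1,000 requests: all but the first 10 are rejected.
single_ip_allowed = sum(
    limiter.allow("198.51.100.1", now=0.0) for _ in range(1000)
)

# A botnet of 500 IPs sending 2 requests each: the same 1,000 requests,
# but no individual IP ever comes close to the limit.
botnet_allowed = sum(
    limiter.allow(f"10.1.{i // 256}.{i % 256}", now=0.0)
    for i in range(500)
    for _ in range(2)
)
print(single_ip_allowed, botnet_allowed)
```

The total load is identical in both cases, but the limiter only sees per-IP counts, so the distributed version passes straight through.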
For a short term solution, the Stack Overflow team implemented numerous filters to try and sort out malicious requests and block them. Initially, the filters were overzealous and blocked some legitimate requests but over time the team was able to refine them.
The long term solution was to implement numerous protections for their backend.
Here’s a list of some of the protections the team put in place:
Authenticate - Insist that every API call be authenticated. This helps massively in identifying malicious users. If this is not possible, then set strict limits for anonymous/unauthenticated traffic.
Block Weird URLs - If you’re being DDoS’d and you’re getting HTTP calls with trailing slashes where you don’t use them, requests to invalid paths and other irregularities then that can be a signal of a malicious machine. You may want to filter IPs that send those kinds of requests.
Tar Pitting - A tarpit is a service that purposely delays incoming connections. Rather than completely blocking suspicious requests (where you aren't sure if they're malicious), you might intentionally add a bit of delay to the responses for them. This can slow down the botnet by increasing the time between requests while reducing the amount of collateral damage on innocent users who were accidentally flagged as bots.
Minimize Data - Minimize the amount of data a single API call can return. Implement things like pagination to limit API calls that are extremely expensive.
Load Balancers - Put a reverse proxy in front of your application to filter malicious traffic before it reaches your servers. Stack Overflow uses HAProxy load balancers. Implement thorough and easily queryable logs so you can quickly identify and block malicious IPs.
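Several of these protections can be combined into a simple request triage step. The sketch below is purely illustrative: the routes in KNOWN_PATHS, the scoring weights, the thresholds, the 2 second delay and the page size cap are all assumptions we made up, and in practice this kind of logic would more likely live in the reverse proxy than in application code.

```python
import time
from dataclasses import dataclass

KNOWN_PATHS = {"/questions", "/answers", "/users"}  # hypothetical API routes
MAX_PAGE_SIZE = 100  # hypothetical cap on items returned per call


@dataclass
class Request:
    path: str
    authenticated: bool
    page_size: int = 30


def suspicion_score(req: Request) -> int:
    score = 0
    if not req.authenticated:
        score += 1  # anonymous traffic gets less benefit of the doubt
    if req.path.endswith("/"):
        score += 1  # trailing slash where our routes never use one
    if req.path.rstrip("/") not in KNOWN_PATHS:
        score += 2  # request to a path that does not exist
    return score


def decide(req: Request) -> str:
    score = suspicion_score(req)
    if score >= 3:
        return "block"
    if score == 2:
        return "tarpit"  # unsure, so slow it down instead of blocking
    return "allow"


def handle(req: Request, tarpit_delay: float = 2.0, sleep=time.sleep):
    """Return (status_code, items_returned) for a request."""
    action = decide(req)
    if action == "block":
        return 403, 0
    if action == "tarpit":
        # Blocking sleep keeps the sketch short; a real tarpit would
        # delay the response without tying up a worker.
        sleep(tarpit_delay)
    # Minimize data: clamp the page size no matter what was asked for.
    return 200, min(max(req.page_size, 1), MAX_PAGE_SIZE)


print(decide(Request("/questions", authenticated=True)))    # allow
print(decide(Request("/questions/", authenticated=False)))  # tarpit
print(decide(Request("/wp-admin", authenticated=False)))    # block
```

The tarpit branch is what limits collateral damage: a wrongly flagged user sees a slower response rather than an error, while a bot is forced to wait between requests.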
Some of the lessons the Stack Overflow team learned were
Invest in Monitoring and Alerting - Having a robust stack for monitoring helped massively with alerting, identifying and blocking malicious actors. The application layer attacks in particular had telltale signs that the team added to their monitoring portfolio.
Automate - Because they were dealing with several DDoS attacks in a row, the team could spot patterns in the attacks. Whenever an SRE saw a pattern, they automated the detection and mitigation of it to minimize any downtime.
Write It All Down - It can be hard to step back during a crisis and take notes, but these notes can be invaluable for future SREs. The team made sure to take out time after the attacks to create runbooks for future DDoS attempts.
Inform Users - Communicating the situation to users is extremely important, especially since many innocent users can get caught up in your blocking filters. Tor exit nodes were the source of a significant amount of traffic during one of the volumetric attacks, so Stack Overflow blocked them. This created issues for legitimate users who were using the same IPs. The engineering team had to get on Meta Stack Exchange to explain the situation.
Tech Snippets
A Collection of Post Mortems - This is a really cool list of post mortem blog posts where companies talk about why their systems went down. Learning from other people’s mistakes is far cheaper than learning from your own, so it’s great to read some of these to understand all the different things that can go wrong.
Infrastructure as Code at Slack - For their infrastructure, Slack relies on AWS, Google Cloud Platform and DigitalOcean. To manage their infrastructure across these platforms, Slack uses Terraform. They published a great blog post that delves into their setup, with details on how they manage the Terraform codebase, where they run and test changes, and the issues they've had to solve so far.
A curated list of software and architecture related design patterns - This is a great GitHub repo with a ton of resources on software architecture. It has links to articles/books on cloud architectures and patterns, serverless strategies, SQL/Non-Relational databases and more.
Get a look into what CTOs are reading - If you find Quastor useful, you should check out Pointer.io. It's a reading club for software developers that sends out super high quality engineering-related content. It's read by CTOs, engineering managers and senior engineers so you should definitely sign up if you're on that path (or if you want to go down that path in the future). It's completely free! (cross promo)
Interview Question
You are given an n x n 2D matrix.
Rotate the matrix by 90 degrees clockwise.
Previous Question
As a reminder, here’s our last question
You are given an array nums of distinct positive integers.
Return the number of tuples (a, b, c, d) such that
a * b = c * d
where a, b, c and d are elements of nums and
a != b != c != d
Solution
The naive solution would be to use 4 nested for loops.
We iterate through nums in 4 nested loops with each loop representing a, b, c and d respectively.
Then, we can count the number of tuples where a * b = c * d.
This would result in a time complexity of O(n^4).
We can do better by using more memory in exchange for less time.
We start by splitting up our calculations.
First, we calculate every possible product of a * b.
We use a nested for loop (only 2 loops this time) with the first loop representing a and the second loop representing b.
In our inner for loop, we calculate the possible products for a * b.
We’ll store all of these products in a hash table where the keys of the hash table are the products and the values are how many times each of those products appeared as a result of a * b.
After, we’ll have a second nested for loop, with the first loop representing c and the second loop representing d.
We calculate the value of c * d and then check our hash table for how many times this product appeared.
Then, we add that count to a running total.
However, we must avoid counting tuples that violate the condition that
a != b != c != d
To do this, we subtract 2 from the value in our hash table before adding it to the total. This removes the two cases where a = c, b = d and where a = d, b = c. Since the elements of nums are distinct, those are the only pairs with the same product that share an element with (c, d).
After tallying up all the counts, we can return the total number of tuples.
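Here is one way the hash table approach described above could look in Python. It is a sketch rather than the only valid solution; the function name count_tuples is our own, and both pairs of loops skip the case where the two indices are equal so that an element is never paired with itself.

```python
from collections import Counter


def count_tuples(nums: list[int]) -> int:
    n = len(nums)

    # Pass 1: count how many ordered pairs (a, b) give each product.
    products = Counter()
    for i in range(n):
        for j in range(n):
            if i != j:
                products[nums[i] * nums[j]] += 1

    # Pass 2: for every ordered pair (c, d), add the number of (a, b)
    # pairs with the same product, minus the two pairs that reuse
    # c and d themselves: (c, d) and (d, c).
    total = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += products[nums[i] * nums[j]] - 2
    return total


print(count_tuples([2, 3, 4, 6]))  # 8, from 2 * 6 = 3 * 4
```

Both passes are O(n^2) in time, and the hash table holds at most O(n^2) products, which is the memory we trade for dropping from O(n^4).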