Challenges that come with Distributed Systems

July 19, 2022

Hey Everyone!

Today we’ll be talking about

Challenges That Come When Working With Real Time Distributed Systems
- This is a great article from the Amazon Builder's Library on some of the challenges that come with distributed systems.
- Request/Reply networking massively expands the amount of testing you have to do and introduces many potential areas for faults.
- Bugs in distributed systems are often latent and spread epidemically
Tech Snippets
- The Tech Behind Bloom Training (an open source version of GPT-3)
- Why am I Excited about WebAssembly?
- 5000x faster CRDTs: An Adventure in Optimization
- Bloom Filters - Much, much more than a space efficient hashmap
- The Embedded Rust Book

Challenges with Distributed Systems

When you’re working at massive scale, you’ll usually have to resort to horizontal scaling to scale the system up. This means working with a distributed system and dealing with all the ensuing challenges.

Jacob Gabrielson is a VP & Distinguished Engineer at Oracle and was previously a Senior Principal Engineer at Amazon. He wrote a great article for the AWS Builder’s Library on the challenges he faced while building distributed systems during his 20 years at Amazon.

Here’s a summary

Types of Distributed Systems

Distributed systems can be divided into different categories, and some categories have more challenges than others.

On the “easier” side (but still far from trivial to implement) are Offline Distributed Systems where you take a batch job and split it up across many machines that are located in close proximity. These systems are frequently used for big data analysis or high performance computing. You can get almost all the benefits of distributed computing (scalability and fault tolerance) and avoid much of the downside (complex failure modes and non-determinism).

In the middle are Soft Real-Time Distributed Systems. These are systems that must continually produce or update results, but have a relatively generous time window in which to do so (hence soft real-time). Things like web crawlers, search indexers, ML training infrastructure, etc. The system can go down for several hours without undue customer impact.

The most difficult are Hard Real-Time Distributed Systems. These are request/reply services where clients will randomly send requests and expect an immediate reply. Web servers, credit card processors, every AWS API, etc. are examples of hard, real-time distributed systems. The article delves into why these systems are difficult to build.

Complexity

Request/reply networking is the main reason why hard, real-time distributed systems are so challenging. Regardless of what protocols you’re using, using the network means you’re sending messages from one fault domain to another.

This introduces many steps where something can go wrong. As your systems grow larger, what had previously been theoretical edge cases will turn into regular occurrences due to the law of large numbers.

Here are the steps involved with request/reply networking.

Post Request - The client sends the request message onto the network.
Deliver Request - The network delivers the message to the server.
Validate Request - The server validates the message.
Update Server State - The server may update its state based on the message.
Post Reply - The server sends a reply onto the network.
Deliver Reply - The network delivers the reply to the client.
Validate Reply - The client validates the reply.
Update Client State - The client may update its state based on the reply.

Creating a distributed system means introducing all of these steps into your program. It turns one step (calling a method or writing to disk) into eight steps that will each fail with some non-zero probability.

Handling Failure Modes and Testing

When you’re working with a single machine, fate sharing reduces the complexity of the testing process. Fate sharing is where when one component of the system fails, then everything else will fail too. It cuts down on the different failure modes that you have to handle.

With a single machine, you don’t have to test for conditions where the CPU dies. If the CPU dies on your laptop, then those test conditions obviously won’t be processed anyway.

However, in hard real-time distributed systems, the client, network and server do not share fate. One of the machines can die on the backend but the other machines, the client and the network will still function as normal.

This means testing for all possible failure scenarios and controlling for code behavior during these faults. The increased number of failure modes multiply the number of test conditions.

Previously, you had to write a test for handling bugs in the method you were calling. Now, you still need those tests but you also need to test for network failures, unrelated server failures, delayed responses, no responses, etc.

Each of those eight steps in request/reply networking introduce possible failure modes, and building distributed systems at scale means you have to test for all of them and handle all the permutations.

Distributed Bugs are Often Latent

If a failure is going to happen, common wisdom is that it’s better if it happens sooner rather than later.

Distributed bugs (those that result from failing to handle all the permutations of the eight failure modes) are usually severe and can be caused by bugs that were deployed to production months earlier.

It takes a while to trigger the exact combination of scenarios that lead to these bugs happening, hence the delay.

Distributed Bugs Spread Epidemically

Another problem that is fundamental to distributed bugs is that they involve use of the network. Therefore, these bugs are more likely to spread and start to cause problems in other machines on the network.

This is especially true since distributed systems will have multiple layers of abstraction. Your system usually won’t just be a single client, a network and a single server machine.

Instead, the backend will consist of multiple machines grouped together across different geographic regions.

Jason elaborates on this by giving a story of a bug that took down the Amazon website. It was caused by a single server failing within the remote catalog service when its disk filled up.

“The failure was caused by a single server failing within the remote catalog service when its disk filled up. Due to mishandling of that error condition, the remote catalog server started returning empty responses to every request it received. It also started returning them very quickly, because it’s a lot faster to return nothing than something (at least it was in this case). Meanwhile, the load balancer between the website and the remote catalog service didn’t notice that all the responses were zero-length. But, it did notice that they were blazingly faster than all the other remote catalog servers. So, it sent a huge amount of the traffic from www.amazon.com to the one remote catalog server whose disk was full. Effectively, the entire website went down because one remote server couldn’t display any product information.”

For more details, you can read the full blog post here.

Tech Snippets

The Technology Behind Bloom Training - Bloom is an open source language model developed by HuggingFace that has a very similar architecture to OpenAI's GPT-3. Bloom was trained on a 1.5 terabyte dataset of cleaned text in 46 languages and has 176 billion parameters (GPT-3 has 175 billion parameters). It was trained on Jean Zay, a French-government funded supercomputer and took 3.5 months to train (1 million compute hours). Read the full article to learn about the process.

Why Am I Excited about WebAssembly? - WebAssembly is a portable intermediate language that can be executed on many different platforms (web browser, server-side, on the edge, etc.). You can write your code in C, C++, Rust, Go, Swift and then compile it to WebAssembly. This is an interesting blog post that delves into WebAssembly’s use cases for the Internet of Things.

5000x faster CRDTs: An Adventure in Optimization - Conflict Free Replicated Data Types are a data structure that lets multiple (online or offline) users edit the same data at the same time. When the offline users connect online, everything will sync up and become consistent. CRDTs are used extensively with applications in the Facebook app, Google Docs, Redis, Riak and many others. This is a really interesting blog post that delves into CRDTs and talks about how the author made a fast, lightweight CRDT implementation.

Bloom Filters - Much, much more than a space efficient hashmap! - A Bloom Filter is a space-efficient data structure that will tell you if an element is not a member of a set (in constant time). If you have a website with hundreds of millions of users and you want to check if a username has already been taken, using a hash table might take too much space. A bloom filter could do that for you and could easily fit in RAM. Bloom filters also have many more applications! This is a great blog post that goes into some other interesting ways to use them.

The Embedded Rust Book - This is an awesome book on using Rust on bare metal embedded systems. Typically, you’d use C/C++ but Rust provides a bunch of other features/safety-guarantees. The book explains how to write embedded programs using QEMU, an open source hardware emulator.

Interview Question

You are given a non-negative number n.

Count the number of primes less than n and return the count.

Here’s the question in LeetCode.

Previous Question

As a refresher, here’s the previous question

You are given a math equation that consists of positive integers and +, -, * and / operators (addition, subtraction, multiplication and division).

Write a function that computes the result.

Be sure to follow order of operations (you will not be given parentheses).

Input - “2 * 3 + 5 / 6 * 3 + 15”

Output - 23.5

Solution

First of all, it’s important to remember order of operations.

Solve any multiplication/division operations
Solve any addition/subtraction operations

The underlying data structure necessary to solve this question is a stack.

One stack will keep track of our numbers and the other will keep track of our operators (addition/subtraction/multiplication/division).

We’ll iterate through the input string and every time we see a number, we’ll push the number on to our number stack.

Remember that numbers can be multiple digits, so we’ll have to check the next item in our string to see if it’s a number with multiple digits.

Every time we see an operator, then we’ll first check what the priority of our operation is.

If the current top of our operation stack is a multiplication/division operation, then we’ll have to solve the operation at the top of our stack first.

So, we’ll pop the top element off our operation stack and pop off the top 2 elements from our number stack and we’ll calculate the operation.

Then, we’ll push the result to our number stack and continue on processing the input string.

After processing the input string, we’ll then process our number and operation stacks until they’re completely empty (just continuously pop off elements from the stacks and complete the operations until the operation stack is empty).

Finally, we should be left with one remaining number in number stack and we can return that as our final answer.