Tech Dive on the CAP Theorem
We'll talk about the CAP Theorem and why it's useful. Plus, we'll delve into common misconceptions and then classify real world databases (MongoDB, DynamoDB and Cassandra) in terms of the CAP Theorem.
Hey Everyone!
Today we’ll be talking about
Tech Dive on the CAP Theorem
What is the CAP Theorem and why is it used
Common Misconceptions with Consistency, Availability and Partition Tolerance
Evaluation the CAP Theorem with Real World Examples (DynamoDB, MongoDB and Cassandra)
The PACELC Theorem and it’s improvements on CAP
Tech Snippets
Advice for Negotiating Job Offers with Meta
Different Ways of Implementing Authorization
Using Wide Events for Observability
How I turned my Open Source project into a Business
Benchmarking Databases with Graphs
Tech Dive on CAP Theorem
If you look into distributed databases, you’ll quickly come across The CAP Theorem with people throwing around terms like CP/AP databases, BASE, PACELC and more.
The CAP Theorem is fundamental to distributed databases and is extremely important to be aware of. However, its use is also controversial with many distributed systems developers. Many argue that it can be misleading as a way of describing databases and that there are better alternatives available.
What is the CAP Theorem
In the layman’s definition, the CAP Theorem describes a trade off your distributed system must make between 3 possible guarantees.
Consistency - every read operation will always return the most up-to-date data regardless of which node you read the data from. This means that all nodes in the distributed database must store the most up-to-date data.
Availability - every read/write operation to a node in the distributed database will give you “an answer”. However, for reads, this answer might be stale data. For write requests, the write might not propagate to all the other nodes in the distributed database (meaning other nodes may return stale data to reads).
Partition Tolerance - The system will continue to operate even if there are network partitions and nodes in the distributed database aren’t able to talk to each other. You might have a scenario where the nodes in Europe are separated from the nodes in the US due to a network failure. A partition tolerance guarantee means that the database will continue to operate; each partition will operate independently and continue to process requests from clients within their partition.
The layman’s definition is that of these 3 guarantees, you can only pick 2. It’s not possible to have a database that gives all 3 of these guarantees.
Then, you can use this to label certain databases as “CP” (picks consistency and partition tolerance) and “AP” (picks availability and partition tolerance).
Frankly, you could probably get away with giving this definition in a technical interview. But… it’s wrong (if you’re being generous you might just call it extremely misleading).
To answer why, we’ll have to delve into each of these guarantees.
Partition Tolerance
We’ll start with the most frequent error, which is viewing partition tolerance as a choice. This is false.
If you’re creating a distributed database, you can’t just choose whether you want partition tolerance. Your database must do something when there is a network partition and nodes can’t communicate with each other (at some point there will be network issues between the nodes).
In that scenario, you can either design your system to optimize for availability and always respond with some data (even though this data might be out of date if there was an update on one of the partitioned nodes) or you can optimize for consistency and have your database respond with “error - there’s a network partition”.
(In practice, this is not a binary choice and you’ll pick something in between. We’ll delve into that further below.)
If you want a database where you don’t have to deal with partition tolerance, then you should go with a single, centralized database with something like Postgres or MySQL. However, you can only scale that database with vertical scaling (upgrading the hardware).
Consistency
With consistency, there’s a huge number of different consistency models that you can adhere to. Here’s an awesome map that goes through the different models and how strong each one is.
In the CAP Theorem, Consistency actually refers to a specific consistency model called Linearizability.
A TLDR of linearizability is just that your distributed system will behave like a single machine in terms of reads/writes (from the point of view of the clients). Each operation appears to take effect at a single moment in time, and all the other operations appear to take effect either before or after that point. This ordering is the same across all database nodes.
You can have multiple clients accessing your distributed database, with each modifying state with write requests or querying state with reads. These requests will be sent to different nodes in your database.
If the database provides a linearizability guarantee, then the responses the clients receive will make it seem like all these clients are talking to the same node (a singular node instead of a distributed system).
Martin Kleppmann (author of Designing Data Intensive Applications) has a great video on his YouTube channel where he delves into Linearizability.
Availability
The actual definition of availability is a bit different from how the term availability is typically used in computing (with an SLA, SLOs, nines, etc.).
Under a network issue, the CAP theorem assumes that the database will break up into partitions, where each partition contains database nodes and clients.
Clients in a certain partition can’t talk to nodes in a different partition. Database nodes can only talk to other nodes in the same partition.
CAP Availability means that the nodes in your partition (assuming you’re a client) will always give you a non-error response.
However, this response can be stale data for read requests. For write requests, it has to accept the write, but it won’t guarantee that all the nodes in the database will get updated on the write. The nodes in the other partitions may not see the write and send stale data to their users who query for it.
This is the first part of our tech dive on the CAP theorem.
In the full article, we’ll delve into
Commonly used Distributed Databases (DynamoDB, MongoDB, Cassandra) and why classifying them as CP/AP is misleading.
Consistency options for Reads/Writes offered by DynamoDB, MongoDB and Cassandra
The PACELC Theorem and it’s improvements on CAP
Thanks a ton for supporting Quastor. I really appreciate it!