The Architecture of Databases

September 21, 2021

Hey Everyone,

Today we’ll be talking about

The Architecture of Databases
- The different components: Transport System, Query Processor, Execution Engine and the Storage Engine
- What purpose do each of these components serve
- How do these components interact with each other
Plus, a couple awesome tech snippets on
- A dive into HTTP/3 (history, core concepts, performance features)
- Advanced Linux Programming (an amazing free textbook that dives into things you should know when writing software for the Linux OS)
- 7 different database paradigms

We also have a solution to our last coding interview question on Dynamic Programming, plus a new question from Facebook.

Quastor Daily is a free Software Engineering newsletter sends out FAANG Interview questions (with detailed solutions), Technical Deep Dives and summaries of Engineering Blog Posts.

The Architecture of Databases

Database Management Systems provide an abstraction that you can use in your applications for storing and (later) retrieving data.

The DBMS exposes an API that you can use through some type of a Query Language and it provides a set of guarantees on how you can store/retrieve your data from the database.

A couple examples of guarantees are things like durability (you won’t lose your data if the database server crashes), strong consistency (for distributed databases - once you write some data to the database, all subsequent reads will return that value and you won’t get “stale” data) or read/write speeds (IOPS is the standard measure of input and output operations per second on storage devices).

The architecture of Database Management Systems can vary widely based on it’s guarantees and design goals (a distributed database meant to store petabytes of data will have a different design than an embedded database like SQLite).

However, databases have some common themes in their architectures and knowing these themes can provide a useful model for how they work.

The general architecture can be described by this diagram...

Database Management Systems use a client/server model.

When you use a database, your application is the client and the database system is the server (either hosted on the same machine or hosted on a different machine).

The Transport system is how the DBMS accepts client requests. Client requests come in the form of a database query and are usually expressed in some type of a query language.

The Transport system first takes in the client’s query and passes it on to the Query Processor.

The first part of the Query Processor is the Query Parser. It parses the client’s query ( using an Abstract Syntax Tree for example) and makes sure the client’s query is valid. If the query isn’t valid, then the database will return an error to the client.

The Query Parser may also run access control checks, and make sure that the client is correctly permissioned to access/modify the data that they’re requesting.

After, the parsed query is passed to the Query Optimizer.

A parsed query can normally be satisfied in different ways, but the different ways vary in efficiency. The optimizer’s job is to find the most efficient way.

The optimizer will first eliminate redundant parts of the query and then use internal database statistics (index cardinality, approximate intersection size, etc.) to find the most efficient way to execute the query.

For distributed databases, the Optimizer will also consider data placement like which node in the cluster holds the data and the costs associated with the transfer.

The output from the Optimizer is an Execution Plan that describes the optimal method of executing the query. This plan is also called the Query Plan or Query Execution Plan.

This Execution Plan gets passed on to the Execution Engine which carries out the plan. When you’re using a distributed database, the plan can involve Remote Execution (making network requests for data that is stored on a different machine). Otherwise, it’s just Local Execution (carrying out queries for data that is stored locally).

Remote Execution involves Cluster Communication, where we communicate with other machines in our database cluster and send them requests for data.

Local Execution involves talking to the Storage Engine to get the data.

The Storage Engine is the component in the database that is directly responsible for storing, retrieving and managing data in memory and on disk.

Storage Engines typically provide a simple data manipulation API (allowing for create, read, update, delete features) and contain all the logic for the actual details of how to manipulate the data.

Examples of Storage Engines include BerkeleyDB, LevelDB, RocksDB, etc.

Databases will often allow you to pick the Storage Engine that’s being used.

MySQL, for example, has several choices for the storage engine, including RocksDB and InnoDB.

The Storage Engine then consists of several components

Transaction Manager - responsible for creating transaction objects and managing their atomicity (either the entire transaction succeeds or it is rolled back).

Lock Manager - Transactions will be executing concurrently, so the Lock Manager manages the locks on database objects being accessed by each transaction (and releasing those locks when the transaction either commits or rolls back).

Access Methods - These manage access, compression and organizing data on disk. Access methods include heap files and storage structures such as B-trees.

Buffer Manager - This manager caches data pages in RAM to reduce the number of accesses to disk.

Recovery Manager - Maintains the operation log and restoring the system state in case of a failure.

Different Storage Engines make different tradeoffs between these components resulting in differing performance for things like compression, scaling, partitioning, speed, etc.

This is a brief summary from Database Internals by Alex Petrov.

Quastor Daily is a free Software Engineering newsletter sends out FAANG Interview questions (with detailed solutions), Technical Deep Dives and summaries of Engineering Blog Posts.

Tech Snippets

Advanced Linux Programming - If you’re a developer for the Linux OS, this is a great textbook to teach you advanced topics around how to write programs that work with features like IPC, multiprocessing, multi-threading, and interaction with hardware devices.The book goes into how you can build applications that are faster and more secure and it also shows peculiarities in the Linux OS including limitations and conventions.
7 Database paradigms - A great introduction to seven different database paradigms and what they do best
- Key-Value databases like Redis
- Wide Column databases like Cassandra and HBase
- Document Databases like MongoDB
- Relational Databases like Postgres and MySQL
- Graph Databases like Neo4J
- Search Engines like Elasticsearch and Algolia
- Multi-model like FaunaDB
HTTP/3 from A to Z - After 5 years of development, the new HTTP/3 protocol is nearing its final form. This blog post goes into
- HTTP/3 History and core concepts - For people who are new to HTTP/3 and protocols in general. It mainly discusses the basics
- HTTP/3 Performance features - performance features in QUIC vs. TCP and how you can look at HTTP/3 as HTTP/2-over-QUIC
- HTTP/3 deployment options - The challenges involved in deploying and testing HTTP/3 yourself. It also talks about if and how you should change your web pages and resources.

Interview Question

Write a function to calculate the square root of a non-negative integer.

You should truncate any decimal digits in your answer and just return the integer.

We’ll send the solution in our next email, so make sure you move our emails to primary, so you don’t miss them!

Gmail users—move us to your primary inbox

On your phone? Hit the 3 dots at the top right corner, click "Move to" then "Primary"
On desktop? Back out of this email then drag and drop this email into the "Primary" tab near the top left of your screen
A pop-up will ask you “Do you want to do this for future messages from [email protected]” - please select yes

Apple mail users—tap on our email address at the top of this email (next to "From:" on mobile) and click “Add to VIPs”

Previous Solution

As a reminder, here’s our last question

Given two strings word1 and word2, return the minimum number of steps required to make word1 and word2 the same.

In one step, you can delete exactly one character in either string.

Example 1

Input: word1 = “sea”, word2 = “eat”

Output: 2

Here’s the question in LeetCode

Solution

We can solve this question using Dynamic Programming.

We’ll use top-down dynamic programming with recursion and a memo table (hash table for caching repeated computations).

In our recursive function, we’ll use c1 and c2 as indexes for word1 and word2.

We can terminate when c1 is equal to the length of word1 or c2 is equal to the length of word2.

If either is the case, we can just return the remaining characters left in the other word (as we’d have to delete those characters for the two words to be equal).

Otherwise, we’ll first check to see if word1[c1] == word2[c2]

If so, then we don’t have to delete any characters and we can call our recursive function with (c1 + 1, c2 + 1).

If they’re not equal, then we’ll try deleting the current character of word1.

We tally 1 edit (since we deleted a character) and then call our function with (c1 + 1, c2).

After we find the total number of edits from deleting the current character in word1, we’ll try deleting the character in word2 instead.

Again, we tally 1 edit (since we deleted a character in word2) and then we call our function with (c1, c2 + 1).

That will give us the total number of edits from deleting the current character in word2.

Now, we compare the total edits from deleting a character in word1 vs. word2 and return the minimum total edits.

There are quite a few repeat computations with this approach, so we can cache our answers in our memo table.

Here’s the Python 3 code.