Building a basic Storage Engine

Hey Everyone,

Today we’ll be talking about

  • Diving deeper into Databases (the storage engine)

    • The different components

    • What purpose each of these components serves

    • How these components interact with each other

  • Plus some awesome tech snippets!

We also have a solution to our last coding interview question on Binary Search, plus a new question from Google. This interview question is quite difficult, so be sure to give it a shot!

Join Quastor and get emails on system design, software architecture and algorithmic interviews today!

Building a basic Storage Engine

In our last email, we went over the components of a typical Database Management System (DBMS).

Now, we’re going to dive into the most important component, the storage engine.

The storage engine is responsible for storing, retrieving and managing data in memory and on disk.

A DBMS will often let you pick which storage engine you want to use. For example, MySQL has several choices for the storage engine, including RocksDB and InnoDB.

Having an understanding of how storage engines work is crucial for helping you pick the right database.

The two most popular types of database storage engines are

  • Log Structured (like Log Structured Merge Trees)

  • Page Oriented (like B-Trees)

Log Structured storage engines treat the database as an append-only log file where data is sequentially added to the end of the log.

Page Oriented storage engines break the database down into fixed-size pages (traditionally 4 KB in size) and read/write one page at a time.

Each page is identified using an address, which allows one page to refer to another page. These page references are used to construct a tree of pages.

In this update, we’ll be focusing on Log Structured storage engines. We’ll look at how to build a basic log structured engine with a hash index for faster reads.

Log Structured Storage Engines

Our storage engine will handle key-value pairs. It’ll have two functions

  • db_set(key, value) - give the database a (key, value) pair and the database will store it. If the key already exists then the database will just update the value.

  • db_get(key) - give the database a key and the database will return the associated value. If the key doesn’t exist, then the database will return null.

The storage engine works by maintaining a log of all the (key, value) pairs.

By a log, we mean an append-only sequence of records. The log doesn’t have to be human readable (it can be binary).

Anytime you call db_set, you append the (key, value) pair to the bottom of the log.

This is done regardless of whether the key already existed in the log.

The append-only strategy works well because appending is a sequential write operation, which is generally much faster than random writes on magnetic spinning-disk hard drives (and on solid state drives to some extent).

Now, anytime you call db_get, you look through all the (key, value) pairs in the log and return the last value for any given key.

We return the last value since there may be duplicate values for a specific key (if the key, value pair was inserted multiple times) and the last value that was inserted is the most up-to-date value for the key.
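Here’s a minimal sketch of this log-based approach in Python. The file name and the comma-separated record format are assumptions for illustration; a real engine would typically use a binary format.

```python
import os

DB_FILE = "db.log"  # hypothetical log file name

def db_set(key, value):
    # Append the (key, value) pair to the end of the log,
    # even if the key was already written before.
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # Scan the entire log (O(n)) and keep the *last* value seen,
    # since the most recent write holds the current value for the key.
    if not os.path.exists(DB_FILE):
        return None
    result = None
    with open(DB_FILE) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result
```

Note that db_set is a single sequential append, while db_get has to replay the whole file.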

The db_set function runs in O(1) time since appending to a file is very efficient.

On the other hand, db_get runs in O(n) time, since it has to look through the entire log.

This becomes an issue as we scale the log to handle more (key, value) pairs.

But, we can make db_get faster by adding a database index.

The general idea behind an index is to keep additional metadata on the side.

This metadata can act as a “signpost” as the storage engine searches for data and helps it find the data faster.

The database index is an additional structure that is derived from the primary data and adding a database index to our storage engine will not affect the log in any way.

The benefit of a database index is that your database reads can get significantly faster: you can use the index during a read to achieve sublinear read speeds.

However, database writes are slower now because for every write you have to update your log and update the database index.

An example of a database index we can add to our storage engine is a hash index.

Hash Indexes

One index we can use is a hash table.

We keep a hash table in memory. Each key in our log is stored as a key in the hash table, and the location in the log where that key can be found (the byte offset) is stored as that key’s value in the hash table.

Now, the process for db_set is

  1. Append the (key, value) pair to the end of the log. Note the byte offset for where it’s added.

  2. Check if the key is in the hash table.

  3. If it is, then update its value in the hash table to the new byte offset.

  4. If it isn’t, then add (key, byte offset) to the hash table.
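Those four steps might look like this in Python. The file name and record format are again assumptions, and note that a Python dict handles steps 2–4 with a single assignment.

```python
import os

LOG_FILE = "indexed.log"  # hypothetical segment file name
index = {}                # in-memory hash table: key -> byte offset

def db_set(key, value):
    # Step 1: append the record, noting the byte offset where it starts.
    offset = os.path.getsize(LOG_FILE) if os.path.exists(LOG_FILE) else 0
    with open(LOG_FILE, "ab") as f:
        f.write(f"{key},{value}\n".encode())
    # Steps 2-4: point the key at the new offset, inserting or updating.
    index[key] = offset

def db_get(key):
    # With the index, a read is one dict lookup plus one disk seek.
    if key not in index:
        return None
    with open(LOG_FILE, "rb") as f:
        f.seek(index[key])
        line = f.readline().decode().rstrip("\n")
        _, _, value = line.partition(",")
        return value
```

The files are opened in binary mode so that seeking to an arbitrary byte offset is well-defined.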

One requirement with this strategy is that all your (key, byte offset) pairs have to fit in RAM since the hash map is kept in-memory.

An issue that will eventually come up is that we’ll run out of disk space.

Since we’re only appending to our log, we’ll have repeat entries for the same key even though we’re only using the most recent value.

We can solve this issue by breaking our log up into segments.

Whenever our log reaches a certain size, we’ll close it and then make subsequent writes to a new segment file.

Then, we can perform compaction on our old segment, where we throw away duplicate keys in that segment and only keep the most recent update for each key.

After removing duplicate keys, we can merge past segments together into larger segments of a fixed size.

We’ll also create new hash tables with (key, byte offset) pairs for all the past merged segments we have and keep those hash tables in-memory.

The merging and compaction of past segments is done in a background thread, so the database can still respond to read/writes while this is going on.
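A sketch of compaction and merging, reusing the record format assumed in the earlier examples:

```python
def compact_and_merge(segment_paths, merged_path):
    # Replay old segments from oldest to newest; later writes
    # overwrite earlier ones in the dict, so we end up holding
    # only the most recent value for each key.
    latest = {}
    for path in segment_paths:
        with open(path, "rb") as f:
            for line in f:
                key, _, value = line.decode().rstrip("\n").partition(",")
                latest[key] = value
    # Write one deduplicated segment and build its new in-memory
    # hash table of (key, byte offset) pairs.
    new_index = {}
    offset = 0
    with open(merged_path, "wb") as out:
        for key, value in latest.items():
            record = f"{key},{value}\n".encode()
            new_index[key] = offset
            out.write(record)
            offset += len(record)
    return new_index
```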

Now, the process for db_get is

  1. Check the current segment’s in-memory hash table for the key.

  2. If it’s there, then use the byte offset to find the value in the segment on disk and return the value.

  3. Otherwise, look through the hash tables for our previous segments to find the key and its byte offset.
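This lookup order might be sketched as follows, where segments holds (hash table, file path) pairs ordered newest first (the data layout is an assumption):

```python
def db_get(key, segments):
    # Check the newest segment's index first, then fall back to
    # progressively older segments until the key is found.
    for index, path in segments:
        if key in index:
            with open(path, "rb") as f:
                f.seek(index[key])
                line = f.readline().decode().rstrip("\n")
                _, _, value = line.partition(",")
                return value
    return None
```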

The methodology we’ve described is very similar to a real world storage engine - Bitcask.

Bitcask is a storage engine used for the Riak database, and it’s written in Erlang. You can read Bitcask’s whitepaper here.

Bitcask also stores snapshots of the in-memory hash tables to disk.

This helps out in case of crash recovery, so the database can quickly spin back up if the in-memory hash tables are lost.

This is a brief overview of how a Log Structured database can work.

However, issues that arise are

  • The hash table of (key, byte offsets) must fit in memory. Maintaining a hashmap on disk is too slow.

  • Range queries are not efficient. If you want to scan over all the keys that start with the letter a, then you’ll have to look up each key individually in the hash maps.

In our next emails, we’ll look at SSTables, LSM-trees and B-trees!

So, stay tuned.

Tech Snippets

  • What makes the Haskell type system so revered? There’s a lot of buzz on Reddit and Hacker News around Haskell, and one of the reasons is Haskell’s type system. Haskell is very popular in academia for research around programming languages, specifically around type systems. The main goal of researching types is to find better ways to reduce the number of runtime bugs (catch them at compile time) while also not annoying developers (by adding too much boilerplate code or excessively reducing flexibility). This answer on Stack Exchange to the question What makes the Haskell type system so revered provides a great list of some of the features that Haskell’s type system provides.

  • It’s tough being an Azure fan - Microsoft has been struggling recently with Azure. Numerous (major) security breaches, several long (12+ hour) outages and some outrageous price schemes. This article goes through some specific criticisms of Azure from a CTO who advises many high-growth companies on how to build their tech stacks.

  • Foundations of Blockchains - a course by Columbia Professor Tim Roughgarden. Tim Roughgarden is a CS Professor at Columbia University. You may know him from his outstanding Coursera MOOCs on Algorithms. He also wrote the Algorithms Illuminated book series and has fantastic lectures on Algorithmic Game Theory. He’s now teaching a course on the fundamentals of blockchains. Around 60% of the course is on Layer 1 solutions like Bitcoin and Ethereum. 20% of the course is on Layer 2 scaling solutions like the Lightning Network, optimistic rollups and zero-knowledge rollups. The last 20% of the course is on the application layer, primarily covering DeFi (Decentralized Finance) protocols.

Interview Question

You are given a string s.

You can convert s to a palindrome by adding characters in front of it.

Return the shortest palindrome that you can find by adding characters in front of s.

Example

Input: s = “abcd”

Output: “dcbabcd”

We’ll send the solution in our next email, so make sure you move our emails to primary, so you don’t miss them!

Gmail users—move us to your primary inbox

  • On your phone? Hit the 3 dots at the top right corner, click "Move to" then "Primary"

  • On desktop? Back out of this email then drag and drop this email into the "Primary" tab near the top left of your screen

  • A pop-up will ask you “Do you want to do this for future messages from [email protected]?” - please select yes.

Apple mail users—tap on our email address at the top of this email (next to "From:" on mobile) and click “Add to VIPs”

Previous Solution

As a reminder, here’s our last question

Write a function to calculate the square root of a non-negative integer.

You should truncate any decimal digits in your answer and just return the integer.

Solution

We can solve this question in logarithmic time using Binary Search.

We’ll set low and high variables and initialize them to 0 and x.

Then, our mid will be the average of low and high.

If mid is the truncated square root of x, then we can return mid. We can test for this by squaring mid and by squaring (mid + 1).

If the square of mid is less than or equal to x and the square of (mid + 1) is greater than x, then mid is the truncated square root of x.

We can just return mid in that scenario.

Otherwise, if the square of mid is less than x, we’ll set low to mid + 1 and try again.

If the square of mid is greater than x, then we’ll set high to mid - 1 and try again.

Here’s the Python 3 code.
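A sketch matching the steps described above (the function name is an assumption):

```python
def my_sqrt(x):
    # Binary search over the range [0, x] for the truncated
    # square root of a non-negative integer x.
    low, high = 0, x
    while low <= high:
        mid = (low + high) // 2
        # mid is the answer if mid^2 <= x < (mid + 1)^2
        if mid * mid <= x and (mid + 1) * (mid + 1) > x:
            return mid
        elif mid * mid < x:
            low = mid + 1
        else:
            high = mid - 1
```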