Introduction to Storage Systems in Data Architectures

We'll talk about OLTP databases, data warehouses, data lakes/swaps, lakehouses and more. Plus, load balancing strategies, resources for CTOs and more.

March 30, 2024

Hey Everyone!

Today we’ll be talking about

An Introduction to Storage Systems in Big Data Architectures
- OLTP Databases and their key properties
- Data Warehouses, Cloud Data Warehouses and ETL
- Data Lakes and Data Swamps
- The Data Lakehouse Pattern
How Booking Searches Through Millions of Locations in Milliseconds
- One of Booking.com’s features is a map, where they show points of interest within a certain location
- The data structure powering this map is a Quadtree, a tree data structure that’s used extensively for spatial indexing, image processing and more
- We’ll talk about how Booking.com builds this Quadtree and how they search through it to find location markers
- For Quadtrees storing more than 300,000 markers, Booking achieves a p99 lookup time of less than 5.5 milliseconds
Tech Snippets
- Load Balancing with the Best-of-K algorithm
- A Curated List of Resources for CTOs
- Cache Eviction Strategies - FIFO vs. LRU
- A Collection of Debugging Stories
- Deep Dive on Caching in System Design

An Introduction to Storage Systems in Big Data Architectures

In past Quastor articles, we’ve delved into big data architectures and how you can build a system that processes petabytes of data.

We’ve discussed how

Shopify ingests transaction data from millions of merchants for their Black Friday dashboard
Pinterest processes and stores logs from hundreds of millions of users to improve app performance
Facebook analyzes and stores billions of conversations from user device microphones to generate advertising recommendations (Just kidding. Hopefully they don’t do this)

In these systems, there are several key components:

Sources - origins of data. This can be from a payment processor (Stripe), google analytics, a CRM (salesforce or hubspot), etc.
Integration - transfers data between different components in the system and minimizes coupling between the different parts. Tools include Kafka, RabbitMQ, Kinesis, etc.
Storage - store the data so it’s easy to access. There’s many different types of storage systems depending on your requirements
Processing - run calculations on the data in an efficient way. The data is typically distributed across many nodes, so the processing framework needs to account for this. Apache Spark is the most popular tool used for this; we did a deep dive on Spark that you can read here.
Consumption - the data (and insights generated from it) gets passed to tools like Tableau and Grafana so that end-users (business intelligence, data analysts, data scientists, etc.) can explore and understand what’s going on

In this article, we’ll delve into storage.

We’ll be talking about

OLTP databases
Data Warehouses
Data Lakes
Data Lakehouses

We’ll have a ton of buzzwords in this article, so don’t forget to fill out your Buzzword Bingo scorecards as you read this article. First person to reply with BINGO gets a prize 😉.

OLTP Databases

OLTP stands for Online Transaction Processing. This is your relational/NoSQL database that you’re using to store data like customer info, recent transactions, login credentials, etc.

Popular databases in this category include Postgres, MySQL, MongoDB, DynamoDB, etc.

The typical access pattern is that you’ll have some kind of key (user ID, transaction number, etc.) and you want to immediately get/modify data for that specific key. You might need to decrement the cash balance of a customer for a banking app, get a person’s name, address and phone number for a CRM app, etc.

Based on this access pattern, some key properties of OLTP databases are

Real Time - As indicated by the “Online” in OLTP, these databases needs to respond to queries within a couple hundred milliseconds.
Concurrency - You might have hundreds or even thousands of requests being sent every minute. The database should be able to handle these requests concurrently and process them in the right order.
Transactions - You should be able to do multiple database reads/writes in a single operation. Relational databases often provide ACID guarantees around transactions.
Row Oriented - We have a detailed article delving into row vs. column oriented databases here but this just refers to how data is written to disk. Not all OLTP databases are row oriented but most of the commonly used ones are. Examples of row-oriented databases include Postgres, MySQL, Microsoft SQL server and more.

However, you also have document-oriented databases like MongoDB or NewSQL databases like CockroachDB (column-oriented internally)

OLTP databases are great for handling day-to-day operations, but what if you don’t have to constantly change the data? What if you have a huge amount of historical (mainly) read-only data that you need for generating reports, running ML algorithms, archives, etc.

This is where data warehouses and data lakes come into play.

Data Warehouse

Instead of OLTP, data warehouses are OLAP - Online Analytical Processing. They’re used for analytics workloads on large amounts of data.

Data warehouses hold structured data, so the values are stored in rows and columns. Data analysts/scientists can use the warehouse for tasks like analyzing trends, monitoring inventory levels, generating reports, and more.

Using the production database (OLTP database) to run these analytical tasks/generate reports is generally not a good idea.

You need your production database to serve customer requests as quickly as possible.

Analytical queries (like going through every single customer transaction and aggregating the data to calculate daily profit/loss) can be extremely compute-intensive and the OLTP databases aren’t designed for these kinds of workloads.

Having data analysts run compute-intensive queries on the production database would end up degrading the user experience for all your customers.

OLAP databases are specifically designed to handle these tasks.

Therefore, you need a process called ETL (Extract, Transform, Load) to

Extract the exact data you need from the OLTP database and any other sources like CRM, payment processor (stripe), logging systems, etc.
Transform it into the form you need (restructuring it, removing duplicates, handling missing values, etc.). This is done in your data pipelines or with a tool like Apache Airflow.
Load it into the data warehouse (where the data warehouse is optimized for complex analytical queries)

The key properties of data warehouses are

Column Oriented - We mentioned that OLTP databases are mainly row-oriented. This means that all the data from the same row is stored together on disk.

Data warehouses (and other OLAP systems) are generally column-oriented, where all the data from the same column is stored together on disk. This minimizes the number of disk reads when you want to run computations on all the values in a certain column.

We did a deep dive on Row vs. Column oriented databases that you can check out here.
Highly Scalable - Data warehouses are designed to be distributed so they can scale to store terabytes/petabytes of data

Cloud Data Warehouses

Prior to 2010, only large, sophisticated companies had data warehouses. These systems were all on-premises and they cost millions of dollars to install and maintain (i.e. this is why Larry Ellison is worth $160 billion dollars).

This changed in 2012 with AWS Redshift and Google BigQuery. Snowflake came along in 2014.

Now, you could spin up a data warehouse in the cloud and increase/decrease capacity on-demand. Many more companies have been able to take advantage of data warehouses by using cloud services.

With the rise of Cloud Data warehouses, an alternative to ETL has become popular - ELT (Extract, Load, Transform)

Extract the data you need from all your sources (this is the same as ETL)
Load the raw data into a staging area in your data warehouse (Redshift calls this a staging table)
Transform the data from the staging area into all the different views/tables that you need.

This is the first part of our pro article on storage systems in big data architectures. The entire article is 2500+ words and 10+ pages.

In the full article, we’ll cover things like

Pros/Cons of ELT vs. ETL
Data Lakes with HDFS/S3/Azure
Data Swamp Anti-pattern
Data Lakehouse pattern

Get access to the full Quastor Pro archive (along with Spaced-Repetition Anki Flash cards on all Quastor content) here.

Tech Snippets

Load Balancing with Best-of-K

Best-of-K load balancing is where you randomly select k servers from your pool and then send the request to the least-loaded machine amongst those k.

It’s significantly better than random allocation but it performs significantly less efficiently when

brooker.co.za/blog/2024/03/25/needles.html

Resources for CTOs

This is a great github repo with resources for CTOs (or aspiring CTOs).

It contains resources on software development processes, architecture, product management, hiring and much more!

github.com/kuchin/awesome-cto

Cache Eviction Strategies - FIFO vs. LRU

When thinking about Cache Eviction, your first though is probably of LRU - getting rid of the least recently used item when the cache is full.

This article delves into FIFO (first in, first out) as a potential alternative to LRU. FIFO requires less metadata whereas the LRU data structure requires an update on each cache hit (you have to update the most recently used item).

The issue with just using FIFO is that it will remove popular items, so the author delves into modifications you can make to the algorithm (lazy promotion) to avoid this.

blog.jasony.me/system/cache/2023/06/24/fifo-lru

A Collection of Debugging Stories

This is a collection of stories that people have published detailing their journey of debugging a certain problem.

Reading these can help you hone your own debugging skills and it can also be fun.

github.com/danluu/debugging-stories

Deep Dive on Caching in System Design

This is a fantastic blog post that delves into Caching. It talks about common strategies (read-through, write-through, write-behind) and challenges.

The article also delves into potential pitfalls like stale data and cache stampedes. It also introduces CQRS as a strategy to eliminate the need for a cache.

medium.com/ssense-tech/cache-me-if-you-can-a-look-at-common-caching-strategies-and-how-cqrs-can-replace-the-need-in-the-65ec2b76e9e

Premium Content

Subscribe to Quastor Pro for long-form articles on concepts in system design and backend engineering.

Past article content includes

System Design Concepts

Measuring Availability
API Gateways
Database Replication
Load Balancing
API Paradigms
Database Sharding
Caching Strategies
Event Driven Systems
Database Consistency
Chaos Engineering
Distributed Consensus

Tech Dives

Redis
Postgres
Kafka
DynamoDB
gRPC
Apache Spark
HTTP
DNS
B Trees & LSM Trees
OLAP Databases
Database Engines

When you subscribe, you’ll also get Spaced Repetition (Anki) Flashcards for reviewing all the main concepts discussed in past Quastor articles

How Booking.com’s Map Feature Works

Booking.com is an online travel agency that connects travelers with hotels, hostels, vacation rentals and more. Hundreds of millions of users visit their website every month to search for accommodations for their next vacation.

A key feature Booking provides on their website is their map. There are tens of millions of listed properties on their marketplace, and users can search the map to see

A rental property’s location
Nearby interesting places (museums, beaches, historical landmarks, etc.)
Distance between the rental property and the interesting places

This map needs to load quickly and their backend needs to search millions of different points across the world.

Igor Dotsenko is a software engineer at Booking and he wrote a great blog post delving into how they built this.

In case this article gets clipped by your email provider, you can read the full article on the Quastor website

Searching on the Map

When the user opens a map for a certain property, a bounding box on the map is shown. Points of interest within this bounding box need to be displayed on the map so Booking needs to quickly find the most important points of interest within that bounding box.

Quadtrees

The underlying data structure that powers this is a Quadtree.

A Quadtree is a tree data structure that’s used extensively when working with 2-D spatial data like maps, images, video games, etc. You can perform efficient insertion/deletion of points, fast range queries, nearest-neighbor searches (find closest object to a given point) and more.

Like any other tree data structure, you have nodes which are parents/children of other nodes. For a Quadtree, internal nodes will always have four children (internal nodes are nodes that aren’t leaf nodes. Leaf nodes are nodes with 0 children). The parent node represents a specific region of the 2-D space, and each child node represents a quadrant of that region.

When you’re dealing with mapping data, the parent node will represent some region in your map. The four children of that parent will represent the northwest, northeast, southwest and southeast quadrants of the parent’s region.

For Booking, each node represented a certain bounding box in their map. Users can change the visible bounding box by zooming in or panning on the map. Each child of the node holds the northwest, northeast, southwest, southeast bounding box within the larger bounding box of the parent.

Each node will also hold a small number of markers (representing points of interest). Each marker is assigned an importance score. Markers that are considered more important are assigned to nodes higher up in the tree (so markers in the root node are considered the most important). The marker for the Louvre Museum in Paris will go in a higher node than a marker for a Starbucks in the same area.

Now, we’ll go through how Booking searches through the Quadtree and also how they build/update it.

Quadtree Searching

When the user selects a certain bounding box, Booking wants to retrieve the most important markers for that bounding box from the Quadtree. Therefore, they use a Breadth-First Search to do this.

They’ll start with the root node and search through its markers to find if any intersect with the selected bounding box.

If they need more markers, then they’ll check which child nodes intersect with the bounding box and add them to the queue.

Nodes will be removed from the queue (in first-in-first-out order) and the function will search through their markers for any that intersect with the bounding box.

Once the BFS has found the requested number of markers, it will terminate and send them to the user, so they can be displayed on the map.

Building the Quadtree

The Quadtree is kept in-memory, and re-built from time-to-time to account for new markers that are added to the map (or account for changes in marker importance).

When building the data structure, the algorithm first starts with the root node. This node represents the entire geographic area (Booking scales the Quadtrees horizontally, so they have different Quadtree data structures for different geographic regions).

If each node can only hold 10 markers, then Booking will find the 10 most important markers from the entire list and insert them into the root node.

After, they create 4 children for the northeast/northwest/southwest/southeast regions of the root node’s area and repeat the process for each of those children.

This is repeated until all the markers are inserted into the Quadtree.

Results

Quadtrees are quite simple but extremely powerful, making them very popular across the industry. If you’re preparing for system design interviews then they’re a good data structure to be aware of.

Booking scales this system horizontally, by creating more Quadtrees and having each one handle a certain geographic area.

For Quadtrees storing more than 300,000 markers, the p99 lookup time is less than 5.5 milliseconds (99% of requests take under 5.5 milliseconds).

For more details, you can read the full blog post here.