Introduction to Storage Systems in Data Architectures (Quastor Pro)
We'll talk about OLTP databases, data warehouses, data lakes/swamps, lakehouses and more. Plus, load balancing strategies, resources for CTOs and more.
Hey Everyone!
Today we’ll be talking about
An Introduction to Storage Systems in Big Data Architectures
OLTP Databases and their key properties
Data Warehouses, Cloud Data Warehouses and ETL
Data Lakes and Data Swamps
The Data Lakehouse Pattern
How Booking Searches Through Millions of Locations in Milliseconds
One of Booking.com’s features is a map, where they show points of interest within a certain location
The data structure powering this map is a Quadtree, a tree data structure that’s used extensively for spatial indexing, image processing and more
We’ll talk about how Booking.com builds this Quadtree and how they search through it to find location markers
For Quadtrees storing more than 300,000 markers, Booking achieves a p99 lookup time of less than 5.5 milliseconds
Tech Snippets
Load Balancing with the Best-of-K algorithm
A Curated List of Resources for CTOs
Cache Eviction Strategies - FIFO vs. LRU
A Collection of Debugging Stories
Deep Dive on Caching in System Design
An Introduction to Storage Systems in Big Data Architectures
In past Quastor articles, we’ve delved into big data architectures and how you can build a system that processes petabytes of data.
We’ve discussed how
Shopify ingests transaction data from millions of merchants for their Black Friday dashboard
Pinterest processes and stores logs from hundreds of millions of users to improve app performance
Facebook analyzes and stores billions of conversations from user device microphones to generate advertising recommendations (Just kidding. Hopefully they don’t do this)
In these systems, there are several key components:
Sources - origins of data. This can be from a payment processor (Stripe), Google Analytics, a CRM (Salesforce or HubSpot), etc.
Integration - transfers data between different components in the system and minimizes coupling between the different parts. Tools include Kafka, RabbitMQ, Kinesis, etc.
Storage - store the data so it’s easy to access. There are many different types of storage systems depending on your requirements.
Processing - run calculations on the data in an efficient way. The data is typically distributed across many nodes, so the processing framework needs to account for this. Apache Spark is the most popular tool used for this; we did a deep dive on Spark that you can read here.
Consumption - the data (and insights generated from it) gets passed to tools like Tableau and Grafana so that end-users (business intelligence, data analysts, data scientists, etc.) can explore and understand what’s going on
In this article, we’ll delve into storage.
We’ll be talking about
OLTP databases
Data Warehouses
Data Lakes
Data Lakehouses
We’ll have a ton of buzzwords in this article, so don’t forget to fill out your Buzzword Bingo scorecards as you read this article. First person to reply with BINGO gets a prize 😉.
OLTP Databases
OLTP stands for Online Transaction Processing. This is your relational/NoSQL database that you’re using to store data like customer info, recent transactions, login credentials, etc.
Popular databases in this category include Postgres, MySQL, MongoDB, DynamoDB, etc.
The typical access pattern is that you’ll have some kind of key (user ID, transaction number, etc.) and you want to immediately get/modify data for that specific key. You might need to decrement the cash balance of a customer for a banking app, get a person’s name, address and phone number for a CRM app, etc.
Based on this access pattern, some key properties of OLTP databases are
Real Time - As indicated by the “Online” in OLTP, these databases need to respond to queries within a couple hundred milliseconds.
Concurrency - You might have hundreds or even thousands of requests being sent every minute. The database should be able to handle these requests concurrently while keeping the data consistent.
Transactions - You should be able to do multiple database reads/writes in a single operation. Relational databases often provide ACID guarantees around transactions (there’s a quick sketch of this after this list).
Row Oriented - We have a detailed article delving into row vs. column oriented databases here, but this just refers to how data is written to disk. Not all OLTP databases are row oriented, but most of the commonly used ones are. Examples of row-oriented databases include Postgres, MySQL, Microsoft SQL Server and more.
However, you also have document-oriented databases like MongoDB or distributed NewSQL databases like CockroachDB.
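Here’s a minimal sketch of the transaction property in practice, using Python and psycopg2 against a hypothetical accounts table (the table name, column names and connection details are made up for illustration):

```python
import psycopg2

# Hypothetical connection string and table, just for illustration.
conn = psycopg2.connect("dbname=bank user=app password=secret host=localhost")

try:
    with conn:  # commits on success, rolls back if an exception is raised
        with conn.cursor() as cur:
            # Both statements succeed or fail together (atomicity).
            cur.execute(
                "UPDATE accounts SET balance = balance - %s WHERE user_id = %s",
                (100, "alice"),
            )
            cur.execute(
                "UPDATE accounts SET balance = balance + %s WHERE user_id = %s",
                (100, "bob"),
            )
finally:
    conn.close()
```

If anything fails partway through, neither balance changes, which is exactly the guarantee you want for something like a banking app.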
OLTP databases are great for handling day-to-day operations, but what if you don’t have to constantly change the data? What if you have a huge amount of historical, mostly read-only data that you need for generating reports, running ML algorithms, archives, etc.?
This is where data warehouses and data lakes come into play.
Data Warehouse
Instead of OLTP, data warehouses are OLAP - Online Analytical Processing. They’re used for analytics workloads on large amounts of data.
Data warehouses hold structured data, so the values are stored in rows and columns. Data analysts/scientists can use the warehouse for tasks like analyzing trends, monitoring inventory levels, generating reports, and more.
Using the production database (OLTP database) to run these analytical tasks/generate reports is generally not a good idea.
You need your production database to serve customer requests as quickly as possible.
Analytical queries (like going through every single customer transaction and aggregating the data to calculate daily profit/loss) can be extremely compute-intensive and the OLTP databases aren’t designed for these kinds of workloads.
Having data analysts run compute-intensive queries on the production database would end up degrading the user experience for all your customers.
OLAP databases are specifically designed to handle these tasks.
Therefore, you need a process called ETL (Extract, Transform, Load) to
Extract the exact data you need from the OLTP database and any other sources like your CRM, payment processor (Stripe), logging systems, etc.
Transform it into the form you need (restructuring it, removing duplicates, handling missing values, etc.). This is done in your data pipelines or with a tool like Apache Airflow.
Load it into the data warehouse (where the data warehouse is optimized for complex analytical queries)
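Here’s a rough sketch of what those three steps can look like in code (the connection strings and table names are made up; in practice this logic usually lives in an orchestrated pipeline, e.g. an Airflow DAG):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection strings for illustration.
oltp = create_engine("postgresql://app@prod-db/app")
warehouse = create_engine("postgresql://analytics@warehouse/analytics")

# Extract: pull yesterday's transactions from the OLTP database.
df = pd.read_sql(
    "SELECT * FROM transactions WHERE created_at >= now() - interval '1 day'",
    oltp,
)

# Transform: drop duplicates and fill in missing values before loading.
df = df.drop_duplicates(subset=["transaction_id"])
df["currency"] = df["currency"].fillna("USD")

# Load: append the cleaned rows into a warehouse table.
df.to_sql("fact_transactions", warehouse, if_exists="append", index=False)
```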
The key properties of data warehouses are
Column Oriented - We mentioned that OLTP databases are mainly row-oriented. This means that all the data from the same row is stored together on disk.
Data warehouses (and other OLAP systems) are generally column-oriented, where all the data from the same column is stored together on disk. This minimizes the number of disk reads when you want to run computations on all the values in a certain column.
We did a deep dive on Row vs. Column oriented databases that you can check out here (there’s also a quick sketch of the difference after this list).
Highly Scalable - Data warehouses are designed to be distributed so they can scale to store terabytes/petabytes of data
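To make the row vs. column distinction concrete, here’s a small sketch using pandas: the same table written as CSV (row by row) and as Parquet (a columnar format). The file names and data are arbitrary and just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "IN"],
    "revenue": [120.0, 80.5, 42.0],
})

# Row-oriented: each line on disk holds one full record.
df.to_csv("transactions.csv", index=False)

# Column-oriented: all values for each column are stored together.
df.to_parquet("transactions.parquet")

# An analytical query like this only touches the "revenue" column,
# so a columnar file lets the engine skip the other columns entirely.
total = pd.read_parquet("transactions.parquet", columns=["revenue"])["revenue"].sum()
print(total)
```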
Cloud Data Warehouses
Prior to 2010, only large, sophisticated companies had data warehouses. These systems were all on-premises and they cost millions of dollars to install and maintain (i.e. this is why Larry Ellison is worth $160 billion).
This changed in 2012 with AWS Redshift and Google BigQuery. Snowflake came along in 2014.
Now, you could spin up a data warehouse in the cloud and increase/decrease capacity on-demand. Many more companies have been able to take advantage of data warehouses by using cloud services.
With the rise of Cloud Data warehouses, an alternative to ETL has become popular - ELT (Extract, Load, Transform)
Extract the data you need from all your sources (this is the same as ETL)
Load the raw data into a staging area in your data warehouse (Redshift calls this a staging table)
Transform the data from the staging area into all the different views/tables that you need.
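Here’s a rough sketch of the transform step running inside the warehouse itself (the staging and target table names are hypothetical, and the SQL is deliberately simple):

```python
import psycopg2

# Hypothetical warehouse connection (e.g. Redshift speaks the Postgres protocol).
conn = psycopg2.connect("dbname=analytics user=etl host=my-warehouse")

with conn:
    with conn.cursor() as cur:
        # Raw data was already loaded into staging_orders.
        # The transform runs as SQL inside the warehouse.
        cur.execute("""
            INSERT INTO daily_revenue (day, revenue)
            SELECT date_trunc('day', created_at) AS day,
                   SUM(amount) AS revenue
            FROM staging_orders
            WHERE amount IS NOT NULL
            GROUP BY 1
        """)

conn.close()
```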
Pros of ELT over ETL
Frankly, the specific benefits really depend on how complex your transformation step is, which cloud provider you’re using, how much data you need to transform, etc.
Doing the transform step on the data warehouse itself means a simpler architecture. You don’t need data pipelines in-between the sources and the data warehouse for transforming. Instead, you just load the raw data directly to the data warehouse and run your transformations there.
A potential benefit is additional flexibility, depending on how long you keep the data in the staging tables. If the raw data is still there, you can re-run transformations later to create new views/tables in your data warehouse.
Another benefit is lower latency between when data is generated and when it’s ready for analysis in the data warehouse. Depending on your architecture, extracting, loading and transforming in the data warehouse can be faster than running the logic in a bunch of separate data pipelines.
The main issue with data warehouses is that they’re only for structured data. You should already have an idea of how the data can be split into rows/columns and you should also know what your access patterns will be.
If you don’t know this (or if you need to store videos, images or other data that can’t be split into rows/columns) then a data warehouse might not be a good fit.
Data Lake
Data warehouses hold structured data where you have a predefined schema and write data that conforms to the schema.
However, what if you have a bunch of unstructured data that you might need later but don’t have a current use for? Perhaps this could be a dump of customer feedback you’ve pulled from Twitter, Reddit and Facebook. You might want to run ad-hoc sentiment analysis ML algorithms on this, but the data is unstructured and messy, which makes it tough to put in a data warehouse.
A data lake is the solution.
Data lakes hold unstructured data, where you're storing raw data in its natural format. This is great if you’re unsure of what access patterns you’ll need for the data.
Data lakes are often referred to as schema-on-read because there is no predefined schema when you’re writing. You’re just dumping the objects into the data lake. Then, you can clean/aggregate/structure the data and continue to store it there or move it to a data warehouse.
Storing data in a data lake tends to be significantly cheaper than using a data warehouse for large datasets.
Technologies
Data Lakes are typically built on top of file or object-based storage technologies. So, things like
Hadoop Distributed File System (HDFS)
AWS S3
Azure Data Lake Storage
Google Cloud Storage
Each of these ecosystems (the Hadoop ecosystem, AWS, Azure, and GCP) offers tons of ML integrations that you can use to explore and understand whatever data you upload to your data lake.
Companies will typically use a two-tiered architecture, where the first tier is the data lake (with S3, Azure Data Lake or Google Cloud Storage). A subset of the data in the data lake is then cleaned/structured and sent to the second tier, the data warehouse (Redshift, Snowflake, etc.)
Data Swamp
A common anti-pattern with data lakes is letting the lake become a swamp. When you don’t set proper data governance/organization practices, then a data lake can quickly become unorganized and difficult to navigate. At an extreme, the data can become completely useless if no one understands the context behind it. You could just end up paying AWS for terabytes of unused data.
Instead, the pattern is to partition your data lake into layers (also called zones).
Examples of potential layers include
Raw - Raw data that comes straight from data sources. This can be in different formats and may have missing values, be incomplete, etc.
Cleaned - data that has been checked for missing/incorrect values. Someone has made sure the data is complete.
Conformed - the data is converted from different formats (CSV, JSON, TXT, etc.) into one single format (typically Parquet) and joined together in meaningful ways.
Exploration - A layer that data analysts/scientists can use to mess around with the data. Keeping this isolated helps prevent accidents where someone modifies data and accidentally deletes the original copy.
Presentation - Data that can be used by other end-users for machine learning, visualization, loading into a warehouse, etc.
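One common way to implement these layers is simply as key prefixes in the object store. Here’s a minimal sketch with boto3 (the bucket name and file paths are hypothetical):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-company-data-lake"  # hypothetical bucket name

# Raw zone: dump source data as-is.
s3.upload_file("feedback_dump.json", BUCKET, "raw/social/2024-01-15/feedback.json")

# Cleaned zone: the same data after validation / deduplication.
s3.upload_file("feedback_clean.parquet", BUCKET, "cleaned/social/2024-01-15/feedback.parquet")

# Conformed / exploration / presentation zones follow the same prefix convention.
s3.upload_file("feedback_scored.parquet", BUCKET, "presentation/sentiment/2024-01-15/feedback.parquet")
```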
Data Lakehouse
What if you want the flexibility that comes with a data lake but also want some of your data to be in a more tightly constrained environment with an enforced schema?
As mentioned previously, most companies solve this problem with a two-tiered architecture, where they have both a data lake and a data warehouse. Data from the data lake (and other sources) goes to the warehouse.
Maintaining two systems can add extra complexity and create additional failure modes. Data lakehouses attempt to solve this by providing a single system that combines the best of both data lakes and data warehouses.
Data Lakehouse Storage
Data lakehouses use a single storage layer. This is the same as what you’d use for a data lake, so you might use AWS S3, Azure Data Lake Storage, Google Cloud Storage, etc. You load all your data from your sources into this lake and store it in a standard file format, like Apache Parquet.
Similar to a data lake, the data is separated into multiple layers where the layers can roughly represent
Raw - Data from the sources is extracted and loaded here. This is raw data, so the contents are the same as what was outputted by the sources.
Intermediate - The raw data is cleaned, filtered, aggregated, etc. It’s also converted to a common data format like Parquet.
Processed - The data is processed for specific use cases. So, if you need to calculate the daily revenue based on transaction data, then you might take the transaction data from the intermediate layer and split it up and aggregate it so it answers your query. Queries and ML models are typically run from the processed layer.
Archived - Many organizations may require data retention for regulatory/compliance purposes. This layer would store that data. You might use something like Amazon S3 Glacier for this.
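For the archived layer, one option is an S3 lifecycle rule that transitions objects under an archived/ prefix into a Glacier storage class after some number of days. Here’s a sketch with boto3 (the bucket name, prefix and retention period are all placeholders):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-company-data-lake",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-data",
                "Filter": {"Prefix": "archived/"},
                "Status": "Enabled",
                # Move objects to Glacier 30 days after they land in archived/.
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```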
Data Lakehouse Processing
In order to process this data and run data warehouse queries or ML algorithms, the data lake has several metadata and processing layers on top of it.
Delta Lake is an open source metadata and storage layer that you can run on top of your data lake. It keeps track of when data was written/accessed, different data schemas, transaction logs, and more.
Having Delta Lake allows you to use the unstructured data lake like a data warehouse. You can read about Delta Lake here.
For data processing (and running ML algorithms), you use a tool like Spark on top of Delta Lake. Cloud providers like AWS, Azure and GCP all have integrations built out so you can integrate Spark with Delta Lake if you’re trying to build the Lakehouse Pattern.
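Here’s a minimal sketch of what this can look like with PySpark and the delta-spark package (the table path and sample data are placeholders; in production the path would point at your object store, and managed offerings from the cloud providers usually wire up these configs for you):

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a DataFrame as a Delta table (placeholder local path; in production
# this would be something like s3a://my-data-lake/processed/revenue).
df = spark.createDataFrame([(1, "US", 120.0), (2, "DE", 80.5)],
                           ["user_id", "country", "revenue"])
df.write.format("delta").mode("overwrite").save("/tmp/lakehouse/revenue")

# Query it back like a warehouse table.
spark.read.format("delta").load("/tmp/lakehouse/revenue") \
     .groupBy("country").sum("revenue").show()
```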

Integrations on top of Delta Lake

Using Delta Lake on top of your Data Lake
How Booking.com’s Map Feature Works
Booking.com is an online travel agency that connects travelers with hotels, hostels, vacation rentals and more. Hundreds of millions of users visit their website every month to search for accommodations for their next vacation.
A key feature Booking provides on their website is their map. There are tens of millions of listed properties on their marketplace, and users can search the map to see
A rental property’s location
Nearby interesting places (museums, beaches, historical landmarks, etc.)
Distance between the rental property and the interesting places

This map needs to load quickly and their backend needs to search millions of different points across the world.
Igor Dotsenko is a software engineer at Booking and he wrote a great blog post delving into how they built this.
In case this article gets clipped by your email provider, you can read the full article on the Quastor website
Searching on the Map
When the user opens a map for a certain property, a bounding box on the map is shown. Points of interest within this bounding box need to be displayed on the map so Booking needs to quickly find the most important points of interest within that bounding box.

Quadtrees
The underlying data structure that powers this is a Quadtree.
A Quadtree is a tree data structure that’s used extensively when working with 2-D spatial data like maps, images, video games, etc. You can perform efficient insertion/deletion of points, fast range queries, nearest-neighbor searches (find closest object to a given point) and more.
Like any other tree data structure, you have nodes which are parents/children of other nodes. For a Quadtree, internal nodes will always have four children (internal nodes are nodes that aren’t leaf nodes. Leaf nodes are nodes with 0 children). The parent node represents a specific region of the 2-D space, and each child node represents a quadrant of that region.

When you’re dealing with mapping data, the parent node will represent some region in your map. The four children of that parent will represent the northwest, northeast, southwest and southeast quadrants of the parent’s region.
For Booking, each node represents a certain bounding box in their map. Users can change the visible bounding box by zooming in or panning on the map. Each child of a node holds the northwest, northeast, southwest or southeast quadrant of the parent’s larger bounding box.
Each node will also hold a small number of markers (representing points of interest). Each marker is assigned an importance score. Markers that are considered more important are assigned to nodes higher up in the tree (so markers in the root node are considered the most important). The marker for the Louvre Museum in Paris will go in a higher node than a marker for a Starbucks in the same area.
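Booking hasn’t published their exact implementation, but a node along these lines captures the structure described above (the field names are assumptions for illustration):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Marker:
    lat: float
    lng: float
    importance: float  # higher = more important

@dataclass
class BoundingBox:
    min_lat: float
    min_lng: float
    max_lat: float
    max_lng: float

    def intersects(self, other: "BoundingBox") -> bool:
        # Two boxes overlap unless one is entirely above/below or left/right of the other.
        return not (other.min_lat > self.max_lat or other.max_lat < self.min_lat or
                    other.min_lng > self.max_lng or other.max_lng < self.min_lng)

    def contains(self, m: Marker) -> bool:
        return (self.min_lat <= m.lat <= self.max_lat and
                self.min_lng <= m.lng <= self.max_lng)

@dataclass
class QuadtreeNode:
    bounds: BoundingBox
    markers: List[Marker] = field(default_factory=list)            # most important markers for this region
    children: List["QuadtreeNode"] = field(default_factory=list)   # 0 (leaf) or 4 (NW/NE/SW/SE)
```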

Now, we’ll go through how Booking searches through the Quadtree and also how they build/update it.
Quadtree Searching
When the user selects a certain bounding box, Booking wants to retrieve the most important markers for that bounding box from the Quadtree. Since the most important markers are stored closest to the root, they use a Breadth-First Search to do this.
They’ll start with the root node and search through its markers to find if any intersect with the selected bounding box.
If they need more markers, then they’ll check which child nodes intersect with the bounding box and add them to the queue.
Nodes will be removed from the queue (in first-in-first-out order) and the function will search through their markers for any that intersect with the bounding box.
Once the BFS has found the requested number of markers, it will terminate and send them to the user, so they can be displayed on the map.
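Here’s a sketch of that BFS, reusing the QuadtreeNode, BoundingBox and Marker types from the earlier sketch (the limit parameter stands in for “the requested number of markers”):

```python
from collections import deque
from typing import List

def search(root: QuadtreeNode, viewport: BoundingBox, limit: int) -> List[Marker]:
    """Return up to `limit` markers inside the viewport, most important first."""
    results: List[Marker] = []
    queue = deque([root])

    while queue and len(results) < limit:
        node = queue.popleft()  # FIFO order -> breadth-first
        # Markers higher in the tree are more important, so they're found first.
        for marker in node.markers:
            if viewport.contains(marker) and len(results) < limit:
                results.append(marker)
        # Only descend into children whose regions overlap the viewport.
        for child in node.children:
            if child.bounds.intersects(viewport):
                queue.append(child)

    return results
```

Because the BFS visits nodes level by level, the first matches it finds are the ones stored highest in the tree, i.e. the most important markers.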
Building the Quadtree
The Quadtree is kept in-memory and rebuilt from time to time to account for new markers that are added to the map (or for changes in marker importance).
When building the data structure, the algorithm first starts with the root node. This node represents the entire geographic area (Booking scales the Quadtrees horizontally, so they have different Quadtree data structures for different geographic regions).
If each node can only hold 10 markers, then Booking will find the 10 most important markers from the entire list and insert them into the root node.
Afterwards, they create 4 children for the northeast/northwest/southwest/southeast quadrants of the root node’s area and repeat the process for each of those children.
This is repeated until all the markers are inserted into the Quadtree.
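Here’s a sketch of that top-down build, again reusing the types from the earlier sketches (the 10-marker capacity and the quadrant-splitting helper are assumptions):

```python
CAPACITY = 10  # assumed max markers per node

def quadrants(b: BoundingBox) -> List[BoundingBox]:
    """Split a bounding box into its NW, NE, SW and SE quadrants."""
    mid_lat = (b.min_lat + b.max_lat) / 2
    mid_lng = (b.min_lng + b.max_lng) / 2
    return [
        BoundingBox(mid_lat, b.min_lng, b.max_lat, mid_lng),   # NW
        BoundingBox(mid_lat, mid_lng, b.max_lat, b.max_lng),   # NE
        BoundingBox(b.min_lat, b.min_lng, mid_lat, mid_lng),   # SW
        BoundingBox(b.min_lat, mid_lng, mid_lat, b.max_lng),   # SE
    ]

def build(bounds: BoundingBox, markers: List[Marker]) -> QuadtreeNode:
    node = QuadtreeNode(bounds=bounds)
    if not markers:
        return node

    # The most important markers live in the highest node that covers them.
    markers = sorted(markers, key=lambda m: m.importance, reverse=True)
    node.markers = markers[:CAPACITY]
    remaining = markers[CAPACITY:]

    # Overflow markers get pushed down into the four child quadrants.
    if remaining:
        for quad in quadrants(bounds):
            node.children.append(build(quad, [m for m in remaining if quad.contains(m)]))
    return node
```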
Results
Quadtrees are quite simple but extremely powerful, making them very popular across the industry. If you’re preparing for system design interviews then they’re a good data structure to be aware of.
Booking scales this system horizontally, by creating more Quadtrees and having each one handle a certain geographic area.
For Quadtrees storing more than 300,000 markers, the p99 lookup time is less than 5.5 milliseconds (99% of requests take under 5.5 milliseconds).
For more details, you can read the full blog post here.