Tech Dive - Postgres

This is a technical dive on Postgres. We'll be talking about it's key features, storage engine, MVCC, extensions, criticisms and more.

January 30, 2024

Hey Everyone!

Today we’ll be talking about

Tech Dive on Postgres
- History of Postgres
- Postgres’ Key Features & Selling Points
- What is an Object Relational Databases
- Postgres’ Storage Engine
- Multi-Version Concurrency Control Explained
- The Postgres Extensions Ecosystem
- Commonly Mentioned Issues with Postgres
- Working with Postgres in 2024
Tech Snippets
- How Cell-Based Architectures Work
- Are we at Peak Vector Database?
- Finding the Right Company to Reach Staff Engineer At
- Tips on Writing Documentation for Your Team

Postgres Tech Dive

If you take a look at the Stack Overflow Developer Survey, one of the technologies that you’ll always see at the top of the list is Postgres.

So… why is that? In this article we’ll delve into Postgres’ design and what makes it special.

This article should give you a solid understanding of Postgres’ strengths and weaknesses and will help you evaluate whether it’s the right tool for your workload.

At the very least, it’ll also help you sound smarter in your interviews when someone asks you about databases!

The following is the first section of the tech dive. The full article is available for Quastor Pro readers.

History of Postgres

Postgres was created in the late 1980s at UC Berkeley with the first release in 1989.

At the time, Ingres was another relational database that began in the 1970s at UC Berkeley and gained a lot of traction by the 80s. It was a proprietary database and competed with Oracle in the early 1980s for market share. By 1985 however, it became clear that Oracle would be the winner and that Ingres was dropping in market share.

Michael Stonebraker was the leader of the Ingres team and he left to go back to Berkeley in 1986 to create a new database that would address the issues with Ingres. He came up with the name POSTGRES (POST-Ingres).

Some of the primary benefits of Postgres over Ingres are

Better Support for Complex Objects - Ingres had limited support for complex objects (that encapsulated multiple related data elements) and made querying them inefficient (you’d have to make many queries to get all the data).
User Extensibility - Stonebraker wanted Postgres to provide extensibility for the user to create new data types (other than integers, floats, strings, etc.) and access methods (more than just B-Trees) and more. This wasn’t well supported in Ingres.
SQL Compliance - In the late 70s and early 80s, people still weren’t sure on which query language would win out. Ingres used QUEL whereas competitors like Oracle were using SQL. As you’re probably aware, SQL ended up winning out. Unfortunately, Ingres was slow to shift away from QUEL to SQL. This was a big factor in it’s demise.

However, adding SQL was not a major goal of Stonebraker’s. Instead,
SQL was added to Postgres in 1995. Postgres would later tack on SQL to the name (renamed it to PostgreSQL) so that potential users could be sure that it had strong support for SQL.

For more on this, you can check out the original paper on the design of Postgres here.

Postgres’ Key Selling Points

The design choices I talked about above were from the 80s. Starting a new relational database now with a main selling feature being SQL compliance is probably not a good idea (although, it still is a selling point for many NoSQL databases like Clickhouse or ScyllaDB).

So, what distinguishes Postgres nowadays?

Some of the main selling points are

Battle Tested - Tens of thousands of companies use Postgres and there’s been countless tests on benchmarks, scalability and more. Postgres is used (or has been used) at companies like Reddit, Uber, Skype, Spotify, etc.

With this, there’s a massive wealth of knowledge around potential issues, common bugs, pitfalls, etc. on forums like Stack Overflow, Slack/IRC, mailing threads and more.
Open Source & Community - Postgres has been open source since 1995 with a liberal license that’s similar to the BSD and MIT license. With this, it’s built up a vibrant community of developers who help teach the database and provide support for people with issues. Postgres also has outstanding documentation.
Features & Tooling - Postgres has a wide range of data types to support things like JSON, UUIDs, Booleans, Arrays, XML, etc. There’s also a countless number of cool features that handle common problems (for example, Foreign Data Wrappers let a database access data stored in external Postgres servers as if it were stored locally).

You also have great tooling with things like psql and PgAdmin.
Extensibility & Interoperability - As discussed above, one of the initial design goals of Postgres was extensibility. Over it’s 30 year history, there’s been a countless number of extensions that have been developed to make Postgres more powerful.
A few examples include
- PostGIS (adds support for geospatial objects)
- Timescale (ingest/query time series data)
- hstore (store key-value pairs in a single column).

Postgres Architecture

Object Relational Database

Postgres is considered an object-relational database management system (ORDBMS) rather than just being a relational database management system (RDBMS). This is because Postgres provides the standard relational database features but also provides features from object-oriented languages like table inheritance and function overloading.

Table inheritance lets you create a table that inherits all the columns and data from one or more parent tables. This can make it easier to represent hierarchical data and can also simplify table partitioning (splitting up the rows of a table across multiple sub tables based on a certain column-value for example).

With Postgres, you can also create functions to reuse a series of operations/queries. These functions can be written in SQL, Python, Perl, C and more. They can also be overloaded so you can have functions with the same name/output but a different number of input parameters. Functions can be set to trigger anytime there’s a certain database event (INSERT, UPDATE, DELETE, etc.).

These are a few examples of Postgres’s “object-oriented capabilities”. In addition to this, Postgres also obviously has all the capabilities you’d expect of a relational database.

Storage Engine

Database management systems can be subdivided into several components. There’s a component that handles requests/responses from clients. Another component takes care of parsing and optimizing SQL queries. Another keeps track of stored procedures.

from Database Internals by Alex Petrov (amazing book!)

At the core of the DBMS is the storage engine. It’s job is to handle the create/read/update/delete operations on disk and to do so in a performant (sometimes multi-threaded) way. This component can become extremely complicated when your DBMS starts introducing things like transactions.

Some databases ( MySQL, MariaDB, MongoDB and more) offer pluggable storage engines. The idea is that storage engines make different trade-offs so you build the DBMS with a modular architecture and add the ability to swap in new storage engines.

Commonly used storage engines include MyISAM, RocksDB, LevelDB, InnoDB and more. If you want an LSM-Tree based storage engine then you might go with RocksDB. If you’d prefer B-Trees, then you can go with MyISAM. We talked about this in more depth in our past tech dive on storage engines.

With Postgres, you do not have the ability to swap storage engines. You’ll frequently see this come up as a criticism of the database. Because of this design choice, you must use the decisions made around things like

Multi-version Concurrency Control (MVCC) - MVCC allows multiple database transactions to view the same data without locking it. We’ll talk about how this is useful and also why it’s sometimes criticized.
Write Ahead Log - Postgres uses a Write Ahead Log (WAL) for data durability and crash recovery. Changes to data are first written to this log before they’re applied to the database.
Vacuuming - Due to MVCC, Postgres does not physically remove data from a table when it’s deleted/updated. Instead, Postgres marks it with a tombstone. Later, Postgres runs a vacuum process where it goes through and deletes all the data marked with a tombstone from the tables.

However, Postgres has a vibrant ecosystem so there are attempts to change this. OrioleDB is a project that is building a new storage engine for Postgres with the goal of providing better performance and scalability.

This is the first part of the tech dive on Postgres. In the rest of the article, we’ll delve into

Multi-Version Concurrency Control with Postgres
Postgres’ Extension Ecosystem
Issues with Postgres
Postgres in 2024 (Neon, OrioleDB and more)

For the rest of the article, you can check out Quastor Pro here.

Thanks for supporting Quastor!

Are we at peak Vector Database?

With the current craze around LLMs, there’s been a surge in vector databases. Vector databases are fantastic for similarity search with complex objects. You can encode text, images or videos as vectors and then use a vector database to quickly find similar text/images/videos.

With LLMs, they’re useful for Retrieval-Augmented Generation. This is where you encode all the text in your knowledge base into vectors and store them in a vector database.

When someone queries the LLM, a program will first take the user’s query, turn it into a vector and search the vector database for similar vectors (to find matching documents from your knowledgebase).

Then, it’ll take the matching documents and add that to the LLM query and then feed that to the LLM to generate output. This allows you to run “ChatGPT on your documents”.

This article delves into whether vector databases are currently overhyped.

softwaredoug.com/blog/2024/01/24/are-we-at-peak-vector-db

Writing documentation for your team

Good Documentation is one of the biggest productivity boosts for an engineering team.

Here’s some tips on writing great docs

1. Be Selective - Documenting everything leads to noise and a big maintenance burden. Be explicit about why something is worth documenting.

2. Layer your Communication - Use headings and subheadings liberally.

3. Link to Sources of Truth - This makes it easier to update the docs when things change.

www.shubhro.com/2020/01/06/writing-documentation-team

How Companies Do Hyperscale with Cell-based Architectures

This is a fantastic introduction to cell-based architectures. You design your systems into “cells” where each cell is a self-contained unit of the system with it’s own data and services. Each cell operates independently to maximize fault tolerance and simplify management.

Cell-based architectures have been used at companies like Slack, DoorDash and Amazon.

newsletter.systemdesign.one/p/cell-based-architecture

Finding the Right Company to Reach Staff Engineer

This is a great blog post that delves into how to pick the right company for career progression, especially if you’re aiming for Staff+ roles.

Things to consider are
- Is the company meriotcratic or proceduralist
- What type of Staff Engineers does the company hire
- Do you have adequate support at the company?

lethain.com/finding-the-right-company