The Architecture of Notion's Data Lake

We'll talk about how Notion built an in-house data lake to power analytics and ML jobs. Plus, how to build a strong relationship with your manager, Instacart's text-to-image service and more.

Hey Everyone!

Today we’ll be talking about

  • The Architecture of Notion’s Data Lake

    • Notion has been experiencing exponential growth. The data they store has been doubling every 6-12 months

    • They needed to build a data pipeline to extract data from Postgres and store it in a data warehouse for analytics and ML jobs

    • They built their system using AWS S3, Snowflake, Fivetran, Apache Hudi, Spark, and more

    • We’ll talk about some problems they faced and how they resolved them

  • Tech Snippets

    • How to build a strong relationship with your manager

    • How Instacart built a text-to-image service to generate images for grocery stores

    • How to improve hiring quality at your organization

The Architecture of Notion’s Data Lake

Notion is a productivity application that serves as an “all-in-one” workspace. It’s incredibly powerful and you can use it to write notes, build spreadsheets, manage calendars, store documents, track timelines and much more.

The company was founded in 2013 and has since grown to over 35 million users and a $10 billion valuation.

With this exponential user growth comes massive scaling pains. The Notion team has had to deal with their user data doubling every 6-12 months.

Now, user content on Notion (notes, images, documents, etc.) has grown to hundreds of terabytes of compressed data.

The company stores this data in a cluster of Postgres servers running on 96 physical machines.

Previously, they used Postgres for both their online and offline traffic. Online traffic consists of read/write requests from people using the Notion app. Offline traffic is from Notion employees running data analytics and machine learning jobs.

In 2021, Notion’s data engineers decided they needed dedicated infrastructure for the offline traffic so that it wouldn’t interfere with requests from users. To do this, they built an ELT pipeline.

Last week, they published a fantastic blog post talking about the design choices they made for their pipeline and data lake. They also delved into their tech stack and why they picked those specific technologies (Snowflake, S3, Fivetran and more). 

Brief Intro to ETL and ELT

The data for your app is probably in a bunch of different places. You might have payment data in Stripe, customer data in Postgres, website-usage data in Google Analytics, etc.

ETL and ELT are processes for extracting data from all the different sources and loading a cleaned version into your data warehouse.

ETL (Extract-Transform-Load) is the traditional approach to data warehousing: you extract data from the sources, transform it in your data pipelines (cleaning and aggregating it) and then load it into your data warehouse.

A newer paradigm is ELT (Extract-Load-Transform), where you extract the raw, unstructured data and load it into your data warehouse first. Then, you run the transform step on the data inside the warehouse. This gives you more flexibility in how you transform the data than fixed transformation pipelines do.

In order to effectively do ELT, you need to use a “modern data warehouse” that can ingest unstructured data and run complex queries and aggregations to clean and transform it. Examples of modern data warehouses include Snowflake, AWS Redshift, Google BigQuery and more.
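
To make the “transform inside the warehouse” idea concrete, here’s a minimal sketch using Snowflake’s Python connector. The table, column and connection names (raw_events, payload, daily_edits, ANALYTICS_WH) are made up for illustration and aren’t from Notion’s setup.

```python
# Minimal ELT sketch: the raw rows were already loaded (the "EL" part), and the
# transform step runs as SQL inside the warehouse. All names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="<user>", password="<password>", account="<account>",
    warehouse="ANALYTICS_WH", database="RAW", schema="PUBLIC",
)

with conn.cursor() as cur:
    # Clean and aggregate the raw, semi-structured rows into a reporting table.
    cur.execute("""
        CREATE OR REPLACE TABLE daily_edits AS
        SELECT
            payload:workspace_id::string                     AS workspace_id,
            DATE_TRUNC('day', payload:edited_at::timestamp)  AS day,
            COUNT(*)                                         AS edits
        FROM raw_events                -- loaded as-is by the extract/load step
        WHERE payload:event_type::string = 'block_updated'
        GROUP BY 1, 2
    """)
conn.close()
```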

Notion’s ELT Approach

In 2021, Notion built an ELT pipeline with Fivetran and Snowflake. Fivetran is a platform that moves data between your sources and the data warehouse you’re using. It comes with hundreds of pre-built connectors that can connect to Stripe, Postgres, Salesforce, MySQL, etc. to ingest data.

You can perform transformations for cleaning, validating, structuring, etc. in Fivetran and then load the data into your data warehouse (Redshift, Snowflake, etc.).

Notion used Fivetran to ingest data from Postgres (by reading Postgres’ Write-Ahead Log) and send it to Snowflake. They set up 480 Fivetran connectors, one for each of the 480 logical Postgres shards in their fleet.
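
Fivetran manages the WAL reading itself, but it helps to see the Postgres mechanism underneath: a logical replication slot that a connector streams changes from. Here’s a rough sketch of creating and monitoring one with psycopg2; the slot name, database and host are placeholders, not Notion’s actual setup.

```python
# Hypothetical sketch of the Postgres side of WAL-based CDC: a logical
# replication slot that a connector (Fivetran, Debezium, etc.) reads from.
import psycopg2

conn = psycopg2.connect("dbname=notion_shard_042 user=replicator host=<shard-host>")
conn.autocommit = True

with conn.cursor() as cur:
    # Create a logical replication slot using the built-in pgoutput plugin.
    cur.execute(
        "SELECT * FROM pg_create_logical_replication_slot(%s, 'pgoutput')",
        ("fivetran_slot",),
    )
    # Check how far behind each slot is. Unread WAL piles up on disk, which is
    # part of why 480 connectors become a burden during upgrades/maintenance.
    cur.execute("""
        SELECT slot_name,
               pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
        FROM pg_replication_slots
    """)
    print(cur.fetchall())
conn.close()
```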

Scaling Challenges

Notion started facing issues with this ELT setup.

  1. Operability - monitoring and maintaining 480 Fivetran connectors was creating a significant on-call burden for the engineers. Developers would also have to resync them during Postgres upgrade/maintenance periods.

  2. Data Freshness and Speed - Ingesting data into Snowflake became slow and costly because Notion’s workload is heavily update-based. Most data warehouses are optimized for insert-heavy workloads, not update-heavy ones.

  3. Use Case Support - the data transformation logic Notion wanted to implement became increasingly complex. It was challenging to implement this in the standard SQL interface offered by off-the-shelf data warehouses.  

The Architecture of Notion’s In-house Data Lake

To solve this, Notion decided to build their own data lake internally.

Here’s the high-level architecture.

Notion shifted their pipeline to use Debezium (for capturing changes from Postgres and streaming them to Kafka), Apache Spark (for processing and transforming the data), AWS S3 (for storing the data lake) and Apache Hudi (for writing updates from Kafka to S3).

Note - despite this shift, Notion continues to use Snowflake and Fivetran. They found that Snowflake works well for insert-heavy workloads and that Fivetran is effective for tables that aren’t update-heavy and for ingesting small datasets.

Debezium is an open-source platform for Change Data Capture (CDC). Notion has Debezium CDC connectors set up with their Postgres shards. As new data/changes are written to Postgres, these connectors will send the updates to Apache Kafka.
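
To give a sense of what a Debezium Postgres connector looks like, here’s a hedged sketch of registering one through Kafka Connect’s REST API. The connector name, hosts, topic prefix and table list are placeholders, not Notion’s real configuration.

```python
# Sketch of registering a Debezium Postgres CDC connector via Kafka Connect's
# REST API. All names, hosts and credentials are illustrative placeholders.
import json
import requests

connector = {
    "name": "postgres-shard-042-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",                # Postgres logical decoding plugin
        "database.hostname": "shard-042.example.internal",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "<secret>",
        "database.dbname": "notion",
        "topic.prefix": "shard_042",              # topics named shard_042.<schema>.<table>
        "table.include.list": "public.block",     # only stream the tables you need
    },
}

# Kafka Connect exposes a REST endpoint for creating connectors.
resp = requests.post(
    "http://kafka-connect.example.internal:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```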

From Kafka, these updates are cleaned, aggregated and transformed using Apache Spark.

Spark is an open source data processing framework that can process hundreds of terabytes of data in a distributed way. It offers a ton of customization options around partitioning and resource allocation. We did a deep dive on Spark that you can read here.
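
Here’s a rough sketch of what that Kafka-to-Spark step can look like with PySpark Structured Streaming. It assumes the spark-sql-kafka package is available, and the broker address and topic name match the illustrative Debezium sketch above rather than Notion’s actual pipeline.

```python
# Minimal PySpark Structured Streaming sketch: read CDC events from Kafka and
# parse them for downstream cleaning/aggregation. Names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cdc-ingest").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka.example.internal:9092")
    .option("subscribe", "shard_042.public.block")   # Debezium topic from the sketch above
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers raw bytes; cast to strings and keep the fields we care about.
events = raw.select(
    F.col("key").cast("string").alias("record_key"),
    F.col("value").cast("string").alias("payload"),   # Debezium change event (JSON)
    F.col("timestamp").alias("kafka_ts"),
)
```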

After processing, the data is written to AWS S3 using Apache Hudi. Hudi is a data management framework that brings ACID transactional guarantees to data lakes like S3. It makes it much easier to make incremental updates to your data lake while having rollbacks, auditing and debugging capabilities.
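
And here’s a hedged sketch of what an incremental upsert into S3 with Hudi can look like. It assumes the Hudi Spark bundle is on the classpath; the bucket, table name and key fields are placeholders.

```python
# Sketch of an incremental upsert into an S3-backed data lake with Apache Hudi.
# Bucket, table and field names are placeholders, not Notion's actual tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

# In the real pipeline this would be the cleaned CDC data from Spark; a tiny
# in-memory frame keeps the example self-contained.
updates = spark.createDataFrame(
    [("block_123", "2024-07-01T12:00:00Z", "updated text", "2024-07-01")],
    ["block_id", "updated_at", "content", "ds"],
)

hudi_options = {
    "hoodie.table.name": "blocks",
    "hoodie.datasource.write.operation": "upsert",             # merge updates into existing rows
    "hoodie.datasource.write.recordkey.field": "block_id",     # key used to find the old row
    "hoodie.datasource.write.precombine.field": "updated_at",  # newest version wins
    "hoodie.datasource.write.partitionpath.field": "ds",       # partition layout on S3
}

(
    updates.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-data-lake/blocks/")
)
```

The upsert operation plus a record key is what makes this a good fit for an update-heavy workload: a new version of a row replaces the old one in the lake instead of piling up as a duplicate.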

With this new architecture, Notion was able to reduce the end-to-end ingestion time from over a day to minutes/hours. Additionally, it’s resulted in millions of dollars of savings and helped Notion integrate new AI features into the app.

Tech Snippets