The Architecture of Notion's Data Lake

We'll talk about how Notion built an in-house data lake to power analytics and ML jobs. Plus, how to build a strong relationship with your manager, Instacart's text-to-image service and more.

Hey Everyone!

Today we’ll be talking about

  • The Architecture of Notion’s Data Lake

    • Notion has been experiencing exponential growth. The data they store has been doubling every 6-12 months

    • They needed to build a data pipeline to extract data from Postgres and store it in a data warehouse for analytics and ML jobs

    • They built their system using AWS S3, Snowflake, Fivetran, Apache Hudi, Spark, and more

    • We’ll talk about some problems they faced and how they resolved them

  • Tech Snippets

    • How to build a strong relationship with your manager

    • How Instacart built a text-to-image service to generate images for grocery stores

    • How to improve hiring quality at your organization

The Architecture of Notion’s Data Lake

Notion is a productivity application that serves as an “all-in-one” workspace. It’s incredibly powerful and you can use it to write notes, build spreadsheets, manage calendars, store documents, track timelines and much more.

The company was founded in 2013 and has since grown to over 35 million users and a $10 billion valuation.

With this exponential user growth comes massive scaling pains. The Notion team has had to deal with their user data doubling every 6-12 months.

Now, user content on Notion (notes, images, documents, etc.) has grown to hundreds of terabytes of compressed data.

The company stores this data in a cluster of Postgres servers running on 96 physical machines.

Previously, they used Postgres for both their online and offline traffic. Online traffic consists of read/write requests from people using the Notion app. Offline traffic is from Notion employees running data analytics and machine learning jobs.

In 2021, Notion’s data engineers decided they needed dedicated infrastructure for the offline traffic so that it wouldn’t interfere with requests from users. To do this, they built an ELT pipeline.

Last week, they published a fantastic blog post talking about the design choices they made for their pipeline and data lake. They also delved into their tech stack and why they picked those specific technologies (Snowflake, S3, Fivetran and more). 

Brief Intro to ETL and ELT

The data for your app is probably in a bunch of different places. You might have payment data in Stripe, customer data in Postgres, website-usage data in Google Analytics, etc.

ETL and ELT are processes for extracting data from all the different sources and loading a cleaned version into your data warehouse.

ETL (Extract-Transform-Load) is the traditional approach to data warehousing: you extract data from the sources, transform it in your data pipelines (cleaning and aggregating it) and then load it into your data warehouse.

A newer paradigm is ELT (Extract-Load-Transform), where you extract the raw, unstructured data and load it into your data warehouse first. Then, you run the transform step on the data inside the warehouse. This gives you more flexibility in how you transform the data than fixed transformation pipelines do.

In order to effectively do ELT, you need to use a “modern data warehouse” that can ingest unstructured data and run complex queries and aggregations to clean and transform it. Examples of modern data warehouses include Snowflake, AWS Redshift, Google BigQuery and more.
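
To make the “transform inside the warehouse” idea concrete, here’s a minimal sketch using Snowflake’s Python connector. The table, column and connection names (raw_events, payload, daily_edits, ANALYTICS_WH) are made up for illustration and aren’t from Notion’s setup.

```python
# Minimal ELT sketch: the raw rows were already loaded (the "EL" part), and the
# transform step runs as SQL inside the warehouse. All names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="<user>", password="<password>", account="<account>",
    warehouse="ANALYTICS_WH", database="RAW", schema="PUBLIC",
)

with conn.cursor() as cur:
    # Clean and aggregate the raw, semi-structured rows into a reporting table.
    cur.execute("""
        CREATE OR REPLACE TABLE daily_edits AS
        SELECT
            payload:workspace_id::string                     AS workspace_id,
            DATE_TRUNC('day', payload:edited_at::timestamp)  AS day,
            COUNT(*)                                         AS edits
        FROM raw_events                -- loaded as-is by the extract/load step
        WHERE payload:event_type::string = 'block_updated'
        GROUP BY 1, 2
    """)
conn.close()
```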

Notion’s ELT Approach

In 2021, Notion built an ELT pipeline with Fivetran and Snowflake. Fivetran is a platform that moves data between your sources and the data warehouse you’re using. It comes with hundreds of pre-built connectors that can connect to Stripe, Postgres, Salesforce, MySQL, etc. to ingest data.

You can perform transformations for cleaning, validating, structuring, etc. in Fivetran and then load the data into your data warehouse (Redshift, Snowflake, etc.).

Notion used Fivetran to ingest data from Postgres (by reading Postgres’ Write-Ahead Log) and send it to Snowflake. They set up 480 Fivetran connectors, one for each of the 480 logical Postgres shards in their fleet.
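
Fivetran manages the WAL reading itself, but it helps to see the Postgres mechanism underneath: a logical replication slot that a connector streams changes from. Here’s a rough sketch of creating and monitoring one with psycopg2; the slot name, database and host are placeholders, not Notion’s actual setup.

```python
# Hypothetical sketch of the Postgres side of WAL-based CDC: a logical
# replication slot that a connector (Fivetran, Debezium, etc.) reads from.
import psycopg2

conn = psycopg2.connect("dbname=notion_shard_042 user=replicator host=<shard-host>")
conn.autocommit = True

with conn.cursor() as cur:
    # Create a logical replication slot using the built-in pgoutput plugin.
    cur.execute(
        "SELECT * FROM pg_create_logical_replication_slot(%s, 'pgoutput')",
        ("fivetran_slot",),
    )
    # Check how far behind each slot is. Unread WAL piles up on disk, which is
    # part of why 480 connectors become a burden during upgrades/maintenance.
    cur.execute("""
        SELECT slot_name,
               pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
        FROM pg_replication_slots
    """)
    print(cur.fetchall())
conn.close()
```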

Scaling Challenges

Notion started facing issues with this ELT setup.

  1. Operability - monitoring and maintaining 480 Fivetran connectors was creating a significant on-call burden for the engineers. Developers would also have to resync them during Postgres upgrade/maintenance periods.

  2. Data Freshness and Speed - Ingesting data into Snowflake became slow and costly because Notion’s workload is heavily update-based. Most data warehouses are optimized for insert-heavy workloads, not update-heavy ones.

  3. Use Case Support - the data transformation logic Notion wanted to implement became increasingly complex. It was challenging to implement this in the standard SQL interface offered by off-the-shelf data warehouses.  

The Architecture of Notion’s In-house Data Lake

To solve this, Notion decided to build their own data lake internally.

Here’s the high-level architecture.

Notion shifted their pipeline to use Debezium (for capturing changes from Postgres and streaming them to Kafka), Apache Spark (for processing and transforming the data), AWS S3 (for storing the data lake) and Apache Hudi (for writing updates from Kafka to S3).

Note - despite this shift, Notion continues to use Snowflake and Fivetran. They found that Snowflake works well for insert-heavy workloads and that Fivetran is effective for tables that aren’t update-heavy and for ingesting small datasets.

Debezium is an open-source platform for Change Data Capture (CDC). Notion has Debezium CDC connectors set up with their Postgres shards. As new data/changes are written to Postgres, these connectors will send the updates to Apache Kafka.
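
To give a sense of what a Debezium Postgres connector looks like, here’s a hedged sketch of registering one through Kafka Connect’s REST API. The connector name, hosts, topic prefix and table list are placeholders, not Notion’s real configuration.

```python
# Sketch of registering a Debezium Postgres CDC connector via Kafka Connect's
# REST API. All names, hosts and credentials are illustrative placeholders.
import json
import requests

connector = {
    "name": "postgres-shard-042-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",                # Postgres logical decoding plugin
        "database.hostname": "shard-042.example.internal",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "<secret>",
        "database.dbname": "notion",
        "topic.prefix": "shard_042",              # topics named shard_042.<schema>.<table>
        "table.include.list": "public.block",     # only stream the tables you need
    },
}

# Kafka Connect exposes a REST endpoint for creating connectors.
resp = requests.post(
    "http://kafka-connect.example.internal:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```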

From Kafka, these updates are cleaned, aggregated and transformed using Apache Spark.

Spark is an open source data processing framework that can process hundreds of terabytes of data in a distributed way. It offers a ton of customization options around partitioning and resource allocation. We did a deep dive on Spark that you can read here.
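
Here’s a rough sketch of what that Kafka-to-Spark step can look like with PySpark Structured Streaming. It assumes the spark-sql-kafka package is available, and the broker address and topic name match the illustrative Debezium sketch above rather than Notion’s actual pipeline.

```python
# Minimal PySpark Structured Streaming sketch: read CDC events from Kafka and
# parse them for downstream cleaning/aggregation. Names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cdc-ingest").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka.example.internal:9092")
    .option("subscribe", "shard_042.public.block")   # Debezium topic from the sketch above
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers raw bytes; cast to strings and keep the fields we care about.
events = raw.select(
    F.col("key").cast("string").alias("record_key"),
    F.col("value").cast("string").alias("payload"),   # Debezium change event (JSON)
    F.col("timestamp").alias("kafka_ts"),
)
```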

After processing, the data is written to AWS S3 using Apache Hudi. Hudi is a data management framework that brings ACID transactional guarantees to data lakes like S3. It makes it much easier to make incremental updates to your data lake while having rollbacks, auditing and debugging capabilities.
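
And here’s a hedged sketch of what an incremental upsert into S3 with Hudi can look like. It assumes the Hudi Spark bundle is on the classpath; the bucket, table name and key fields are placeholders.

```python
# Sketch of an incremental upsert into an S3-backed data lake with Apache Hudi.
# Bucket, table and field names are placeholders, not Notion's actual tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

# In the real pipeline this would be the cleaned CDC data from Spark; a tiny
# in-memory frame keeps the example self-contained.
updates = spark.createDataFrame(
    [("block_123", "2024-07-01T12:00:00Z", "updated text", "2024-07-01")],
    ["block_id", "updated_at", "content", "ds"],
)

hudi_options = {
    "hoodie.table.name": "blocks",
    "hoodie.datasource.write.operation": "upsert",             # merge updates into existing rows
    "hoodie.datasource.write.recordkey.field": "block_id",     # key used to find the old row
    "hoodie.datasource.write.precombine.field": "updated_at",  # newest version wins
    "hoodie.datasource.write.partitionpath.field": "ds",       # partition layout on S3
}

(
    updates.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-data-lake/blocks/")
)
```

The upsert operation plus a record key is what makes this a good fit for an update-heavy workload: a new version of a row replaces the old one in the lake instead of piling up as a duplicate.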

With this new architecture, Notion was able to reduce the end-to-end ingestion time from over a day to minutes/hours. Additionally, it’s resulted in millions of dollars of savings and helped Notion integrate new AI features into the app.

Tech Snippets