The Architecture of Notion's Data Lake

We'll talk about how Notion built their own in-house data lake to power analytics and ML jobs. Plus, how to build a strong relationship with your manager, Instacart's text-to-image service and more.

Hey Everyone!

Today we’ll be talking about

  • The Architecture of Notion’s Data Lake

    • Notion has been experiencing exponential growth. The data they store has been doubling every 6-12 months

    • They needed to build a data pipeline to extract data from Postgres and store it in a data warehouse for analytics and ML jobs

    • They built their system using AWS S3, Snowflake, Fivetran, Apache Hudi, Spark, and more

    • We’ll talk about some problems they faced and how they resolved them

  • Tech Snippets

    • How to build a strong relationship with your manager

    • How Instacart built a text-to-image service to generate images for grocery stores

    • How to improve hiring quality at your organization

Many engineering roles today need developers to get involved in product decisions, talk to users and analyze usage data. However, understanding how to use analytics data to make the right decision is hard.

Product for Engineers wrote a fantastic blog post delving into some of the mistakes devs make when they’re trying to make product decisions based on analytics data.

Some of the mistakes include

  • Making it too Complicated - It’s easy to get overwhelmed by the huge swath of data tools. Instead, start small. Pick a specific feature and track its usage with trends and retention. Use that to iterate. 

  • Not Using Session Replays - Session replays are a fantastic tool for uncovering bugs, unexpected behavior and UX issues. They have a very high information density and aren’t just for PMs or marketers.

  • Only focusing on the Numbers - Relying on data alone is like tying one arm behind your back. You also need qualitative data like surveys and user interviews. Combining the two will help you build better products.

For the rest of the mistakes, check out Product for Engineers. It’s a fantastic newsletter by PostHog that helps developers learn how to build apps that users love.

To hone your product skills and read more articles like this, check out Product for Engineers below.

partnership

The Architecture of Notion’s Data Lake

Notion is a productivity application that serves as an “all-in-one” workspace. It’s incredibly powerful and you can use it to write notes, build spreadsheets, manage calendars, store documents, track timelines and much more.

The company was founded in 2013 and has since grown to over 35 million users and a $10 billion valuation.

With this exponential user growth comes massive scaling pains. The Notion team has had to deal with their user data doubling every 6-12 months.

Now, user content on Notion (notes, images, documents, etc.) has grown to hundreds of terabytes of compressed data.

The company uses a cluster of Postgres servers for storing this data with 96 physical machines.

Previously, they used Postgres for both their online and offline traffic. Online traffic consists of read/write requests from people using the Notion app. Offline traffic is from Notion employees running data analytics and machine learning jobs.

In 2021, Notion’s data engineers decided that they needed to build dedicated infrastructure for handling the offline traffic so that it wouldn’t interfere with requests from users. To solve this, the Notion engineers built an ELT pipeline.

Last week, they published a fantastic blog post talking about the design choices they made for their pipeline and data lake. They also delved into their tech stack and why they picked those specific technologies (Snowflake, S3, Fivetran and more). 

Brief Intro to ETL and ELT

The data for your app is probably in a bunch of different places. You might have payment data in Stripe, customer data in Postgres, website-usage data in Google Analytics, etc.

ETL and ELT are processes for extracting data from all the different sources and loading a cleaned version into your data warehouse.

ETL (Extract-Transform-Load) has been the traditional approach with data warehousing where you extract data from the sources, transform the data in your data pipelines (clean and aggregate it) and then load it into your data warehouse.

A newer paradigm is ELT (Extract-Load-Transform), where you extract the raw, unstructured data and load it into your data warehouse. Then, you run the transform step on the data in the data warehouse. With this approach, you can have more flexibility in how you do your data transformations compared to using data pipelines.

In order to effectively do ELT, you need to use a “modern data warehouse” that can ingest unstructured data and run complex queries and aggregations to clean and transform it. Examples of modern data warehouses include Snowflake, AWS Redshift, Google BigQuery and more.

Notion’s ELT Approach

In 2021, Notion built an ELT pipeline with Fivetran and Snowflake. Fivetran is a platform that moves data between your sources and the data warehouse you’re using. It comes with hundreds of pre-built connectors that can connect to Stripe, Postgres, Salesforce, MySQL, etc. to ingest data.

You can perform transformations for cleaning, validating, structuring, etc. in Fivetran and then load the data into your data warehouse (Redshift, Snowflake, etc.).

Notion connected Fivetran to ingest data from Postgres (by reading Postgres’ Write Ahead Log) and send it to Snowflake. They set up 480 Fivetran connectors to connect to each of the 480 logical Postgres shards Notion has in their fleet.

Scaling Challenges

Notion started facing issues with this ELT setup.

  1. Operability - monitoring and maintaining 480 Fivetran connectors was creating a significant on-call burden for the engineers. Developers would also have to resync them during Postgres upgrade/maintenance periods.

  2. Data Freshness and Speed - Ingesting data to Snowflake became slow and costly due to Notion’s specific workload (they have a very update-heavy workload). Most data warehouses are optimized for insert-heavy workloads instead of update-heavy.

  3. Use Case Support - the data transformation logic Notion wanted to implement became increasingly complex. It was challenging to implement this in the standard SQL interface offered by off-the-shelf data warehouses.  

The Architecture of Notion’s In-house Data Lake

To solve this, Notion decided to build their own data lake internally.

Here’s the high level architecture.

Notion shifted their pipeline to use Debezium (for transferring data between sources and the data warehouse), Apache Spark (for processing and transforming data), AWS S3 (for the data warehouse) and Apache Hudi (for writing updates from Kafka to S3).

Note - despite this shift, Notion continues to use Snowflake and Fivetran. They found Snowflake works well for insert-heavy workloads and that Fivetran is effective for non-update heavy tables and small dataset ingestion.

Debezium is an open-source platform for Change Data Capture (CDC). Notion has Debezium CDC connectors set up with their Postgres shards. As new data/changes are written to Postgres, these connectors will send the updates to Apache Kafka.

From Kafka, these updates are cleaned, aggregated and transformed using Apache Spark.

Spark is an open source data processing framework that can process hundreds of terabytes of data in a distributed way. It offers a ton of customization options around partitioning and resource allocation. We did a deep dive on Spark that you can read here.

After processing, the data is written to AWS S3 using Apache Hudi. Hudi is a data management framework that brings ACID transactional guarantees to data lakes like S3. It makes it much easier to make incremental updates to your data lake while having rollbacks, auditing and debugging capabilities.

With this new architecture, Notion was able to reduce the end-to-end ingestion time from over a day to minutes/hours. Additionally, it’s resulted in millions of dollars of savings and helped Notion integrate new AI features into the app.

Feature flags make it simple to switch a new feature on/off without requiring a redeploy. They lower the stakes when releasing new features by making rollbacks extremely easy.

Here are some tips on using Feature Flags

  1. Ensure that each feature flag has a clear and documented purpose. A single flag should not be used for multiple, unrelated features

  2. Flags should be temporary and removed from the codebase once they’re no longer needed

  3. Use a centralized system or tool to manage feature flags

For more tips on feature flags, running A/B tests, talking to users, pricing for SaaS and much more, check out Product for Engineers.

This is an awesome newsletter that teaches product skills to software developers.

Having a great sense of product will help you ship more impactful features at work so you can get promoted faster.

sponsored

Tech Snippets