Notion faced challenges with rapidly growing data volume (10x in 3 years) and needed to support new AI features. They built a scalable data lake infrastructure using Apache Hudi, Kafka, Debezium CDC, and Spark to handle their update-heavy workload, cutting costs by over a million dollars and improving data freshness from days to minutes or hours. This data lake became crucial for successfully rolling out Notion AI features and the Search and AI Embedding RAG infrastructure.
# Building and Scaling Data Infrastructure for AI at Notion
## Company Overview and Challenge
Notion, a collaborative workspace platform, experienced massive data growth, with content doubling every 6-12 months and reaching hundreds of terabytes of compressed data across hundreds of billions of blocks. The company needed to build robust data infrastructure to support its AI features while managing this explosive growth efficiently.
## Technical Infrastructure Evolution
### Initial Setup
- Started with single Postgres instance
- Evolved to sharded architecture
### Data Warehouse Architecture 2021
- Used Fivetran for the ELT pipeline
- Ingested data from the Postgres WAL into Snowflake
- Ran 480 hourly connectors across the different shards
- Faced significant scaling challenges as the update-heavy workload and the number of shards grew
## New Data Lake Architecture
### Design Principles and Decisions
#### Storage and Processing
- Selected S3 as primary data repository
- Chose Spark as main processing engine
#### Data Ingestion Strategy
- Implemented a hybrid approach combining incremental CDC ingestion with full re-syncs when needed
- Used the Kafka Debezium CDC connector to stream Postgres changes
- Selected Apache Hudi for Kafka-to-S3 ingestion
### Implementation Details
#### CDC and Kafka Configuration
- One Debezium CDC connector per Postgres host (a connector-registration sketch follows this list)
- Deployed in an AWS EKS cluster
- Single Kafka topic per Postgres table
- Handles tens of MB/sec of row changes
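To make the CDC setup concrete, here is a minimal sketch of registering a Debezium Postgres connector through the Kafka Connect REST API, assuming one connector per shard host and Debezium's default one-topic-per-table naming. The Connect URL, host names, credentials, database name, table list, and shard identifiers are placeholders, not Notion's actual configuration.

```python
import requests

# Hypothetical endpoint for the Kafka Connect cluster running in EKS.
KAFKA_CONNECT_URL = "http://kafka-connect.internal:8083/connectors"

def register_cdc_connector(shard_host: str, shard_id: int) -> None:
    """Register one Debezium Postgres connector for a single shard host."""
    connector = {
        "name": f"postgres-cdc-shard-{shard_id}",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",              # Postgres logical decoding plugin
            "database.hostname": shard_host,
            "database.port": "5432",
            "database.user": "cdc_user",
            "database.password": "REPLACE_ME",      # placeholder; use a secrets provider
            "database.dbname": "notion",
            # Debezium 2.x topic prefix; topics default to <prefix>.<schema>.<table>,
            # which yields one Kafka topic per Postgres table.
            "topic.prefix": f"shard-{shard_id}",
            "table.include.list": "public.block,public.collection",
            "tasks.max": "1",
        },
    }
    resp = requests.post(KAFKA_CONNECT_URL, json=connector, timeout=30)
    resp.raise_for_status()

# Example: one connector per Postgres host.
for i, host in enumerate(["pg-shard-0.internal", "pg-shard-1.internal"]):
    register_cdc_connector(host, i)
```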
#### Hudi Optimization
- Used the COPY_ON_WRITE table type with the UPSERT operation (writer options sketched below)
- Partitioned data using the same shard scheme as Postgres
- Sorted data by last-updated time
- Used the bloom filter index type
- Achieved data freshness delays of a few minutes to two hours
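A minimal PySpark sketch of a Hudi writer configured along these lines is shown below. The table name, record key, precombine field, partition column, and S3 paths are assumptions for illustration; only the table type, write operation, and index type come from the bullets above.

```python
from pyspark.sql import SparkSession

# Assumes the Hudi Spark bundle is on the classpath (e.g. via --packages).
spark = (
    SparkSession.builder
    .appName("hudi-block-upsert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Hypothetical staged CDC rows for a `block` table.
updates = spark.read.parquet("s3://lake/staging/block_cdc/")

hudi_options = {
    "hoodie.table.name": "block",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",
    # The latest row per key wins, based on the last-updated timestamp.
    "hoodie.datasource.write.precombine.field": "last_updated_time",
    # Partition the lake table the same way Postgres is sharded.
    "hoodie.datasource.write.partitionpath.field": "shard_id",
    "hoodie.index.type": "BLOOM",
}

(
    updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://lake/hudi/block/")
)
```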
#### Spark Processing Setup
- Leveraged PySpark for the majority of jobs
- Used Scala Spark for complex operations
- Implemented separate handling for large and small shards
- Utilized multi-threading and parallel processing (see the sketch after this list)
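The sketch below shows one way the per-shard work could be parallelized from driver-side threads while treating large shards differently from small ones. The shard metadata, size threshold, paths, and Hudi options are illustrative, not Notion's actual job code; here each shard is written to its own Hudi table path so concurrent threads never write to the same table (an alternative to partitioning a single table by shard).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shard-ingestion").getOrCreate()

# Illustrative shard metadata; real shard lists and sizes would come from config.
SHARDS = [{"id": i, "row_estimate": 10_000_000 if i % 8 == 0 else 200_000} for i in range(32)]
LARGE_SHARD_THRESHOLD = 1_000_000

def ingest_shard(shard: dict) -> str:
    """Read one shard's staged CDC data and upsert it into that shard's Hudi table."""
    df = spark.read.parquet(f"s3://lake/staging/block_cdc/shard_id={shard['id']}/")
    if shard["row_estimate"] > LARGE_SHARD_THRESHOLD:
        df = df.repartition(64)  # give large shards more write parallelism
    (
        df.write.format("hudi")
        .option("hoodie.table.name", f"block_shard_{shard['id']}")
        .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "last_updated_time")
        .mode("append")
        .save(f"s3://lake/hudi/block/shard_{shard['id']}/")
    )
    return f"shard {shard['id']} done"

# Spark actions can be submitted concurrently from multiple driver threads sharing
# one SparkSession, so small shards don't have to wait behind large ones.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(ingest_shard, s) for s in SHARDS]
    for f in as_completed(futures):
        print(f.result())
```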
## AI and ML Infrastructure Support
### Enabling AI Features
- Built foundation for Notion AI features
- Supported Search and AI Embedding RAG infrastructure
- Enabled efficient vector database operations
- Provided fresh, denormalized data for AI models (a read-side sketch follows this list)
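Below is a hedged read-side sketch of how fresh block data in the lake could feed an embedding/RAG pipeline: Spark reads the Hudi table from S3, filters to recently updated rows, and hands the text off for embedding. The table path, column names (`id`, `workspace_id`, `text`, `last_updated_time`), cutoff value, and the embedding/vector-store calls are all placeholders, not Notion's actual services.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rag-embedding-feed").getOrCreate()

# Read the (hypothetical) denormalized block table maintained by the Hudi pipeline.
blocks = spark.read.format("hudi").load("s3://lake/hudi/block/")

# Only re-embed blocks updated since the last embedding run.
recent = (
    blocks
    .where(F.col("last_updated_time") > F.lit("2024-01-01T00:00:00Z"))
    .select("id", "workspace_id", "text")
)

def embed_partition(rows):
    """Placeholder: call an embedding model and upsert vectors into a vector store."""
    for row in rows:
        # vector = embedding_model.encode(row.text)                      # assumed external service
        # vector_store.upsert(row.id, vector, {"ws": row.workspace_id})  # assumed external service
        yield row.id

embedded = recent.rdd.mapPartitions(embed_partition).count()
print(f"queued {embedded} blocks for embedding")
```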
### Performance Improvements
- Reduced ingestion time from days to minutes or hours
- Achieved the ability to re-sync tables within 24 hours
- Saved over $1 million in costs for 2022
- Improved data freshness for AI/ML pipelines
## Production Operations
### Monitoring and Management
- Continuous optimization of infrastructure components
- Regular scaling of EKS and Kafka clusters
- Automated frameworks for self-service
- Monitoring of data freshness and quality
### Best Practices
- Maintained raw data as single source of truth
- Implemented staged data processing
- Used intermediate S3 storage for processed data (staged layout sketched below)
- Limited downstream system load through selective ingestion
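As a rough illustration of these practices, the sketch below keeps raw lake data untouched as the single source of truth, writes a processed copy to an intermediate S3 prefix, and exports only the columns a downstream system needs. All paths and column names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("staged-processing").getOrCreate()

# Stage 1: raw lake data in S3 is read-only and remains the single source of truth.
raw = spark.read.format("hudi").load("s3://lake/hudi/block/")

# Stage 2: clean/denormalize into an intermediate S3 location for reuse by many jobs.
processed = (
    raw.select("id", "workspace_id", "parent_id", "text", "last_updated_time")
       .where(F.col("text").isNotNull())
)
processed.write.mode("overwrite").parquet("s3://lake/processed/block_text/")

# Stage 3: selectively ingest only what a downstream system needs, limiting its load.
downstream = spark.read.parquet("s3://lake/processed/block_text/").select("id", "text")
downstream.write.mode("overwrite").parquet("s3://lake/exports/embedding_input/")
```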
## Future Developments
### Ongoing Improvements
- Building automated frameworks
- Developing self-service capabilities
- Enhancing product use cases
- Scaling infrastructure for continued growth
### Infrastructure Evolution
- Supporting new AI feature rollouts
- Improving data processing efficiency
- Expanding support for machine learning use cases
- Enhancing RAG infrastructure capabilities