Zilliz, the company behind the open-source Milvus vector database, shares its approach to scaling vector search to billions of vectors. Milvus employs a multi-tier storage architecture spanning GPU memory to object storage, enabling flexible trade-offs between performance, cost, and data freshness. The system uses GPU acceleration for both index building and search, implements real-time search through a buffer strategy, and handles distributed consistency challenges at scale.
This case study explores how Zilliz approaches the challenges of running vector search at massive scale in production through its open-source Milvus vector database. The insights come from Charles Xie, founder and CEO of Zilliz, who was previously a founding engineer of Oracle's 12c cloud database.
The core challenge addressed is scaling vector search to handle billions of vectors while maintaining performance and managing costs. Some of their largest deployments handle around 100 billion vectors, approaching internet-scale vector search. The vectors themselves have grown in dimensionality from 512 dimensions a few years ago to 1,000-2,000 dimensions today, creating additional scaling challenges.
Key Technical Approaches:
* Multi-tier Storage Architecture
The system implements a sophisticated four-layer storage hierarchy:
- GPU Memory: Fastest tier, storing ~1% of data for maximum performance
- RAM: Second tier, storing ~10% of frequently accessed data
- Local SSD: High-speed disk storage for larger data volumes
- Object Storage (like S3): Bottom tier for cost-effective bulk storage
This architecture allows users to make explicit trade-offs between performance, cost, and data volume. Organizations can configure the ratio of data across tiers based on their specific requirements - whether they need maximum performance (more data in higher tiers) or cost efficiency (more data in lower tiers).
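To make these ratios concrete, the sketch below models a tier-placement policy in Python. This is an illustrative model, not Milvus's actual implementation: the segment/access-frequency representation is assumed, and the SSD/object-storage split is invented to fill out the ~1% GPU and ~10% RAM figures from the case study.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    capacity_ratio: float  # fraction of total data this tier should hold

# ~1% GPU and ~10% RAM mirror the case study; the remaining split between
# SSD and object storage is an illustrative assumption.
TIERS = [
    Tier("gpu_memory", 0.01),
    Tier("ram", 0.10),
    Tier("local_ssd", 0.30),
    Tier("object_storage", 0.59),
]

def assign_tiers(segments):
    """Map the hottest segments to the fastest tiers.

    `segments` is a list of (segment_id, access_frequency) pairs;
    returns {segment_id: tier_name}.
    """
    ranked = sorted(segments, key=lambda s: s[1], reverse=True)
    placement, cursor = {}, 0
    for tier in TIERS:
        quota = round(tier.capacity_ratio * len(ranked))
        for seg_id, _ in ranked[cursor:cursor + quota]:
            placement[seg_id] = tier.name
        cursor += quota
    for seg_id, _ in ranked[cursor:]:  # rounding remainder goes to the bottom tier
        placement[seg_id] = "object_storage"
    return placement

# Example: ten segments with descending hotness; with so few segments the
# GPU quota rounds to zero, so the hottest segment lands in RAM.
print(assign_tiers([(f"seg{i}", 1000 - 100 * i) for i in range(10)]))
```

Adjusting the ratios is exactly the knob the case study describes: shifting more data into the upper tiers buys latency at higher cost.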
* Real-time Search Capabilities
To handle the challenge of making newly written data immediately searchable without waiting for index building:
- New data goes into a buffer that is searched by brute force (no index required)
- When the buffer reaches a threshold, it triggers background index building
- Search queries combine results from both the buffer and the main indexed storage (sketched after this list)
- This approach is similar to LSM trees in traditional databases but adapted for vector search
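The sketch below is a toy version of that strategy (the class name, flush threshold, and NumPy brute-force scan are illustrative assumptions; in Milvus the sealed segments are served by ANN indexes built in the background, not a second brute-force pass):

```python
import numpy as np

class BufferedVectorStore:
    """Toy model of buffered writes: fresh vectors are immediately
    searchable via brute force, sealed vectors via the 'index'."""

    def __init__(self, dim, flush_threshold=10_000):
        self.dim = dim
        self.flush_threshold = flush_threshold
        self.buffer = []   # recent, unindexed vectors
        self.indexed = []  # stand-in for sealed, indexed segments

    def insert(self, vec):
        self.buffer.append(np.asarray(vec, dtype=np.float32))
        if len(self.buffer) >= self.flush_threshold:
            # In Milvus this point would trigger background index building.
            self.indexed.extend(self.buffer)
            self.buffer = []

    def search(self, query, k=10):
        query = np.asarray(query, dtype=np.float32)

        def top_k(vectors):
            if not vectors:
                return []
            dists = np.linalg.norm(np.stack(vectors) - query, axis=1)
            # Positions are local to each segment; a real system tracks global IDs.
            return sorted(zip(dists.tolist(), range(len(vectors))))[:k]

        # Brute-force scan of the buffer, merged with the indexed side
        # (also brute force here; a real system would query an ANN index).
        return sorted(top_k(self.buffer) + top_k(self.indexed))[:k]
```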
* GPU Acceleration
The system leverages GPUs for both index building and search operations (see the example after this list):
- Collaborated with NVIDIA to develop GPU-optimized indexing algorithms
- Implements sophisticated data prefetching and caching between CPU and GPU
- Most effective for high-throughput scenarios (10,000-50,000 queries per second)
- Required careful optimization of data transmission between memory tiers
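As an illustration of what this looks like from the client side, the snippet below requests a GPU-built index through pymilvus. It assumes a Milvus deployment with GPU support and an existing `documents` collection with an `embedding` vector field; `GPU_CAGRA` (the graph index from the NVIDIA collaboration) and its parameters follow the Milvus documentation and may differ across versions.

```python
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
collection = Collection("documents")  # assumed to exist with an "embedding" field

collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "GPU_CAGRA",  # GPU graph index co-developed with NVIDIA
        "metric_type": "L2",
        "params": {
            "intermediate_graph_degree": 64,  # build-time graph width
            "graph_degree": 32,               # final graph degree
        },
    },
)
collection.load()  # bring the indexed segments online for search
```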
* Distributed System Challenges
Running at scale required solving several distributed systems challenges:
- Implemented the Raft consensus protocol for distributed consistency
- Supports different consistency levels (eventual vs strong) based on use case needs, as illustrated after this list
- Uses distributed write-ahead logging through Kafka or Apache Pulsar
- Handles data partitioning and load balancing across nodes
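Those consistency levels are exposed per request in the pymilvus client, as sketched below (collection name, field name, and search parameters are placeholders):

```python
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
collection = Collection("documents")

query_vector = [[0.1] * 768]  # placeholder 768-dim query embedding

# Strong consistency: read-your-writes semantics, higher latency.
strong_results = collection.search(
    data=query_vector,
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
    consistency_level="Strong",
)

# Eventual consistency: lowest latency, may miss very recent inserts.
fast_results = collection.search(
    data=query_vector,
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
    consistency_level="Eventually",
)
```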
* Emerging Capabilities and Research
The team is working on several advanced features:
- Support for new embedding types like ColBERT and Matryoshka embeddings (a truncation sketch follows this list)
- Integration with traditional search techniques like BM25
- Graph embeddings for knowledge base search
- Self-learning indices that adapt to specific data distributions and query patterns
- Autonomous tuning of index parameters using machine learning
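Of these, Matryoshka embeddings are the easiest to illustrate: they are trained so that prefixes of a vector are themselves usable embeddings, letting a database index a short, cheap prefix and keep the full vector for re-ranking. A minimal sketch, with illustrative dimensions:

```python
import numpy as np

def truncate_matryoshka(embedding, dim):
    """Keep the first `dim` components and re-normalize to unit length."""
    prefix = np.asarray(embedding, dtype=np.float32)[:dim]
    return prefix / np.linalg.norm(prefix)

full = np.random.rand(1536).astype(np.float32)  # e.g. a 1536-dim embedding
coarse = truncate_matryoshka(full, 256)         # ~6x smaller index footprint
```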
* Production Considerations
Several key factors influence production deployments:
- Requirements vary widely, from early-stage startups prioritizing ease of use to enterprises focused on performance and compliance
- Vector size significantly impacts network transfer and storage requirements (see the back-of-envelope figures below)
- System needs to handle both batch and real-time operations
- Trade-offs between consistency, latency, and throughput must be carefully managed
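A quick back-of-envelope calculation shows why vector size dominates these considerations; the figures below reuse the case study's scale numbers, with the dimensionality chosen as a typical modern value:

```python
num_vectors = 100_000_000_000  # ~100B vectors in the largest deployments
dims = 1536                    # typical modern embedding dimensionality
bytes_per_float = 4            # float32

raw_bytes = num_vectors * dims * bytes_per_float
print(f"Raw vector data: {raw_bytes / 1e15:.2f} PB")   # ~0.61 PB

# At the case study's tier ratios, only a sliver of that needs to be hot:
print(f"GPU tier (~1%):  {raw_bytes * 0.01 / 1e12:.1f} TB")
print(f"RAM tier (~10%): {raw_bytes * 0.10 / 1e12:.1f} TB")
```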
The case study demonstrates how modern vector search systems must balance multiple competing concerns in production:
- Performance vs Cost: Through tiered storage and flexible deployment options
- Freshness vs Latency: Via buffer-based real-time search
- Scalability vs Complexity: Through distributed system design
- Consistency vs Performance: By offering configurable consistency levels
The system's architecture shows how production vector search requires going beyond just the algorithmic challenges of similarity search to address the full spectrum of operational requirements. This includes handling real-time updates, managing system resources efficiently, and providing the flexibility to optimize for different use cases.