Grab implemented an LLM-based data classification system called Gemini to automate the tagging of sensitive data across its petabyte-scale data infrastructure. The system uses GPT-3.5 alongside existing third-party classification services to generate metadata tags for data entities automatically, replacing a manual process that was becoming unsustainable. It processed over 20,000 data entities within its first month, needed fewer than one tag correction per table on average, and saves an estimated 360 man-days per year of manual classification effort.
# LLM-Powered Data Classification at Grab
## Overview
Grab, a leading Southeast Asian super-app platform, implemented an LLM-powered data classification system for its petabyte-scale data infrastructure. The system automates the classification of sensitive data and the generation of governance-related metadata, replacing a manual and time-consuming process.
## Technical Implementation
### System Architecture
- Orchestration Service (Gemini)
### LLM Integration Considerations
- Context Length Management
- Resource Management
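
One way to picture context length management is to split a table's columns into several requests so that each prompt stays under the model's token limit. The sketch below is illustrative only: the source does not describe how Gemini does this, and `estimate_tokens`, `batch_columns`, the four-characters-per-token heuristic, and the budget values are all assumptions.

```python
from typing import Iterable, List


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def batch_columns(
    columns: Iterable[str], prompt_overhead: int, max_tokens: int
) -> List[List[str]]:
    """Group column names into batches that each fit within max_tokens.

    prompt_overhead is the token cost of the fixed instructions that are
    sent with every request (hypothetical parameter).
    """
    budget = max_tokens - prompt_overhead
    if budget <= 0:
        raise ValueError("prompt_overhead must be smaller than max_tokens")

    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name in columns:
        cost = estimate_tokens(name)
        if cost > budget:
            raise ValueError(f"column {name!r} alone exceeds the token budget")
        # Start a new batch when adding this column would overflow the budget.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        batches.append(current)
    return batches
```

In practice a real tokenizer would replace the character-count heuristic, and the budget would also need to leave room for the model's response.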
### Prompt Engineering Strategy
- Clear Task Definition
- Best Practices Implementation
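
A clear task definition usually means stating the role, the allowed tags, and the expected output format explicitly in the prompt. The following is a minimal sketch under stated assumptions: the tag names in `EXAMPLE_TAGS`, the wording of the prompt, and the `build_prompt` helper are hypothetical and are not taken from Grab's actual prompts.

```python
import json
from typing import Dict, List

# Hypothetical tag vocabulary; the source does not list Grab's actual tags.
EXAMPLE_TAGS: Dict[str, str] = {
    "PERSONAL_ID": "government-issued identifier such as a passport number",
    "CONTACT_INFO": "email address or phone number",
    "NONE": "no sensitive information",
}


def build_prompt(table: str, columns: List[str], tags: Dict[str, str]) -> str:
    """Build a classification prompt with an explicit task, tag list, and output format."""
    tag_lines = "\n".join(f"- {name}: {desc}" for name, desc in tags.items())
    return (
        "You are a data classification assistant.\n"
        f"Assign exactly one tag to each column of the table `{table}`.\n"
        "Allowed tags:\n"
        f"{tag_lines}\n"
        "Respond with a JSON object mapping each column name to a tag.\n"
        f"Columns: {json.dumps(columns)}"
    )
```

Asking for a JSON object keyed by column name keeps the response machine-readable, which simplifies downstream validation.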
### Data Processing Pipeline
- Input Processing
- Output Handling
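
Output handling for an LLM classifier typically includes checking that the response is well-formed before it is written to a metadata store. The sketch below assumes the model returns a JSON object mapping column names to tags; that format and the `parse_classification` helper are assumptions for illustration, not the format the source system uses.

```python
import json
from typing import Dict, Iterable, List


def parse_classification(
    raw: str, columns: List[str], allowed_tags: Iterable[str]
) -> Dict[str, str]:
    """Parse model output and keep only valid tags for the requested columns."""
    allowed = set(allowed_tags)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")

    missing = [col for col in columns if col not in data]
    if missing:
        raise ValueError(f"model output is missing columns: {missing}")

    result: Dict[str, str] = {}
    for col in columns:
        tag = data[col]
        # Reject tags outside the allowed vocabulary instead of storing them.
        if not isinstance(tag, str) or tag not in allowed:
            raise ValueError(f"column {col!r} has an unknown tag: {tag!r}")
        result[col] = tag
    # Columns the model invented (not in `columns`) are silently dropped.
    return result
```

Raising on any malformed response lets the caller decide whether to retry the request or route the entity to manual review.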
### Monitoring and Evaluation
- Accuracy Metrics
- Performance Tracking
- Cost Efficiency
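
The accuracy figure reported for this system is the average number of tag corrections per table. A minimal sketch of that calculation is shown here, assuming corrections are recorded as a count per table; the `mean_corrections_per_table` helper and the input shape are hypothetical.

```python
from typing import Dict


def mean_corrections_per_table(corrections: Dict[str, int]) -> float:
    """Average number of user tag corrections per reviewed table."""
    if not corrections:
        raise ValueError("at least one reviewed table is required")
    return sum(corrections.values()) / len(corrections)
```

A value below 1.0 corresponds to the "fewer than one correction per table on average" result.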
## Production Deployment
### Integration Points
- Metadata Management Platform
- Production Database Management Platform
- Data Governance Systems
- Attribute-based Access Control (ABAC) Systems
### Scalability Considerations
- Request Batching
- Rate Limiting
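
Rate limiting in front of an LLM API is often implemented with a token bucket, which allows short bursts while capping the sustained request rate. The sketch below is a generic token bucket and is not a description of Grab's implementation; the `TokenBucket` class, its parameters, and the injectable `clock` are assumptions for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if available, else False."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if cost <= self._tokens:
            self._tokens -= cost
            return True
        return False
```

A caller that receives `False` can queue the request and retry later, which pairs naturally with request batching.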
### Verification Workflow
- User Review Process
## Challenges and Solutions
### Initial Challenges
- Manual Classification Limitations
- Third-party Tool Issues
### LLM-based Solutions
- Natural Language Interface
- Automated Classification
## Future Developments
### Planned Improvements
- Enhanced Prompt Engineering
- Evaluation Framework
### Scaling Plans
- Platform Integration
- Use Case Expansion
## Impact and Results
### Quantitative Benefits
- Time Savings
- Processing Efficiency
### Qualitative Improvements
- Enhanced Data Governance
- Operational Efficiency
### Business Value
- Cost-effective scaling solution
- Improved data security and governance
- Enhanced operational efficiency
- Foundation for advanced data management features
# LLM Implementation for Enterprise Data Classification at Grab
## Background and Problem Statement
Grab, a leading Southeast Asian super-app platform, faced significant challenges in managing and classifying data across its petabyte-scale infrastructure. The company handles a very large number of data entities, including database tables and Kafka message schemas, and the sensitive information they contain must be classified carefully for proper data governance.
- Initial challenges:
## Technical Solution Architecture
### Orchestration Service (Gemini)
The orchestration service has several key components:
- Core Components:
- Technical Constraints Management:
### LLM Implementation Details
The system employs several prompt engineering techniques:
- Prompt Design Principles:
- Classification Process:
### Integration and Data Flow
The system implements a comprehensive data flow:
- Data Processing Pipeline:
## Operational Aspects
### Performance Metrics
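The system processed over 20,000 data entities in its first month, needed fewer than one tag correction per table on average, and is estimated to save 360 man-days per year: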
- Processing Capacity:
- Efficiency Gains:
### Quality Control and Monitoring
The implementation includes several quality control mechanisms:
- Verification Workflow:
### Infrastructure and Scaling
The system is designed for enterprise-scale operations:
- Technical Infrastructure:
## Future Developments
The team has outlined several areas for future enhancement:
- Prompt Engineering Improvements:
- Performance Monitoring:
- Scale-Out Plans:
## Best Practices and Learnings
Key insights from the implementation:
- Prompt Engineering:
- System Design:
- Integration Considerations:
The implementation serves as a model for enterprise-scale LLM operations, demonstrating how careful system design, prompt engineering, and operational considerations can create an effective, production-grade LLM-based classification system.