Grab developed an automated data classification system using LLMs to replace manual tagging of sensitive data across their petabyte-scale data infrastructure. They built an orchestration service called Gemini that integrates GPT-3.5 to classify database columns and generate metadata tags, significantly reducing manual effort in data governance. The system successfully processed over 20,000 data entities within a month of deployment, with 80% user satisfaction and minimal need for tag corrections.
# LLM-Powered Data Classification at Grab
## Background and Problem Statement
Grab, a major Southeast Asian super-app platform, faced significant challenges in managing and classifying their petabyte-scale data infrastructure:
- Manual classification of data sensitivity was inefficient and inconsistent
- Half of all schemas were marked as highest sensitivity (Tier 1) due to conservative classification
- Table-level classification was not feasible manually due to data volume and velocity
- Inconsistent interpretation of data classification policies among developers
- Initial third-party classification tools had limitations in customization and accuracy
## Technical Solution Architecture
### Orchestration Service (Gemini)
- Built a dedicated orchestration service for metadata generation
- Key components: a message queue that aggregates incoming classification requests, a rate-limited GPT-3.5 client, and Kafka-based publishing of the generated tags back to the data platforms
### LLM Implementation Details
- Used GPT-3.5, working within the provider's rate limits and token constraints
- Prompt engineering techniques: instructing the model to return a structured JSON object and enforcing the expected output schema directly in the prompt (see the sketch below)
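The summary does not include Grab's actual prompt, but a minimal sketch of this pattern might look as follows, assuming the `openai` Python client and an illustrative tag taxonomy, table, and column names (none of these identifiers come from Grab's write-up):

```python
import json
from openai import OpenAI  # assumes the openai Python package (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical tag set; Grab's actual taxonomy is not given in this summary.
TAGS = ["PII_NAME", "PII_PHONE", "PII_EMAIL", "GEOLOCATION", "NOT_SENSITIVE"]

PROMPT_TEMPLATE = """You are a data classification assistant.
For each column below, pick exactly one tag from: {tags}.
Respond with a JSON object mapping column name to tag, and nothing else.

Table: {table}
Columns: {columns}
"""

def classify_columns(table: str, columns: list[str]) -> dict[str, str]:
    """Ask the model to tag each column and parse its JSON reply."""
    prompt = PROMPT_TEMPLATE.format(
        tags=", ".join(TAGS), table=table, columns=", ".join(columns)
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output makes the JSON easier to validate
    )
    return json.loads(resp.choices[0].message.content)

# Example usage with made-up column names
print(classify_columns("bookings", ["passenger_name", "pickup_lat", "fare_amount"]))
```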
### System Integration and Workflow
- Integration with Grab's data platforms through a Kafka-based event architecture
- Automated workflow: classification requests are queued, aggregated into mini-batches, sent to the LLM, and the generated tags are surfaced to data owners for verification (a simplified consumer loop is sketched below)
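A rough sketch of such an event-driven loop, assuming `kafka-python` and made-up topic names; `classify_columns` is the hypothetical helper from the prompt sketch above:

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # assumes the kafka-python package

consumer = KafkaConsumer(
    "classification-requests",          # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for event in consumer:
    req = event.value                   # e.g. {"table": "...", "columns": [...]}
    tags = classify_columns(req["table"], req["columns"])  # LLM call from the prompt sketch
    producer.send("generated-tags", {"table": req["table"], "tags": tags})
```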
## Performance and Results
### System Performance
- Processing capacity: more than 20,000 data entities classified within the first month of deployment
### Business Impact
- Time savings: manual tagging effort was largely eliminated, with 80% user satisfaction and minimal need for tag corrections
- Cost efficiency: aggregating columns into mini-batched LLM requests kept per-entity classification costs low
### Production Monitoring and Quality Control
- Weekly user verification process
- Plans to remove the verification mandate once an accuracy threshold is reached
- Building analytical pipelines for prompt performance metrics
- Version-controlling prompts so each generated tag can be traced to the prompt that produced it (a minimal logging sketch follows)
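One simple way to make prompt versions analyzable, sketched with hypothetical field names and a CSV sink rather than whatever pipeline Grab actually uses:

```python
import csv
from datetime import datetime, timezone

PROMPT_VERSION = "v3"  # hypothetical label, bumped whenever the prompt changes

def log_classification(table: str, column: str, tag: str, verified_ok: bool | None = None) -> None:
    """Append one row per generated tag; verified_ok is filled in after user review."""
    with open("classification_log.csv", "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),
            PROMPT_VERSION,
            table,
            column,
            tag,
            verified_ok,
        ])
```

Joining such a log against the weekly verification results would yield per-prompt-version accuracy, the metric on which lifting the verification mandate depends.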
## Technical Challenges and Solutions
### Rate Limiting and Scaling
- Implemented service-level rate limiting
- Message queue for request aggregation
- Mini-batch processing for optimal throughput (see the sketch below)
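A naive illustration of both ideas, with assumed quota and batch-size values (the real limits are set by the LLM provider and the service itself); `classify_columns` is the hypothetical helper from the prompt sketch:

```python
import time
from itertools import islice
from typing import Iterable, Iterator

MAX_REQUESTS_PER_MINUTE = 60   # hypothetical quota
BATCH_SIZE = 20                # hypothetical number of columns per LLM call

def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield fixed-size mini-batches from an iterable."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def classify_all(table: str, columns: list[str]) -> dict[str, str]:
    """Classify all columns of a table in mini-batches under a simple rate limit."""
    tags: dict[str, str] = {}
    for batch in batched(columns, BATCH_SIZE):
        tags.update(classify_columns(table, batch))   # LLM call from the earlier sketch
        time.sleep(60 / MAX_REQUESTS_PER_MINUTE)      # naive service-level rate limiting
    return tags
```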
### Output Format Control
- Structured JSON output format
- Schema enforcement in prompts
- Error handling for malformed outputs (a validation sketch follows)
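A sketch of what such validation could look like, assuming the hypothetical tag taxonomy from the prompt sketch; anything that fails a check is rejected so the caller can retry or route the entity to manual review:

```python
import json

# Hypothetical allowed taxonomy, matching the earlier prompt sketch
VALID_TAGS = {"PII_NAME", "PII_PHONE", "PII_EMAIL", "GEOLOCATION", "NOT_SENSITIVE"}

def parse_tags(raw: str, expected_columns: list[str]) -> dict[str, str] | None:
    """Return a validated column->tag mapping, or None if the output is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None                      # model did not return valid JSON
    if set(data) != set(expected_columns):
        return None                      # missing or extra columns
    if any(tag not in VALID_TAGS for tag in data.values()):
        return None                      # tag outside the allowed taxonomy
    return data
```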
### Integration Challenges
- Multiple data platform integration
- Kafka-based event architecture
- User feedback loop implementation
## Future Development Plans
### Technical Improvements
- Sample data integration in prompts
- Confidence level output implementation
- Automated verification based on confidence scores (see the sketch after this list)
- Analytical pipeline development for prompt evaluation
- Version control for prompts
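The summary states the intent but not the mechanism; one plausible shape, using a hypothetical threshold and a model-reported confidence value, is:

```python
AUTO_ACCEPT_THRESHOLD = 0.9  # hypothetical cut-off; tags below it still go to human review

def route_tag(column: str, tag: str, confidence: float) -> str:
    """Decide whether a generated tag can bypass manual verification."""
    if confidence >= AUTO_ACCEPT_THRESHOLD:
        return "auto_accepted"
    return "needs_review"

# Example: model outputs extended with self-reported confidence scores
print(route_tag("passenger_name", "PII_NAME", 0.97))   # -> auto_accepted
print(route_tag("notes", "NOT_SENSITIVE", 0.55))       # -> needs_review
```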
### Scale and Expansion
- Plans to extend to more data platforms
- Development of downstream applications
- Integration with security and data discovery systems
## Production Best Practices
- Prompt versioning and evaluation
- User feedback integration
- Performance monitoring
- Cost optimization
- Output validation
- Rate limiting and quota management
- Error handling and recovery
- Integration testing
- Monitoring and alerting setup
## Regulatory Compliance
- Successfully demonstrated in Singapore government's regulatory sandbox
- Compliance with data protection requirements
- Safe handling of PII and sensitive data
- Integration with attribute-based access control
- Support for dynamic data masking (a toy, tag-driven example follows)
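As a toy illustration of how generated tags could drive masking downstream, with assumed tag names and masking rules that are not Grab's actual policy:

```python
# Toy tag-driven masking rules; both the tags and the rules are assumptions.
MASKING_RULES = {
    "PII_NAME": lambda v: v[0] + "***" if v else v,
    "PII_PHONE": lambda v: "***" + v[-4:] if len(v) >= 4 else "***",
    "PII_EMAIL": lambda v: "***@" + v.split("@")[-1] if "@" in v else "***",
}

def mask_row(row: dict[str, str], column_tags: dict[str, str]) -> dict[str, str]:
    """Apply masking to columns whose tag has a rule; leave other columns untouched."""
    return {
        col: MASKING_RULES.get(column_tags.get(col, ""), lambda v: v)(val)
        for col, val in row.items()
    }

row = {"passenger_name": "Alice Tan", "pickup_lat": "1.3521"}
tags = {"passenger_name": "PII_NAME", "pickup_lat": "GEOLOCATION"}
print(mask_row(row, tags))  # -> {'passenger_name': 'A***', 'pickup_lat': '1.3521'}
```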