Grab developed a scalable LLM-based system called Gemini to automate the classification of sensitive data and PII across its massive data infrastructure. The system replaced manual classification processes with an orchestration service that uses GPT-3.5 to analyze and tag data entities, achieving high accuracy with minimal human verification. The solution processed over 20,000 data entities within a month of deployment, saving an estimated 360 man-days per year while maintaining high classification accuracy.
# LLM-Powered Data Classification at Grab
## Overview and Context
Grab, a leading Southeast Asian super-app platform, implemented an LLM-powered system to automate data classification across its petabyte-scale data infrastructure. The project emerged from a critical need to efficiently manage and protect sensitive data while improving data discovery processes. The system, named Gemini, represents a sophisticated implementation of LLMOps principles in a production environment.
## Technical Implementation
### System Architecture
- Orchestration Service (Gemini)
### LLM Integration Considerations
- Context Length Management
- Resource Management
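Context length management typically means deciding how much of an entity's metadata can be sent to the model in a single request. A minimal sketch of one common approach, batching a wide entity's columns under a fixed token budget, is shown below; the 4-characters-per-token heuristic and the budget value are assumptions for illustration, not Grab's actual settings.

```python
# Illustrative sketch: split a wide entity's columns into batches that fit
# within a fixed context window. Heuristic and budget are assumptions.
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def batch_columns(columns, token_budget=3000):
    """Greedily pack column names into batches under the token budget."""
    batches, current, used = [], [], 0
    for col in columns:
        cost = estimate_tokens(col)
        if current and used + cost > token_budget:
            batches.append(current)
            current, used = [], 0
        current.append(col)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each batch can then be classified in its own request, keeping every call safely inside the model's context window.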
### Prompt Engineering Strategy
- Carefully crafted prompts for data classification tasks
- Key techniques implemented:
- JSON output format enforcement for reliable downstream processing
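The JSON-enforcement technique can be sketched as follows. The tag names, schema, and wording are hypothetical stand-ins for Grab's actual taxonomy; the point is that the prompt pins the model to a machine-parseable shape, and the parser rejects anything that deviates from it.

```python
import json

# Hypothetical tag set; Grab's real taxonomy is not shown in the source.
TAGS = ["PII", "SENSITIVE", "NON_SENSITIVE"]

def build_prompt(entity_name: str, column_names: list) -> str:
    """Assemble a classification prompt that pins output to a JSON schema."""
    schema = '{"column": "<name>", "tag": "<one of %s>"}' % "|".join(TAGS)
    return (
        f"Classify each column of data entity '{entity_name}'.\n"
        f"Columns: {', '.join(column_names)}\n"
        f"Respond ONLY with a JSON array of objects shaped like: {schema}\n"
        "Do not add any prose outside the JSON."
    )

def parse_response(raw: str) -> list:
    """Parse the model's reply, rejecting invalid JSON or unknown tags
    so downstream consumers never see malformed classifications."""
    results = json.loads(raw)
    for item in results:
        if item.get("tag") not in TAGS:
            raise ValueError(f"unknown tag: {item.get('tag')}")
    return results
```

Validating at the parse step means a single malformed reply fails loudly and can be retried, rather than silently corrupting classification records.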
### Production Pipeline Components
- Data Platforms
- Message Queue System
- Verification System
## Operational Aspects
### Performance Metrics
- Processing Capacity
- Accuracy Measurements
### Cost Optimization
- Contrary to common assumptions, the system proved highly cost-effective
- Scalable solution capable of handling increased data entity coverage
### Production Safeguards
- Rate limiting implementation
- Token quota management
- Error handling mechanisms
- Batch processing optimization
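The rate-limiting and token-quota safeguards above can be sketched with a simple per-minute quota gate. The specific limits and class design here are illustrative assumptions, not Grab's configuration.

```python
import time

class QuotaGate:
    """Minimal per-minute gate capping both request count and token spend
    before an LLM call is made. Limit values are illustrative."""

    def __init__(self, tokens_per_min=90_000, requests_per_min=60):
        self.token_budget = tokens_per_min
        self.request_budget = requests_per_min
        self.tokens_used = 0
        self.requests_made = 0
        self.window_start = time.monotonic()

    def _maybe_reset(self):
        # Start a fresh window once 60 seconds have elapsed.
        if time.monotonic() - self.window_start >= 60:
            self.tokens_used = 0
            self.requests_made = 0
            self.window_start = time.monotonic()

    def allow(self, token_cost: int) -> bool:
        """Return True and record usage if the request fits this window."""
        self._maybe_reset()
        if (self.tokens_used + token_cost > self.token_budget
                or self.requests_made + 1 > self.request_budget):
            return False
        self.tokens_used += token_cost
        self.requests_made += 1
        return True
```

A denied request can be queued and retried in the next window, which pairs naturally with the message-queue and batch-processing components described above.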
## Governance and Compliance
### Data Security Measures
- Personally Identifiable Information (PII) detection
- Sensitivity tier classification
- Integration with Attribute-based Access Control (ABAC)
- Dynamic Data Masking implementation
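To illustrate how classification output could drive the ABAC and dynamic-masking integrations, here is a minimal sketch mapping a sensitivity tier to an access and masking policy. The tier names, roles, and masking rules are assumptions for illustration only.

```python
# Hypothetical tier -> policy mapping; names and rules are illustrative.
POLICY_BY_TIER = {
    "TIER_1_PII":       {"mask": "full",    "allowed_roles": {"dpo"}},
    "TIER_2_SENSITIVE": {"mask": "partial", "allowed_roles": {"dpo", "analyst"}},
    "TIER_3_PUBLIC":    {"mask": "none",    "allowed_roles": {"dpo", "analyst", "engineer"}},
}

def can_read(tier: str, role: str) -> bool:
    """ABAC-style check: is this role allowed to read this tier?"""
    policy = POLICY_BY_TIER.get(tier, {"allowed_roles": set()})
    return role in policy["allowed_roles"]

def mask_value(tier: str, value: str) -> str:
    """Apply the tier's masking mode; unknown tiers default to full masking."""
    mode = POLICY_BY_TIER.get(tier, {}).get("mask", "full")
    if mode == "full":
        return "*" * len(value)
    if mode == "partial":
        return value[:2] + "*" * max(0, len(value) - 2)
    return value
```

Defaulting unknown tiers to full masking is a fail-closed choice: an unclassified value is treated as sensitive until proven otherwise.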
### Verification Workflow
- Regular user verification cycles
- Feedback collection system
- Iterative prompt improvement process
- Compliance with regulatory requirements
## Future Developments
### Planned Enhancements
- Sample data integration for improved accuracy
- Confidence level output implementation
- Automated verification threshold system
- Analytical pipeline for prompt performance tracking
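The planned confidence-threshold system could route results roughly as sketched below: classifications above a threshold are auto-approved, the rest go to human review. The threshold value and field names are assumptions, since the feature is described only as a future enhancement.

```python
# Hypothetical routing for the planned confidence-based verification.
AUTO_APPROVE_THRESHOLD = 0.9  # illustrative cutoff, not Grab's value

def route(classifications):
    """Split results into auto-approved and human-review queues."""
    auto, review = [], []
    for c in classifications:
        target = auto if c["confidence"] >= AUTO_APPROVE_THRESHOLD else review
        target.append(c)
    return auto, review
```

Such a split would let verification effort concentrate on low-confidence results, further reducing the manual workload the system was built to eliminate.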
### Scaling Initiatives
- Extension to additional data platforms
- Development of downstream applications
- Security and data discovery integrations
## Technical Impact and Benefits
### Efficiency Gains
- 360 man-days saved annually
- Reduced manual classification effort
- Improved data governance processes
- Enhanced data discovery capabilities
### System Reliability
- High classification accuracy
- Minimal human verification needed
- Robust error handling
- Scalable architecture
## Implementation Learnings
### Best Practices
- Clear prompt engineering principles
- Effective rate limiting strategies
- Robust verification workflows
- Cost-effective scaling approaches
### Technical Challenges Addressed
- Context length limitations
- Token quota management
- Output format standardization
- Integration with existing systems
## Production Monitoring
### Key Metrics
- Processing volume tracking
- Accuracy measurements
- Cost monitoring
- User feedback collection
### Quality Assurance
- Regular verification cycles
- Prompt performance tracking
- User feedback integration
- Continuous improvement process
The implementation demonstrates a sophisticated approach to LLMOps in a production environment, successfully balancing automation, accuracy, and cost-effectiveness while maintaining robust governance standards. The system's success in handling large-scale data classification while reducing manual effort showcases the practical benefits of well-implemented LLM solutions in enterprise environments.