LLM-Powered Data Classification at Grab
Background and Problem Statement
Grab, a major Southeast Asian super-app platform, faced significant challenges in managing and classifying data across its petabyte-scale data infrastructure:
- Manual classification of data sensitivity was inefficient and inconsistent
- Half of all schemas were marked at the highest sensitivity tier (Tier 1) because manual classifiers defaulted to the most conservative label
- Table-level classification was not feasible manually due to data volume and velocity
- Inconsistent interpretation of data classification policies among developers
- Initial third-party classification tools had limitations in customization and accuracy
Technical Solution Architecture
Orchestration Service (Gemini)
- Built a dedicated orchestration service for metadata generation
- Key components:
  - A message queue that aggregates incoming classification requests into mini-batches
  - A classification engine that calls the LLM under service-level rate limits
  - Kafka-based publication of predicted tags back to the owning data platforms
  - A user verification workflow that feeds corrections back into the system
LLM Implementation Details
- Used GPT-3.5, working within its practical constraints:
  - Provider rate limits and quotas, absorbed through service-level rate limiting and mini-batching (detailed under Technical Challenges below)
  - Free-form text output, which had to be constrained to structured JSON through the prompt
- Prompt engineering techniques (illustrated below):
  - Clear task instructions, including a definition for each sensitivity tag
  - A JSON output schema specified directly in the prompt
  - Versioned prompts so that changes can be evaluated against user verification results
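The exact prompt Grab uses is not reproduced in these notes, so the sketch below is a hedged illustration of the technique: tag definitions and a JSON output schema embedded in a single GPT-3.5 request. The tag taxonomy, wording, and `build_prompt` helper are assumptions, not Grab's actual prompt.

```python
# Illustrative prompt assembly for LLM-based data classification.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical tag taxonomy; Grab's real definitions are not public here.
TAG_DEFINITIONS = {
    "PII": "Directly identifies a person (name, phone, national ID).",
    "Tier 1": "Highest sensitivity; regulatory impact if leaked.",
    "Tier 2": "Internal business data with limited exposure risk.",
}

def build_prompt(table_name: str, columns: list[str]) -> str:
    tag_lines = "\n".join(f"- {tag}: {desc}" for tag, desc in TAG_DEFINITIONS.items())
    return (
        "You are a data classification assistant.\n"
        f"Tag definitions:\n{tag_lines}\n\n"
        f"Classify each column of table `{table_name}`.\n"
        "Respond with JSON only, matching this schema exactly:\n"
        '{"table_tag": "<tag>", "columns": [{"name": "<col>", "tag": "<tag>"}]}\n\n'
        f"Columns: {', '.join(columns)}"
    )

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,  # deterministic output helps schema adherence
    messages=[{"role": "user",
               "content": build_prompt("bookings", ["passenger_name", "fare_amount"])}],
)
print(response.choices[0].message.content)
```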
System Integration and Workflow
- Integration with multiple internal data platforms through a Kafka-based event architecture
- Automated workflow (sketched below):
  - A data platform emits an event when a new table is created or a schema changes
  - Gemini aggregates the requests, classifies them with the LLM, and publishes the predicted tags
  - Users review the predictions, and their corrections feed back into the system
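A minimal sketch of that workflow, assuming a Kafka deployment with illustrative topic names and payload fields; `classify_with_llm` stands in for the prompt-and-parse logic shown above.

```python
# Consume schema-change events, classify via the LLM, publish predicted tags.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "metadata.schema-changes",           # hypothetical request topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)

for event in consumer:
    table = event.value  # e.g. {"table": "bookings", "columns": [...]}
    tags = classify_with_llm(table)  # hypothetical wrapper around the prompt above
    producer.send("metadata.predicted-tags", {"table": table["table"], "tags": tags})
```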
Performance and Results
System Performance
- Classification keeps pace with schema creation and change at a volume and velocity that manual review could not match
Business Impact
- Time savings: developers no longer classify each new table by hand
- Cost efficiency: automated LLM classification replaces sustained manual effort
Production Monitoring and Quality Control
- Weekly user verification process
- Plans to remove the verification mandate once an accuracy threshold is reached
- Building analytical pipelines for prompt performance metrics (a sketch follows this list)
- Tracking version control for prompts
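A small illustration of what such an analytical pipeline might compute: per-prompt-version accuracy from the weekly user verification records. The record fields and version labels here are assumptions.

```python
# Compare prompt versions by how often users confirm the predicted tag.
from collections import defaultdict

verifications = [
    {"prompt_version": "v3", "predicted": "Tier 1", "verified": "Tier 1"},
    {"prompt_version": "v3", "predicted": "Tier 1", "verified": "Tier 2"},
    {"prompt_version": "v4", "predicted": "Tier 2", "verified": "Tier 2"},
]

hits, totals = defaultdict(int), defaultdict(int)
for rec in verifications:
    totals[rec["prompt_version"]] += 1
    hits[rec["prompt_version"]] += rec["predicted"] == rec["verified"]

for version in sorted(totals):
    print(f"{version}: {hits[version] / totals[version]:.0%} accuracy "
          f"({totals[version]} verified tags)")
```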
Technical Challenges and Solutions
Rate Limiting and Scaling
- Implemented service-level rate limiting
- Message queue for request aggregation
- Mini-batch processing for optimal throughput (see the sketch below)
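A hedged sketch of how mini-batching and rate limiting can fit together: requests accumulate in a queue, a worker drains them into batches, and calls are spaced to stay under the provider quota. The batch size, request budget, and `classify_with_llm` helper are illustrative, not Grab's actual values.

```python
# Aggregate queued classification requests into rate-limited mini-batches.
import time
from queue import Queue, Empty

BATCH_SIZE = 5            # tables per LLM call (assumption)
REQUESTS_PER_MINUTE = 60  # provider quota (assumption)

request_queue: Queue = Queue()

def next_batch(max_wait: float = 2.0) -> list:
    """Drain up to BATCH_SIZE requests, waiting briefly so batches fill up."""
    batch, deadline = [], time.monotonic() + max_wait
    while len(batch) < BATCH_SIZE and time.monotonic() < deadline:
        try:
            batch.append(request_queue.get(timeout=0.1))
        except Empty:
            continue
    return batch

def run_worker() -> None:
    interval = 60.0 / REQUESTS_PER_MINUTE  # spacing that respects the quota
    while True:
        batch = next_batch()
        if batch:
            classify_with_llm(batch)  # hypothetical helper; one call per batch
            time.sleep(interval)
```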
Output Format Control
- Structured JSON output format
- Schema enforcement in prompts
- Error handling for malformed outputs (see the sketch below)
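One possible shape for that error handling, assuming the JSON schema from the earlier prompt sketch: parse the reply, check the expected keys, and retry once with a corrective instruction. `call_llm` is a hypothetical wrapper, and the retry strategy is illustrative.

```python
# Validate LLM output against the expected JSON shape, retrying on failure.
import json

REQUIRED_KEYS = {"table_tag", "columns"}

def parse_classification(raw: str) -> dict | None:
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict) or not REQUIRED_KEYS <= parsed.keys():
        return None
    return parsed

def classify_with_retry(prompt: str, max_attempts: int = 2) -> dict:
    for _ in range(max_attempts):
        reply = call_llm(prompt)  # hypothetical wrapper around the chat API
        result = parse_classification(reply)
        if result is not None:
            return result
        prompt += "\nYour previous reply was not valid JSON. Respond with JSON only."
    raise ValueError("LLM returned malformed output after retries")
```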
Integration Challenges
- Multiple data platform integration
- Kafka-based event architecture
- User feedback loop implementation (assumed event shapes are sketched below)
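The event contracts below are assumptions about what such an integration might carry, written as typed payloads to make the request, prediction, and feedback stages concrete; field names are illustrative.

```python
# Assumed shapes for the Kafka events linking platforms, Gemini, and users.
from typing import TypedDict

class ClassificationRequest(TypedDict):
    platform: str         # which data platform emitted the event
    table: str
    columns: list[str]

class PredictedTags(TypedDict):
    table: str
    tags: dict[str, str]  # column name -> predicted sensitivity tag

class UserFeedback(TypedDict):
    table: str
    corrections: dict[str, str]  # column name -> user-verified tag
```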
Future Development Plans
Technical Improvements
- Sample data integration in prompts
- Confidence level output implementation
- Automated verification based on confidence scores (sketched after this list)
- Analytical pipeline development for prompt evaluation
- Version control for prompts
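A minimal sketch of the planned confidence-based routing: predictions above a threshold skip human review, the rest go to the existing verification queue. The threshold value is an assumption.

```python
# Route predicted tags based on the model's reported confidence.
AUTO_VERIFY_THRESHOLD = 0.9  # illustrative cut-off, not a published value

def route_prediction(tag: str, confidence: float) -> str:
    """Decide whether a predicted tag needs human verification."""
    if confidence >= AUTO_VERIFY_THRESHOLD:
        return "auto_verified"
    return "pending_user_review"

print(route_prediction("Tier 1", 0.95))  # -> auto_verified
print(route_prediction("Tier 2", 0.60))  # -> pending_user_review
```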
Scale and Expansion
- Plans to extend to more data platforms
- Development of downstream applications
- Integration with security and data discovery systems
Production Best Practices
- Prompt versioning and evaluation (a minimal registry is sketched below)
- User feedback integration
- Performance monitoring
- Cost optimization
- Output validation
- Rate limiting and quota management
- Error handling and recovery
- Integration testing
- Monitoring and alerting setup
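As one concrete example of the first practice, a minimal in-memory prompt registry; a production system would persist versions and tie them to the evaluation metrics sketched earlier. Names and wording are illustrative.

```python
# Keep every prompt version addressable so calls can be tagged and compared.
PROMPT_REGISTRY = {
    "classification/v3": "Classify each column... (earlier wording)",
    "classification/v4": "You are a data classification assistant... (current)",
}

def get_prompt(name: str, version: str) -> str:
    key = f"{name}/{version}"
    if key not in PROMPT_REGISTRY:
        raise KeyError(f"Unknown prompt {key}")
    return PROMPT_REGISTRY[key]

# Tagging each LLM call with its prompt version lets the accuracy pipeline
# shown earlier compare versions against user verification results.
```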
Regulatory Compliance
- Successfully demonstrated in Singapore government's regulatory sandbox
- Compliance with data protection requirements
- Safe handling of PII and sensitive data
- Integration with attribute-based access control
- Support for dynamic data masking (illustrated below)
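A toy illustration of tag-driven masking, with a user attribute check standing in for attribute-based access control; the `pii_reader` attribute and the masking rule are assumptions.

```python
# Mask sensitive columns at read time based on their predicted tags.
def mask_row(row: dict, column_tags: dict, user_attrs: set) -> dict:
    """Mask Tier 1 / PII columns unless the user holds the right attribute."""
    masked = {}
    for col, value in row.items():
        tag = column_tags.get(col, "Tier 2")
        if tag in {"Tier 1", "PII"} and "pii_reader" not in user_attrs:
            masked[col] = "***"
        else:
            masked[col] = value
    return masked

row = {"passenger_name": "Alice Tan", "fare_amount": 12.5}
tags = {"passenger_name": "PII", "fare_amount": "Tier 2"}
print(mask_row(row, tags, user_attrs=set()))           # name masked
print(mask_row(row, tags, user_attrs={"pii_reader"}))  # full access
```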