Company
Grab
Title
LLM-Powered Data Classification System for Enterprise-Scale Metadata Generation
Industry
Tech
Year
2023
Summary (short)
Grab developed an automated data classification system that uses LLMs to replace manual tagging of sensitive data across its petabyte-scale data infrastructure. The team built an orchestration service called Gemini that integrates GPT-3.5 to classify database columns and generate metadata tags, significantly reducing the manual effort involved in data governance. Within a month of deployment the system had processed over 20,000 data entities, with 80% user satisfaction and minimal need for tag corrections.
# LLM-Powered Data Classification at Grab

## Background and Problem Statement

Grab, a major Southeast Asian super-app platform, faced significant challenges in managing and classifying its petabyte-scale data infrastructure:

- Manual classification of data sensitivity was inefficient and inconsistent
- Half of all schemas were marked as highest sensitivity (Tier 1) due to conservative classification
- Table-level classification was not feasible manually due to data volume and velocity
- Inconsistent interpretation of data classification policies among developers
- Initial third-party classification tools had limitations in customization and accuracy

## Technical Solution Architecture

### Orchestration Service (Gemini)

- Built a dedicated orchestration service for metadata generation
- Key components:

### LLM Implementation Details

- Used GPT-3.5 with specific technical constraints:
- Prompt engineering techniques:

### System Integration and Workflow

- Integration with:
- Automated workflow:

## Performance and Results

### System Performance

- Processing capacity:

### Business Impact

- Time savings:
- Cost efficiency:

### Production Monitoring and Quality Control

- Weekly user verification process
- Plans to remove the verification mandate once an accuracy threshold is reached
- Building analytical pipelines for prompt performance metrics
- Tracking version control for prompts

## Technical Challenges and Solutions

### Rate Limiting and Scaling

- Implemented service-level rate limiting
- Message queue for request aggregation
- Mini-batch processing for optimal throughput

### Output Format Control

- Structured JSON output format
- Schema enforcement in prompts
- Error handling for malformed outputs

### Integration Challenges

- Multiple data platform integration
- Kafka-based event architecture
- User feedback loop implementation

## Future Development Plans

### Technical Improvements

- Sample data integration in prompts
- Confidence level output implementation
- Automated verification based on confidence scores
- Analytical pipeline development for prompt evaluation
- Version control for prompts

### Scale and Expansion

- Plans to extend to more data platforms
- Development of downstream applications
- Integration with security and data discovery systems

## Production Best Practices

- Prompt versioning and evaluation
- User feedback integration
- Performance monitoring
- Cost optimization
- Output validation
- Rate limiting and quota management
- Error handling and recovery
- Integration testing
- Monitoring and alerting setup

## Regulatory Compliance

- Successfully demonstrated in the Singapore government's regulatory sandbox
- Compliance with data protection requirements
- Safe handling of PII and sensitive data
- Integration with attribute-based access control
- Support for dynamic data masking
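The prompt-engineering and output-format controls described above (schema enforcement inside the prompt, defensive handling of malformed replies) can be sketched as follows. This is a minimal illustration in Python; the tag taxonomy, prompt wording, and function names are assumptions for illustration, not Grab's actual implementation.

```python
import json

# Illustrative sensitivity tags; Grab's actual tag taxonomy is not public.
VALID_TAGS = {"PII", "FINANCIAL", "GEOLOCATION", "NONE"}


def build_classification_prompt(table: str, columns: list[str]) -> str:
    """Build a prompt that asks the model to tag each column and
    enforces a strict JSON output schema inside the prompt itself."""
    schema = '{"column_name": "<name>", "tags": ["<TAG>", ...]}'
    return (
        "You are a data governance assistant. For each column below, "
        f"assign zero or more tags from {sorted(VALID_TAGS - {'NONE'})}; "
        'use ["NONE"] if no sensitive data is present.\n'
        f"Respond ONLY with a JSON array of objects shaped as {schema}.\n"
        f"Table: {table}\n"
        f"Columns: {', '.join(columns)}"
    )


def parse_classification(raw: str, columns: list[str]) -> dict[str, list[str]]:
    """Parse the model reply defensively: malformed JSON, missing columns,
    or unknown tags fall back to an empty tag list for later human review."""
    try:
        items = json.loads(raw)
        parsed = {
            item["column_name"]: [t for t in item.get("tags", []) if t in VALID_TAGS]
            for item in items
            if isinstance(item, dict) and "column_name" in item
        }
    except (json.JSONDecodeError, TypeError):
        parsed = {}
    # Guarantee every requested column appears in the result.
    return {col: parsed.get(col, []) for col in columns}
```

Keeping the parser permissive (empty tags rather than exceptions) fits the case study's workflow, where uncertain results flow into the weekly user verification process instead of failing the pipeline.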
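The rate-limiting and mini-batch aggregation pattern from the scaling section can be sketched similarly. A sliding-window limiter and a simple batching helper stand in for Gemini's service-level rate limiting and message-queue aggregation; the limits and batch sizes here are illustrative, not the production values.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window limiter: allow at most `max_calls` per `window` seconds."""

    def __init__(self, max_calls: int, window: float = 60.0):
        self.max_calls = max_calls
        self.window = window
        self.calls: deque[float] = deque()  # timestamps of accepted calls

    def allow(self, now: float | None = None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False


def make_mini_batches(requests: list[dict], batch_size: int) -> list[list[dict]]:
    """Aggregate queued classification requests into mini-batches so each
    LLM call carries several columns instead of one."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]
```

Batching requests before calling the model is what lets a service stay under provider quotas while keeping throughput high, which matches the "message queue for request aggregation" design noted above.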
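A Kafka-based event flow like the one mentioned under Integration Challenges might publish each classification result as a serialized event for downstream consumers. The event type, field layout, and helper below are hypothetical, not Grab's actual schema.

```python
import json
import time


def tag_event(entity: str, tags: dict[str, list[str]], prompt_version: str) -> bytes:
    """Serialize a classification result as a Kafka-style event payload.
    The event_type string and field names are assumptions for illustration."""
    event = {
        "event_type": "metadata.tags.generated",
        "entity": entity,
        "tags": tags,
        # Carrying the prompt version lets analytical pipelines compare
        # tag quality across prompt revisions, as the case study plans.
        "prompt_version": prompt_version,
        "emitted_at": int(time.time()),
    }
    return json.dumps(event).encode("utf-8")
```

A producer would send this payload to a topic that data platforms and the user feedback loop subscribe to.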
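The planned confidence-based verification (auto-publishing high-confidence tags, routing the rest to the weekly human review) could be routed as in this sketch; the threshold value and function name are assumptions for illustration.

```python
def route_by_confidence(entity: str, tags: list[str], confidence: float,
                        threshold: float = 0.9) -> dict:
    """Auto-publish tags above the confidence threshold; queue everything
    else for human review. The 0.9 default is illustrative, not the case
    study's actual cutoff."""
    action = "auto_publish" if confidence >= threshold else "human_review"
    return {
        "entity": entity,
        "tags": tags,
        "confidence": confidence,
        "action": action,
    }
```

This is the mechanism that would let the team "remove the verification mandate once an accuracy threshold is reached": the threshold can be tightened or relaxed as prompt-evaluation metrics accumulate.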
