Grab: LLM-Powered Data Classification System for Large-Scale Enterprise Data Governance

LLMOps Database

Company

Grab

Title

LLM-Powered Data Classification System for Large-Scale Enterprise Data Governance

Industry

Tech

Link

https://engineering.grab.com/llm-powered-data-classification

Year

2023

Summary (short)

Grab implemented an LLM-based data classification system called Gemini to automate the tagging of sensitive data across its petabyte-scale data infrastructure. The system uses GPT-3.5 alongside existing third-party classification services to generate metadata tags for data entities automatically, replacing a manual process that was becoming unsustainable. The solution processed over 20,000 data entities within its first month, needed fewer than one tag correction per table on average, and saves an estimated 360 man-days per year in manual classification effort.

Tags

regulatory_compliance

# LLM-Powered Data Classification at Grab

## Overview

Grab, a leading Southeast Asian super-app platform, implemented a sophisticated LLM-powered data classification system to handle their massive petabyte-scale data infrastructure. The system automates the classification of sensitive data and generation of governance-related metadata, replacing what was previously a manual and time-consuming process.

## Technical Implementation

### System Architecture

- Orchestration Service (Gemini)

### LLM Integration Considerations

- Context Length Management
- Resource Management

### Prompt Engineering Strategy

- Clear Task Definition
- Best Practices Implementation

### Data Processing Pipeline

- Input Processing
- Output Handling

### Monitoring and Evaluation

- Accuracy Metrics
- Performance Tracking
- Cost Efficiency

## Production Deployment

### Integration Points

- Metadata Management Platform
- Production Database Management Platform
- Data Governance Systems
- Attribute-based Access Control (ABAC) Systems

### Scalability Considerations

- Request Batching
- Rate Limiting

### Verification Workflow

- User Review Process

## Challenges and Solutions

### Initial Challenges

- Manual Classification Limitations
- Third-party Tool Issues

### LLM-based Solutions

- Natural Language Interface
- Automated Classification

## Future Developments

### Planned Improvements

- Enhanced Prompt Engineering
- Evaluation Framework

### Scaling Plans

- Platform Integration
- Use Case Expansion

## Impact and Results

### Quantitative Benefits

- Time Savings
- Processing Efficiency

### Qualitative Improvements

- Enhanced Data Governance
- Operational Efficiency

### Business Value

- Cost-effective scaling solution
- Improved data security and governance
- Enhanced operational efficiency
- Foundation for advanced data management features

# LLM Implementation for Enterprise Data Classification at Grab

## Background and Problem Statement

Grab, a leading Southeast Asian super-app platform, faced significant challenges in managing and classifying their petabyte-scale data infrastructure. The company handles countless data entities including database tables and Kafka message schemas, requiring careful classification of sensitive information for proper data governance.

- Initial challenges

## Technical Solution Architecture

### Orchestration Service (Gemini)

The system implements a sophisticated orchestration service with several key components:

- Core Components
- Technical Constraints Management

### LLM Implementation Details

The system employs several sophisticated prompt engineering techniques:

- Prompt Design Principles
- Classification Process

### Integration and Data Flow

The system implements a comprehensive data flow:

- Data Processing Pipeline

## Operational Aspects

### Performance Metrics

The system has demonstrated impressive operational metrics:

- Processing Capacity
- Efficiency Gains

### Quality Control and Monitoring

The implementation includes several quality control mechanisms:

- Verification Workflow

### Infrastructure and Scaling

The system is designed for enterprise-scale operations:

- Technical Infrastructure

## Future Developments

The team has outlined several areas for future enhancement:

- Prompt Engineering Improvements
- Performance Monitoring
- Scale-Out Plans

## Best Practices and Learnings

Key insights from the implementation:

- Prompt Engineering
- System Design
- Integration Considerations

The implementation serves as a model for enterprise-scale LLM operations, demonstrating how careful system design, prompt engineering, and operational considerations can create an effective, production-grade LLM-based classification system.
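The outline above lists context length management and request batching as constraints the orchestration service has to handle, without giving details. The following is a minimal sketch of one way such batching could work, not Grab's actual implementation. The `Column` type, the `estimate_tokens` heuristic (roughly four characters per token), the `max_tokens` budget, and the tag names in the usage note are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Column:
    """A column to classify (hypothetical shape, not Grab's schema)."""
    table: str
    name: str


def estimate_tokens(text: str) -> int:
    # Assumption: a crude heuristic of about 4 characters per token.
    # A real system would use the tokenizer of the model being called.
    return max(1, len(text) // 4)


def batch_columns(columns: Iterable[Column], max_tokens: int) -> Iterator[list[Column]]:
    """Group columns into batches whose estimated size stays within max_tokens."""
    batch: list[Column] = []
    used = 0
    for col in columns:
        cost = estimate_tokens(f"{col.table}.{col.name}")
        if cost > max_tokens:
            raise ValueError(f"{col.table}.{col.name} exceeds the token budget")
        # Close the current batch before it would overflow the budget.
        if batch and used + cost > max_tokens:
            yield batch
            batch, used = [], 0
        batch.append(col)
        used += cost
    if batch:
        yield batch


def build_prompt(batch: list[Column], tags: list[str]) -> str:
    """Render one classification prompt for a batch (illustrative wording)."""
    lines = [
        "Assign exactly one tag to each column below.",
        "Allowed tags: " + ", ".join(tags),
        "Answer as JSON mapping 'table.column' to a tag.",
        "Columns:",
    ]
    lines += [f"- {c.table}.{c.name}" for c in batch]
    return "\n".join(lines)
```

Each batch would then be sent as one request, for example `build_prompt(batch, ["PII", "NONE"])` for every `batch` yielded by `batch_columns(columns, max_tokens=3000)`. Keeping the budget as a parameter lets the same loop serve models with different context lengths, and a rate limiter could be placed around the per-batch call.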
