LLM-Powered Data Classification at Grab
Background and Problem Statement
Grab, a major Southeast Asian super-app platform, faced significant challenges in managing and classifying data across its petabyte-scale data infrastructure:
- Manual classification of data sensitivity was inefficient and inconsistent
- Half of all schemas were marked at the highest sensitivity tier (Tier 1) because manual classifiers defaulted to the most conservative label
- Table-level classification was not feasible manually due to data volume and velocity
- Inconsistent interpretation of data classification policies among developers
- Initial third-party classification tools had limitations in customization and accuracy
Technical Solution Architecture
Orchestration Service (Gemini)
- Built a dedicated orchestration service for metadata generation
- Key components:
  - A message queue that aggregates incoming classification requests into mini-batches
  - A classification engine that calls the LLM under service-level rate limits
  - Kafka-based publication of predicted tags back to the owning data platforms
  - A user verification workflow that feeds corrections back into the system
LLM Implementation Details
- Used GPT-3.5, working within its practical constraints:
  - Provider rate limits and quotas, absorbed through service-level rate limiting and mini-batching (detailed under Technical Challenges below)
  - Free-form text output, which had to be constrained to structured JSON through the prompt
- Prompt engineering techniques (illustrated below):
  - Clear task instructions, including a definition for each sensitivity tag
  - A JSON output schema specified directly in the prompt
  - Versioned prompts so that changes can be evaluated against user verification results
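The exact prompt Grab uses is not reproduced in these notes, so the sketch below is a hedged illustration of the technique: tag definitions and a JSON output schema embedded in a single GPT-3.5 request. The tag taxonomy, wording, and `build_prompt` helper are assumptions, not Grab's actual prompt.

```python
# Illustrative prompt assembly for LLM-based data classification.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical tag taxonomy; Grab's real definitions are not public here.
TAG_DEFINITIONS = {
    "PII": "Directly identifies a person (name, phone, national ID).",
    "Tier 1": "Highest sensitivity; regulatory impact if leaked.",
    "Tier 2": "Internal business data with limited exposure risk.",
}

def build_prompt(table_name: str, columns: list[str]) -> str:
    tag_lines = "\n".join(f"- {tag}: {desc}" for tag, desc in TAG_DEFINITIONS.items())
    return (
        "You are a data classification assistant.\n"
        f"Tag definitions:\n{tag_lines}\n\n"
        f"Classify each column of table `{table_name}`.\n"
        "Respond with JSON only, matching this schema exactly:\n"
        '{"table_tag": "<tag>", "columns": [{"name": "<col>", "tag": "<tag>"}]}\n\n'
        f"Columns: {', '.join(columns)}"
    )

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,  # deterministic output helps schema adherence
    messages=[{"role": "user",
               "content": build_prompt("bookings", ["passenger_name", "fare_amount"])}],
)
print(response.choices[0].message.content)
```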
System Integration and Workflow
- Integration with multiple internal data platforms through a Kafka-based event architecture
- Automated workflow (sketched below):
  - A data platform emits an event when a new table is created or a schema changes
  - Gemini aggregates the requests, classifies them with the LLM, and publishes the predicted tags
  - Users review the predictions, and their corrections feed back into the system
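A minimal sketch of that workflow, assuming a Kafka deployment with illustrative topic names and payload fields; `classify_with_llm` stands in for the prompt-and-parse logic shown above.

```python
# Consume schema-change events, classify via the LLM, publish predicted tags.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "metadata.schema-changes",           # hypothetical request topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)

for event in consumer:
    table = event.value  # e.g. {"table": "bookings", "columns": [...]}
    tags = classify_with_llm(table)  # hypothetical wrapper around the prompt above
    producer.send("metadata.predicted-tags", {"table": table["table"], "tags": tags})
```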
Performance and Results
System Performance
- Classification keeps pace with schema creation and change at a volume and velocity that manual review could not match
Business Impact
- Time savings: developers no longer classify each new table by hand
- Cost efficiency: automated LLM classification replaces sustained manual effort
Production Monitoring and Quality Control
- Weekly user verification process
- Plans to remove the verification mandate once an accuracy threshold is reached
- Building analytical pipelines for prompt performance metrics (a sketch follows this list)
- Tracking version control for prompts
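A small illustration of what such an analytical pipeline might compute: per-prompt-version accuracy from the weekly user verification records. The record fields and version labels here are assumptions.

```python
# Compare prompt versions by how often users confirm the predicted tag.
from collections import defaultdict

verifications = [
    {"prompt_version": "v3", "predicted": "Tier 1", "verified": "Tier 1"},
    {"prompt_version": "v3", "predicted": "Tier 1", "verified": "Tier 2"},
    {"prompt_version": "v4", "predicted": "Tier 2", "verified": "Tier 2"},
]

hits, totals = defaultdict(int), defaultdict(int)
for rec in verifications:
    totals[rec["prompt_version"]] += 1
    hits[rec["prompt_version"]] += rec["predicted"] == rec["verified"]

for version in sorted(totals):
    print(f"{version}: {hits[version] / totals[version]:.0%} accuracy "
          f"({totals[version]} verified tags)")
```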
Technical Challenges and Solutions
Rate Limiting and Scaling
- Implemented service-level rate limiting
- Message queue for request aggregation
- Mini-batch processing for optimal throughput (see the sketch below)
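A hedged sketch of how mini-batching and rate limiting can fit together: requests accumulate in a queue, a worker drains them into batches, and calls are spaced to stay under the provider quota. The batch size, request budget, and `classify_with_llm` helper are illustrative, not Grab's actual values.

```python
# Aggregate queued classification requests into rate-limited mini-batches.
import time
from queue import Queue, Empty

BATCH_SIZE = 5            # tables per LLM call (assumption)
REQUESTS_PER_MINUTE = 60  # provider quota (assumption)

request_queue: Queue = Queue()

def next_batch(max_wait: float = 2.0) -> list:
    """Drain up to BATCH_SIZE requests, waiting briefly so batches fill up."""
    batch, deadline = [], time.monotonic() + max_wait
    while len(batch) < BATCH_SIZE and time.monotonic() < deadline:
        try:
            batch.append(request_queue.get(timeout=0.1))
        except Empty:
            continue
    return batch

def run_worker() -> None:
    interval = 60.0 / REQUESTS_PER_MINUTE  # spacing that respects the quota
    while True:
        batch = next_batch()
        if batch:
            classify_with_llm(batch)  # hypothetical helper; one call per batch
            time.sleep(interval)
```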
Output Format Control
- Structured JSON output format
- Schema enforcement in prompts
- Error handling for malformed outputs (see the sketch below)
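One possible shape for that error handling, assuming the JSON schema from the earlier prompt sketch: parse the reply, check the expected keys, and retry once with a corrective instruction. `call_llm` is a hypothetical wrapper, and the retry strategy is illustrative.

```python
# Validate LLM output against the expected JSON shape, retrying on failure.
import json

REQUIRED_KEYS = {"table_tag", "columns"}

def parse_classification(raw: str) -> dict | None:
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict) or not REQUIRED_KEYS <= parsed.keys():
        return None
    return parsed

def classify_with_retry(prompt: str, max_attempts: int = 2) -> dict:
    for _ in range(max_attempts):
        reply = call_llm(prompt)  # hypothetical wrapper around the chat API
        result = parse_classification(reply)
        if result is not None:
            return result
        prompt += "\nYour previous reply was not valid JSON. Respond with JSON only."
    raise ValueError("LLM returned malformed output after retries")
```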
Integration Challenges
- Multiple data platform integration
- Kafka-based event architecture
- User feedback loop implementation (assumed event shapes are sketched below)
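The event contracts below are assumptions about what such an integration might carry, written as typed payloads to make the request, prediction, and feedback stages concrete; field names are illustrative.

```python
# Assumed shapes for the Kafka events linking platforms, Gemini, and users.
from typing import TypedDict

class ClassificationRequest(TypedDict):
    platform: str         # which data platform emitted the event
    table: str
    columns: list[str]

class PredictedTags(TypedDict):
    table: str
    tags: dict[str, str]  # column name -> predicted sensitivity tag

class UserFeedback(TypedDict):
    table: str
    corrections: dict[str, str]  # column name -> user-verified tag
```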
Future Development Plans
Technical Improvements
- Sample data integration in prompts
- Confidence level output implementation
- Automated verification based on confidence scores (sketched after this list)
- Analytical pipeline development for prompt evaluation
- Version control for prompts
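A minimal sketch of the planned confidence-based routing: predictions above a threshold skip human review, the rest go to the existing verification queue. The threshold value is an assumption.

```python
# Route predicted tags based on the model's reported confidence.
AUTO_VERIFY_THRESHOLD = 0.9  # illustrative cut-off, not a published value

def route_prediction(tag: str, confidence: float) -> str:
    """Decide whether a predicted tag needs human verification."""
    if confidence >= AUTO_VERIFY_THRESHOLD:
        return "auto_verified"
    return "pending_user_review"

print(route_prediction("Tier 1", 0.95))  # -> auto_verified
print(route_prediction("Tier 2", 0.60))  # -> pending_user_review
```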
Scale and Expansion
- Plans to extend to more data platforms
- Development of downstream applications
- Integration with security and data discovery systems
Production Best Practices
- Prompt versioning and evaluation (a minimal registry is sketched below)
- User feedback integration
- Performance monitoring
- Cost optimization
- Output validation
- Rate limiting and quota management
- Error handling and recovery
- Integration testing
- Monitoring and alerting setup
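As one concrete example of the first practice, a minimal in-memory prompt registry; a production system would persist versions and tie them to the evaluation metrics sketched earlier. Names and wording are illustrative.

```python
# Keep every prompt version addressable so calls can be tagged and compared.
PROMPT_REGISTRY = {
    "classification/v3": "Classify each column... (earlier wording)",
    "classification/v4": "You are a data classification assistant... (current)",
}

def get_prompt(name: str, version: str) -> str:
    key = f"{name}/{version}"
    if key not in PROMPT_REGISTRY:
        raise KeyError(f"Unknown prompt {key}")
    return PROMPT_REGISTRY[key]

# Tagging each LLM call with its prompt version lets the accuracy pipeline
# shown earlier compare versions against user verification results.
```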
Regulatory Compliance
- Successfully demonstrated in Singapore government's regulatory sandbox
- Compliance with data protection requirements
- Safe handling of PII and sensitive data
- Integration with attribute-based access control
- Support for dynamic data masking (illustrated below)
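A toy illustration of tag-driven masking, with a user attribute check standing in for attribute-based access control; the `pii_reader` attribute and the masking rule are assumptions.

```python
# Mask sensitive columns at read time based on their predicted tags.
def mask_row(row: dict, column_tags: dict, user_attrs: set) -> dict:
    """Mask Tier 1 / PII columns unless the user holds the right attribute."""
    masked = {}
    for col, value in row.items():
        tag = column_tags.get(col, "Tier 2")
        if tag in {"Tier 1", "PII"} and "pii_reader" not in user_attrs:
            masked[col] = "***"
        else:
            masked[col] = value
    return masked

row = {"passenger_name": "Alice Tan", "fare_amount": 12.5}
tags = {"passenger_name": "PII", "fare_amount": "Tier 2"}
print(mask_row(row, tags, user_attrs=set()))           # name masked
print(mask_row(row, tags, user_attrs={"pii_reader"}))  # full access
```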