Company: Grab
Title: LLM-Powered Data Classification System for Enterprise-Scale Metadata Generation
Industry: Tech
Year: 2023

Summary (short)

Grab developed an automated data classification system that uses LLMs to replace manual tagging of sensitive data across its petabyte-scale data infrastructure. The team built an orchestration service called Gemini that integrates GPT-3.5 to classify database columns and generate metadata tags, significantly reducing manual effort in data governance. The system processed over 20,000 data entities within a month of deployment, with 80% user satisfaction and minimal need for tag corrections.

LLM-Powered Data Classification at Grab

Background and Problem Statement

Grab, a major Southeast Asian super-app platform, faced significant challenges in managing and classifying its petabyte-scale data infrastructure:

  • Manual classification of data sensitivity was inefficient and inconsistent
  • Half of all schemas were marked at the highest sensitivity level (Tier 1) because manual classification defaulted to conservative labels
  • Table-level classification was not feasible manually due to data volume and velocity
  • Inconsistent interpretation of data classification policies among developers
  • Initial third-party classification tools had limitations in customization and accuracy

Technical Solution Architecture

Orchestration Service (Gemini)

  • Built a dedicated orchestration service for metadata generation
  • Key components:

LLM Implementation Details

  • Used GPT-3.5 with specific technical constraints:
  • Prompt Engineering Techniques:
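Grab has not published its exact prompts, but the core technique — a fixed tag taxonomy plus a strict output-format instruction — can be sketched as follows. The tag names and function below are illustrative assumptions, not the production prompt:

```python
# Illustrative sketch: the taxonomy and wording are assumptions,
# not Grab's actual prompt.
TAG_TAXONOMY = ["PII.Name", "PII.Phone", "PII.Email", "Geo.Location", "None"]

def build_classification_prompt(table_name: str, columns: list[str]) -> str:
    """Build a prompt asking the LLM to tag each column from a fixed taxonomy."""
    tags = ", ".join(TAG_TAXONOMY)
    column_list = "\n".join(f"- {c}" for c in columns)
    return (
        "You are a data governance assistant. Classify each column of "
        f"table '{table_name}' with exactly one tag from this list: {tags}.\n"
        "Respond with a JSON object mapping column name to tag, and nothing else.\n"
        f"Columns:\n{column_list}"
    )

prompt = build_classification_prompt("bookings", ["passenger_phone", "fare_amount"])
```

Constraining the reply to a single JSON object is what makes the downstream parsing and validation step (see Output Format Control below) tractable.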

System Integration and Workflow

  • Integration with:
  • Automated workflow:

Performance and Results

System Performance

  • Processing capacity:

Business Impact

  • Time savings:
  • Cost efficiency:

Production Monitoring and Quality Control

  • Weekly user verification process
  • Plans to remove the verification mandate once an accuracy threshold is reached
  • Building analytical pipelines for prompt performance metrics
  • Tracking version control for prompts

Technical Challenges and Solutions

Rate Limiting and Scaling

  • Implemented service-level rate limiting
  • Message queue for request aggregation
  • Mini-batch processing for optimal throughput
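The aggregation pattern above can be sketched as follows; the batch size is an illustrative assumption, not Grab's production setting:

```python
from collections import deque

BATCH_SIZE = 20  # assumed number of columns grouped into one LLM request

def drain_in_batches(queue: deque, batch_size: int = BATCH_SIZE):
    """Yield mini-batches of queued classification requests, oldest first."""
    while queue:
        yield [queue.popleft() for _ in range(min(batch_size, len(queue)))]

# 45 queued columns drain as batches of 20, 20, and 5.
pending = deque(f"col_{i}" for i in range(45))
batches = list(drain_in_batches(pending))
```

Grouping requests this way keeps throughput high while the service-level limiter only has to meter whole batches rather than individual columns.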

Output Format Control

  • Structured JSON output format
  • Schema enforcement in prompts
  • Error handling for malformed outputs
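A minimal version of that validation step might look like this; the taxonomy set is a placeholder:

```python
import json

ALLOWED_TAGS = {"PII.Name", "PII.Phone", "PII.Email", "Geo.Location", "None"}  # placeholder taxonomy

def parse_classification(raw: str, columns: list[str]) -> dict:
    """Parse the LLM reply, rejecting malformed JSON or off-schema tags."""
    try:
        result = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"LLM returned non-JSON output: {e}") from e
    if set(result) != set(columns):
        raise ValueError("Reply does not cover exactly the requested columns")
    bad = {c: t for c, t in result.items() if t not in ALLOWED_TAGS}
    if bad:
        raise ValueError(f"Tags outside the allowed taxonomy: {bad}")
    return result
```

Rejected replies can then be retried or routed to human review rather than written into the metadata store.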

Integration Challenges

  • Multiple data platform integration
  • Kafka-based event architecture
  • User feedback loop implementation
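The event-driven flow can be sketched without a live broker; the event shape and handler below are assumptions for illustration, not Grab's actual schema:

```python
import json

classification_queue = []  # stand-in for producing to an internal Kafka topic

def on_platform_event(raw_event: bytes) -> None:
    """Handle a data-platform change event by enqueueing a classification job."""
    event = json.loads(raw_event)
    if event.get("type") in {"TABLE_CREATED", "TABLE_UPDATED"}:
        classification_queue.append({"table": event["table"], "columns": event["columns"]})

on_platform_event(b'{"type": "TABLE_CREATED", "table": "bookings", "columns": ["phone"]}')
```

In production the same handler body would sit inside a Kafka consumer loop, with generated tags published back to a topic for downstream consumers such as the user-feedback UI.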

Future Development Plans

Technical Improvements

  • Sample data integration in prompts
  • Confidence level output implementation
  • Automated verification based on confidence scores
  • Analytical pipeline development for prompt evaluation
  • Version control for prompts
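Once confidence scores are emitted, automated verification reduces to a simple routing rule; the 0.9 threshold here is illustrative, not a published setting:

```python
AUTO_APPLY_THRESHOLD = 0.9  # illustrative cut-off, not Grab's actual value

def route_tag(confidence: float) -> str:
    """Auto-apply high-confidence tags; send the rest to weekly human review."""
    return "auto_apply" if confidence >= AUTO_APPLY_THRESHOLD else "human_review"
```

Tags below the threshold would continue through the existing weekly verification process described above.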

Scale and Expansion

  • Plans to extend to more data platforms
  • Development of downstream applications
  • Integration with security and data discovery systems

Production Best Practices

  • Prompt versioning and evaluation
  • User feedback integration
  • Performance monitoring
  • Cost optimization
  • Output validation
  • Rate limiting and quota management
  • Error handling and recovery
  • Integration testing
  • Monitoring and alerting setup

Regulatory Compliance

  • Successfully demonstrated in the Singapore government's regulatory sandbox
  • Compliance with data protection requirements
  • Safe handling of PII and sensitive data
  • Integration with attribute-based access control
  • Support for dynamic data masking
