Uber: DragonCrawl: Uber's Journey to AI-Powered Mobile Testing Using Small Language Models

LLMOps Database

Company

Uber

Title

DragonCrawl: Uber's Journey to AI-Powered Mobile Testing Using Small Language Models

Industry

Automotive

Link

https://www.uber.com/en-GB/blog/generative-ai-for-high-quality-mobile-testing/

Year

2024

Summary (short)

Uber developed DragonCrawl, an AI-powered mobile testing system that uses a small language model (110M parameters) to automate app testing across multiple languages and cities. The system targets two problems in mobile testing at Uber's global scale: high test-maintenance costs and poor scalability. Built on an MPNet-based architecture with a retriever-ranker approach, DragonCrawl achieved 99%+ stability in production, ran successfully in 85 of the 89 cities tested, and adapted to UI changes without manual updates. It blocked ten high-priority bugs from reaching customers and significantly reduced developer maintenance time. It also exhibited human-like problem-solving behaviors, such as retrying failed operations and restarting the app to get past temporary issues.
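To make the retrieval idea in the summary concrete, here is a minimal sketch of scoring candidate UI actions against a test goal by embedding similarity. It is not Uber's implementation: the `embed` function is a toy bag-of-words stand-in for a learned encoder such as MPNet, and the names `rank_actions`, the goal string, and the action descriptions are all hypothetical. It shows only a single similarity-scoring pass, not a full retriever-ranker pipeline.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for a learned text encoder (assumption): word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_actions(goal: str, actions: List[str], top_k: int = 3) -> List[Tuple[str, float]]:
    """Score every candidate action against the goal and return the best top_k."""
    goal_vec = embed(goal)
    scored = [(action, cosine(goal_vec, embed(action))) for action in actions]
    # Stable sort: ties keep their original on-screen order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical textual descriptions of the actions available on a screen.
    screen_actions = [
        "tap button open settings",
        "tap button request ride",
        "tap field enter destination",
    ]
    for action, score in rank_actions("request a ride", screen_actions):
        print(f"{score:.3f}  {action}")
```

With a real encoder, `embed` would return a dense vector (768 dimensions for the model described here) and the rest of the scoring logic would stay the same in spirit.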

Tags

legacy_system_integration

# Notes on Uber's DragonCrawl Implementation

## Company/Use Case Overview

- Problem: Mobile testing at scale (3000+ simultaneous experiments, 50+ languages)
- Challenge: 30-40% of engineer time spent on test maintenance
- Solution: AI-powered testing system mimicking human behavior
- Scale: Global deployment across Uber's mobile applications

## Technical Implementation

### Model Architecture

- Base Model: MPNet (110M parameters)
- Embedding Size: 768 dimensions
- Evaluation Approach: Retrieval-based task
- Performance Metrics:

### Key Features

- Language Model Integration
- Testing Capabilities

## Production Results

### Performance Metrics

- 99%+ stability in production
- Success in 85 out of 89 cities
- Zero maintenance requirements
- Cross-device compatibility
- Blocked 10 high-priority bugs

### Key Achievements

- Automated testing across different:
- Significant reduction in maintenance costs
- Improved bug detection

## Technical Challenges & Solutions

### Hallucination Management

- Small model size (110M parameters) to limit complexity
- Ground truth validation from emulator
- Action verification system
- Loop detection and prevention

### Adversarial Cases

- Detection of non-optimal paths
- Implementation of guardrails
- Solution steering mechanisms
- Backtracking capabilities

### System Integration

- GPS location handling
- Payment system integration
- UI change adaptation
- Error recovery mechanisms

## Notable Behaviors

### Adaptive Problem Solving

- Persistent retry mechanisms
- App restart capability
- Creative navigation solutions
- Goal-oriented persistence

### Error Handling

- Automatic retry on failures
- Context-aware decision making
- Alternative path finding
- System state awareness

## Future Directions

### Planned Improvements

- RAG applications development
- Dragon Foundational Model (DFM)
- Developer toolkit expansion
- Enhanced testing capabilities

### Architectural Evolution

- Smaller dataset utilization
- Improved embedding quality
- Enhanced reward modeling
- Expanded use cases

## Key Learnings

### Success Factors

- Small model advantages
- Focus on specific use cases
- Strong guardrails
- Production-first approach

### Implementation Insights

- Value of small, focused models
- Importance of real-world testing
- Benefits of goal-oriented design
- Balance of automation and control

## Business Impact

- Reduced maintenance costs
- Improved test coverage
- Enhanced bug detection
- Accelerated development cycle
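
The guardrails named in the notes (loop detection and prevention, automatic retry on failures, app restart capability) can be illustrated with a short sketch. This is one plausible shape for such guardrails under stated assumptions, not a description of DragonCrawl's internals: the class `LoopDetector`, the function `run_with_recovery`, the window size, and the retry limits are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Tuple


class LoopDetector:
    """Flags a loop when one (screen, action) pair recurs within a recent window."""

    def __init__(self, window: int = 10, max_repeats: int = 3) -> None:
        # Only the last `window` steps are remembered (assumed policy).
        self.history: Deque[Tuple[str, str]] = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, screen_id: str, action: str) -> bool:
        """Store the step and return True if it now looks like a loop."""
        step = (screen_id, action)
        self.history.append(step)
        return self.history.count(step) >= self.max_repeats


def run_with_recovery(
    step: Callable[[], bool],
    restart_app: Callable[[], None],
    max_retries: int = 2,
) -> bool:
    """Try a step, retry on failure, then restart the app once as a last resort."""
    for _ in range(max_retries + 1):
        if step():
            return True
    restart_app()
    return step()


if __name__ == "__main__":
    detector = LoopDetector(window=5, max_repeats=3)
    for _ in range(3):
        looping = detector.record("home_screen", "tap_back")
    print("loop detected:", looping)
```

A test runner built this way would call `record` after every action and steer toward a different action or backtrack when it returns `True`, while wrapping flaky operations in `run_with_recovery`.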
