ICE/NYSE developed a text-to-SQL application using structured RAG to enable business users to query financial data without needing SQL knowledge. The system leverages Databricks' Mosaic AI stack including Unity Catalog, Vector Search, Foundation Model APIs, and Model Serving. They implemented comprehensive evaluation methods using both syntactic and execution matching, achieving 77% syntactic accuracy and 96% execution match across approximately 50 queries. The system includes continuous improvement through feedback loops and few-shot learning from incorrect queries.
ICE/NYSE, one of the world's largest financial exchange operators, successfully implemented a production text-to-SQL system using structured RAG (Retrieval Augmented Generation) to democratize data access for business users. This case study demonstrates a comprehensive approach to deploying LLMs in production, with particular attention to evaluation, monitoring, and continuous improvement.
### System Architecture and Implementation
The system architecture leverages multiple components of the Databricks Mosaic AI stack:
* Unity Catalog for data storage and governance
* Vector Search for efficient metadata indexing and retrieval
* Foundation Model APIs for LLM access
* Model Serving for deployment
The text-to-SQL system is built with a modular architecture that allows for easy configuration and extension. The core components include:
**Metadata Management:**
The system maintains two critical metadata tables:
* table_definitions: Stores table metadata including column names, types, and descriptions
* sample_queries: Contains question-SQL pairs for few-shot learning
**Context Retrieval:**
The system uses two types of retrievers:
* ConfigRetriever: Reads context from YAML configuration for quick testing
* VectorSearchRetriever: Performs similarity search on indexed metadata for production use
**Prompt Engineering:**
The system constructs prompts by combining:
* User question
* Retrieved table DDLs with column descriptions
* Sample data from relevant tables
* Example question-SQL pairs
### Evaluation Framework
A particularly noteworthy aspect of this implementation is its robust evaluation framework, which combines multiple approaches:
**Syntactic Matching:**
* Evaluates SQL component accuracy using F1 scores
* Implements exact matching for complete query validation
* Adapts the Spider benchmark evaluation methodology
* Handles SQL query variations through preprocessing
**Execution Matching:**
* Compares actual query results
* Validates row counts, content, and column types
* Provides functional equivalence verification
### Continuous Improvement Pipeline
The system implements a sophisticated feedback loop:
1. All queries and responses are logged in Inference Tables
2. Ground truth SQL statements are collected through Label Studio
3. Automated evaluation identifies incorrect queries
4. Failed queries are added to the sample_queries table for few-shot learning
5. The system learns from mistakes and improves over time
### Production Deployment
The deployment strategy focuses on maintainability and scalability:
* Implemented as a Python library with modular components
* Configuration managed through YAML files
* Deployed as a Python wheel file in Unity Catalog
* Served through Model Serving endpoints
### Performance and Results
The system achieved impressive results:
* 77% syntactic accuracy
* 96% execution match rate
* Implementation completed in just 5 weeks
* Tested across approximately 50 queries of varying complexity
### Key Learnings and Best Practices
Several important insights emerged from this implementation:
**Data Quality and Documentation:**
* Comprehensive table and column metadata is crucial
* Well-documented sample queries improve accuracy
* Regular updates to metadata improve context relevance
**Evaluation and Monitoring:**
* Multiple evaluation metrics provide better insight
* Automated evaluation enables continuous improvement
* Regular monitoring of model performance is essential
**System Design:**
* Modular architecture enables easy updates
* Configuration-driven approach increases flexibility
* Vector search improves context retrieval efficiency
### Technical Challenges and Solutions
The team addressed several technical challenges:
**Query Standardization:**
* Implemented preprocessing to handle SQL variations
* Standardized table aliases and column references
* Removed unsupported SQL features
**Context Management:**
* Developed efficient indexing for metadata
* Implemented dynamic context retrieval
* Balanced context relevance and prompt size
**Production Deployment:**
* Containerized application for consistent deployment
* Implemented monitoring and logging
* Established evaluation pipelines
### Impact and Business Value
The implementation delivered significant business value:
* Enabled non-technical users to access data directly
* Reduced time to insight for business analysts
* Improved data accessibility across the organization
* Maintained high accuracy while simplifying data access
### Future Improvements
The team identified several areas for future enhancement:
* Expanding the range of supported SQL patterns
* Improving context retrieval accuracy
* Enhancing evaluation metrics
* Scaling the system to handle more complex queries
This case study demonstrates a well-architected approach to implementing LLMs in production, with particular attention to evaluation and continuous improvement. The success of this implementation shows how careful attention to metadata, evaluation, and feedback loops can create a robust and reliable AI system in a production environment.
Start your new ML Project today with ZenML Pro
Join 1,000s of members already deploying models with ZenML.