Pinterest developed a Text-to-SQL system to help data analysts convert natural language questions into SQL queries. The system evolved through two iterations: first implementing a basic LLM-powered SQL generator integrated into their Querybook tool, then enhancing it with RAG-based table selection to help users identify relevant tables from their vast data warehouse. The implementation showed a 35% improvement in task completion speed for SQL query writing, with first-shot acceptance rates improving from 20% to over 40% as the system matured.
Pinterest's implementation of a Text-to-SQL system provides an instructive case study in the practical challenges of deploying LLMs in a production environment. The case study is particularly valuable because it documents both the initial implementation and a subsequent iteration, showing how real-world usage and limitations informed system improvements.
## System Overview and Evolution
Pinterest built their Text-to-SQL system as part of Querybook, their open-source SQL query tool. The system underwent two major iterations, each addressing different aspects of the production challenges in implementing LLM-based solutions.
### Initial Implementation
The first version focused on the core Text-to-SQL conversion functionality. The architecture demonstrated several important LLMOps considerations:
* Response Streaming: They implemented WebSocket-based streaming to handle the response times typical of LLM operations (tens of seconds), so users watch the query materialize instead of waiting on a blank screen. They used Langchain's partial JSON parsing capabilities to handle structured streaming responses (see the streaming sketch after this list).
* Context Window Management: To handle large table schemas that might exceed LLM context windows, they implemented several practical solutions:
* Created reduced versions of table schemas with only essential elements
* Implemented column pruning based on metadata tags
* Carefully managed prompt construction to maximize available context
* Data Quality Enhancement: They addressed the challenge of LLMs generating syntactically correct but practically incorrect queries by incorporating low-cardinality column values into the table schema information, ensuring generated queries use actual valid values (e.g., 'WEB' instead of 'web' for platform values); the schema sketch after this list covers both this and the column pruning above.
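A minimal sketch of the streaming approach, assuming a FastAPI WebSocket endpoint and langchain_core's `parse_partial_json` helper; the endpoint path, message shapes, and the `stream_llm` stub are illustrative assumptions, not Pinterest's actual code:

```python
import asyncio
from fastapi import FastAPI, WebSocket
from langchain_core.utils.json import parse_partial_json

app = FastAPI()

async def stream_llm(question: str):
    # Stand-in for the real streaming LLM call; yields raw chunks of a
    # JSON response shaped like {"query": "SELECT ..."}.
    for chunk in ['{"query": "SELECT platform', ', COUNT(*) FROM events', ' GROUP BY 1"}']:
        await asyncio.sleep(0)
        yield chunk

@app.websocket("/text2sql")
async def text2sql(ws: WebSocket):
    await ws.accept()
    question = await ws.receive_text()
    buffer = ""
    async for chunk in stream_llm(question):
        buffer += chunk
        try:
            # Tolerates truncated JSON, so the client can render partial SQL
            partial = parse_partial_json(buffer)
        except Exception:
            continue  # not enough text to parse yet
        if isinstance(partial, dict) and partial.get("query"):
            await ws.send_json({"sql_so_far": partial["query"]})
    await ws.send_json({"done": True})
```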
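And a minimal sketch of the schema-reduction ideas, combining tag-based column pruning with low-cardinality value injection; the metadata shapes and tag names are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    type: str
    tags: set[str] = field(default_factory=set)             # e.g. {"core", "deprecated"}
    sample_values: list[str] = field(default_factory=list)  # populated only for low-cardinality columns

def render_schema(table_name: str, columns: list[Column], max_values: int = 10) -> str:
    """Render a compact schema block for the prompt, keeping context usage down."""
    lines = [f"Table: {table_name}"]
    for col in columns:
        if "deprecated" in col.tags:  # column pruning via metadata tags
            continue
        line = f"  {col.name} ({col.type})"
        if 0 < len(col.sample_values) <= max_values:
            # Surface the valid values so the LLM writes 'WEB', not 'web'
            line += "  -- values: " + ", ".join(col.sample_values)
        lines.append(line)
    return "\n".join(lines)

print(render_schema("events", [
    Column("platform", "string", {"core"}, ["WEB", "IOS", "ANDROID"]),
    Column("legacy_id", "string", {"deprecated"}),
]))
```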
### RAG Enhancement
The second iteration added RAG capabilities to assist with table selection, addressing a key user pain point. This enhancement brought embedding and vector search technologies into production:
* Vector Index Creation: They implemented a dual-approach vector index containing both table summarizations and query summarizations, using OpenSearch as the vector store.
* Offline Processing: They created a pipeline for generating and updating table and query embeddings offline, keeping embedding work off the request path as the warehouse evolves (an indexing sketch follows this list).
* Weighted Scoring: They implemented a weighted scoring system that prioritized table summaries over query summaries in the similarity search results.
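A hedged sketch of the offline indexing step, assuming the opensearch-py client and an OpenAI embedding model; the model choice, index name, and document shape are assumptions, not details from the case study:

```python
from openai import OpenAI
from opensearchpy import OpenSearch, helpers

os_client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
oai = OpenAI()

def embed(text: str) -> list[float]:
    # Any embedding model works here; this particular one is an assumption
    resp = oai.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def index_table_summaries(summaries: dict[str, str]) -> None:
    """Upsert one document per table: its summary plus that summary's embedding."""
    actions = (
        {
            "_index": "table_summaries",
            "_id": table,  # stable IDs make re-runs idempotent upserts
            "_source": {"table_name": table, "summary": text, "embedding": embed(text)},
        }
        for table, text in summaries.items()
    )
    helpers.bulk(os_client, actions)
```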
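And a sketch of retrieval with weighted scoring across the two summary indices. The index names, field names, and the 2:1 weighting are illustrative; the only detail taken from the case study is that table summaries are weighted above query summaries:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def search_tables(question_embedding: list[float], k: int = 10) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    # Hit both summary indices, weighting table summaries above query summaries
    for index, weight in [("table_summaries", 2.0), ("query_summaries", 1.0)]:
        resp = client.search(index=index, body={
            "size": k,
            "query": {"knn": {"embedding": {"vector": question_embedding, "k": k}}},
        })
        for hit in resp["hits"]["hits"]:
            table = hit["_source"]["table_name"]
            scores[table] = scores.get(table, 0.0) + weight * hit["_score"]
    # Highest combined score first: the top-k candidate tables for the prompt
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```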
## Evaluation and Monitoring
Pinterest's approach to evaluation showed strong awareness of the limitations of academic benchmarks and the importance of real-world performance metrics:
* They conducted initial evaluations against the Spider dataset but recognized its limitations in representing their actual use case.
* They tracked practical metrics (toy computations follow this list), including:
* First-shot acceptance rate (improved from 20% to 40%)
* Task completion speed (35% improvement with AI assistance)
* Table search hit rates (improved from 40% to 90% with proper metadata incorporation)
* They acknowledged the limitations of their evaluation data, particularly noting when data came from periods before users were aware of new features.
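As a concrete illustration, toy versions of two of these metrics might look like the following; the record shapes are hypothetical, not Pinterest's logging schema:

```python
def first_shot_acceptance_rate(events: list[dict]) -> float:
    # Fraction of generations accepted without any edit or retry
    if not events:
        return 0.0
    return sum(e["accepted_first_try"] for e in events) / len(events)

def table_search_hit_rate(retrieved: dict[str, list[str]],
                          ground_truth: dict[str, set[str]], k: int = 5) -> float:
    # A question counts as a hit when every ground-truth table appears
    # in the top-k retrieved tables for that question
    hits = sum(1 for q, tables in retrieved.items()
               if ground_truth[q] <= set(tables[:k]))
    return hits / len(retrieved)
```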
## Production Considerations
The case study reveals several important production considerations:
* Data Governance: They leveraged their existing table standardization efforts and tiering system to improve system performance.
* Performance Optimization: They implemented various optimizations including:
* Selective indexing of only top-tier tables (a toy tier filter follows this list)
* Careful management of context windows
* Streaming responses for better user experience
* Integration: The system was tightly integrated with existing tools and workflows through Querybook, showing attention to user experience and adoption.
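A toy version of the tier filter behind selective indexing; the tier labels and metadata shape are assumptions, not Pinterest's actual tiering system:

```python
# Only certified, well-documented tables make it into the vector index
TOP_TIERS = {"tier_1", "tier_2"}

def tables_to_index(all_tables: list[dict]) -> list[dict]:
    return [t for t in all_tables if t.get("tier") in TOP_TIERS]
```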
## Lessons and Future Improvements
The case study candidly discusses several areas for improvement, showing mature understanding of production system evolution:
* Need for better metadata enhancement and real-time index updates
* Importance of query validation before execution (one possible gate is sketched after this list)
* Value of systematic user feedback collection
* Recognition that real-world performance differs significantly from benchmark performance
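Since the case study names query validation only as future work, the following is just one possible pre-execution gate, sketched with the open-source sqlglot parser rather than anything Pinterest describes:

```python
import sqlglot
from sqlglot.errors import ParseError

def validate_sql(sql: str, dialect: str = "presto") -> tuple[bool, str]:
    """Cheap syntax check before a query ever reaches the warehouse."""
    try:
        sqlglot.parse_one(sql, read=dialect)  # parse only; nothing is executed
        return True, "ok"
    except ParseError as exc:
        return False, str(exc)

ok, msg = validate_sql("SELECT platform, COUNT(*) FROM events GROUP BY 1")
```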
## Technical Implementation Details
The system makes use of several modern LLMOps technologies and practices:
* WebSocket implementation for streaming responses
* Langchain for JSON parsing and streaming
* OpenSearch for vector storage and similarity search
* Careful prompt engineering for different system components
* Integration of metadata and historical usage data
## Conclusion
This case study provides valuable insights into the practical challenges of implementing LLM-based systems in production. It demonstrates the importance of iterative improvement based on user feedback, the need to go beyond simple accuracy metrics in evaluation, and the value of combining multiple approaches (like RAG and streaming) to create effective production systems. The candid discussion of both successes and areas for improvement makes this particularly valuable for understanding real-world LLMOps challenges.