DoorDash implemented a RAG-based chatbot system to improve their Dasher support automation, replacing a traditional flow-based system. They developed a comprehensive quality control approach combining an LLM Guardrail for real-time response verification, an LLM Judge for quality monitoring, and an iterative improvement pipeline. The system reduced hallucinations by 90% and severe compliance issues by 99%, while handling thousands of support requests daily and allowing human agents to focus on more complex cases.
This case study from DoorDash provides an excellent example of implementing LLMs in a production environment with a strong focus on quality control and monitoring. The company faced the challenge of improving their support system for delivery contractors (Dashers) who need quick and accurate assistance during their deliveries.
The core system architecture is based on a Retrieval Augmented Generation (RAG) approach that leverages their existing knowledge base articles. What makes this case study particularly interesting from an LLMOps perspective is their comprehensive approach to quality control and monitoring, implementing three main components: the RAG system itself, an LLM Guardrail system, and an LLM Judge for quality evaluation.
The RAG implementation follows a systematic approach (a rough sketch in code follows the list below):
* Conversation summarization to distill the core issue from multi-turn conversations
* Retrieval of similar historical cases and relevant knowledge base articles
* Template-based prompt construction incorporating the context and retrieved information
* Response generation using the LLM
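To make that flow concrete, here is a minimal sketch of how such a pipeline might be wired together. The model name, prompt wording, and the `kb_index`/`case_index` retrieval objects are illustrative assumptions, not DoorDash's actual implementation.

```python
# Illustrative RAG flow for Dasher support; names, prompts, and retrieval
# interfaces are assumptions for this sketch, not DoorDash's actual code.
from openai import OpenAI

client = OpenAI()

def summarize_issue(conversation: list[dict]) -> str:
    """Distill the core issue from a multi-turn Dasher conversation."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in conversation)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Summarize the core support issue:\n\n{transcript}"}],
    )
    return resp.choices[0].message.content

def answer_request(conversation: list[dict], kb_index, case_index) -> str:
    issue = summarize_issue(conversation)
    # Retrieve relevant knowledge base articles and similar historical cases.
    articles = kb_index.search(issue, k=3)
    similar_cases = case_index.search(issue, k=3)
    # Template-based prompt construction incorporating the retrieved context.
    prompt = (
        "You are a Dasher support assistant. Answer using only the context below.\n\n"
        f"Issue summary: {issue}\n\n"
        f"Knowledge base articles:\n{articles}\n\n"
        f"Similar resolved cases:\n{similar_cases}\n\n"
        "Write a concise, policy-compliant reply to the Dasher."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```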
The team identified several critical challenges in their LLM deployment:
* Ensuring response groundedness and relevance
* Maintaining accurate context summarization
* Achieving language consistency in responses
* Ensuring consistency between function calls and response text
* Managing latency issues
Their LLM Guardrail system is particularly noteworthy as an example of practical quality control in production. They initially tested a sophisticated model-based approach but found it too expensive and slow. Instead, they developed a two-tier system (sketched after the list below):
* First tier: A cost-effective shallow check using semantic similarity comparison
* Second tier: An LLM-based evaluator that only activates when the first tier flags potential issues
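A minimal sketch of this two-tier pattern, assuming a sentence-embedding model for the shallow similarity check and a second LLM call as the deeper evaluator; the models and the similarity threshold are illustrative choices, not the values DoorDash uses.

```python
# Two-tier guardrail sketch: cheap semantic-similarity check first,
# LLM-based evaluation only when the first tier flags a potential issue.
# Model names and the 0.75 threshold are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util
from openai import OpenAI

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenAI()

def shallow_groundedness(response: str, context: str, threshold: float = 0.75) -> bool:
    """Tier 1: flag responses whose embedding is far from the retrieved context."""
    emb = embedder.encode([response, context], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

def llm_guardrail(response: str, context: str) -> bool:
    """Tier 2: LLM evaluator for groundedness, coherence, and policy compliance."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": (
            f"Context:\n{context}\n\nResponse:\n{response}\n\n"
            "Is the response grounded in the context, coherent, and policy-compliant? "
            "Answer PASS or FAIL."
        )}],
    ).choices[0].message.content
    return "PASS" in verdict.upper()

def guardrail_check(response: str, context: str) -> bool:
    if shallow_groundedness(response, context):
        return True                          # cheap check passed, skip the LLM call
    return llm_guardrail(response, context)  # escalate to the LLM evaluator
```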
This pragmatic approach to guardrails demonstrates a good balance between quality control and system performance. The guardrail system validates responses for groundedness, coherence, and policy compliance. The implementation successfully reduced hallucinations by 90% and severe compliance issues by 99%.
The quality monitoring system (LLM Judge) evaluates five key aspects (see the sketch after this list):
* Retrieval correctness
* Response accuracy
* Grammar and language accuracy
* Coherence to context
* Relevance to user requests
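One plausible way to implement such a judge is a rubric prompt that scores each aspect; the rubric wording, 1–5 scale, and JSON output format below are assumptions for illustration.

```python
# LLM Judge sketch scoring the five monitored aspects; rubric and scale are illustrative.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Rate the support transcript below on a 1-5 scale for each aspect:
- retrieval_correctness: were the right articles and cases retrieved?
- response_accuracy: is the answer factually correct for the issue?
- grammar_language: is the reply fluent and in the Dasher's language?
- coherence_to_context: does the reply follow from the conversation context?
- relevance: does it address what the Dasher actually asked?
Return a JSON object with those five keys.

Transcript:
{transcript}
"""

def judge_transcript(transcript: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(transcript=transcript)}],
    )
    return json.loads(resp.choices[0].message.content)
```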
Their quality improvement pipeline is comprehensive and includes:
* Knowledge base improvements through systematic review and updates
* Retrieval optimization through query contextualization and embedding model selection
* Prompt engineering improvements following clear principles (illustrated after this list):
  * Breaking down complex prompts
  * Avoiding negative language
  * Implementing chain-of-thought prompting
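As a hedged illustration of those prompt-engineering principles, a single complex prompt with negative phrasing might be rewritten into focused, positively phrased steps that invite chain-of-thought reasoning; the wording below is invented, not taken from DoorDash's prompts.

```python
# Illustrative prompt rewrite applying the stated principles; wording is invented.

# Before: one complex prompt relying on negative phrasing.
monolithic_prompt = (
    "Don't use information outside the articles, don't guess, and answer the "
    "Dasher's questions about pay, equipment, and scheduling all at once."
)

# After: broken into focused steps, positive phrasing, chain-of-thought.
step_prompts = [
    "Identify which single topic the Dasher is asking about: pay, equipment, or scheduling.",
    "Quote the sentences from the retrieved articles that address that topic.",
    "Think step by step, then write a concise answer using only the quoted sentences.",
]
```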
A particularly interesting aspect of their LLMOps approach is their regression prevention strategy. They've implemented an open-source evaluation tool that works much like unit testing in software development, which allows them to iterate quickly on prompts while maintaining quality standards. Any prompt change triggers a set of predefined tests, and newly discovered issues are systematically added to the test suites (a sketch of this pattern follows).
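The case study does not name the specific tool, but such prompt regression tests could look roughly like pytest-style unit tests; the entry point `answer_request_from_text` and the test cases below are hypothetical.

```python
# Prompt regression tests in the style of unit tests; module, cases, and
# assertions are hypothetical illustrations of the described pattern.
import pytest
from support_bot import answer_request_from_text  # hypothetical entry point

# Each previously fixed issue becomes a permanent regression case.
@pytest.mark.parametrize("question,required_phrase", [
    ("My hot bag was damaged, can I get a replacement?", "replacement"),
    ("How do I schedule a dash for tomorrow?", "schedule"),
])
def test_answers_stay_on_topic_after_prompt_changes(question, required_phrase):
    answer = answer_request_from_text(question)
    # Any prompt change must still produce an on-topic answer for known cases.
    assert required_phrase in answer.lower()
```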
The case study also highlights important operational considerations around latency and fallback strategies. When the guardrail check would introduce too much latency, they strategically default to human agents rather than compromise on response time or quality (a rough sketch of this pattern follows). This demonstrates a practical approach to balancing automation with user experience.
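The fallback behavior might look roughly like the timeout pattern below; the five-second budget and the function names are assumptions, not DoorDash's actual values.

```python
# Latency fallback sketch: if guardrail-checked generation exceeds the latency
# budget, escalate to a human agent instead of delaying the Dasher.
import asyncio

LATENCY_BUDGET_SECONDS = 5.0  # illustrative budget

async def respond_or_escalate(conversation, generate_checked_response, escalate_to_agent):
    try:
        return await asyncio.wait_for(
            generate_checked_response(conversation), timeout=LATENCY_BUDGET_SECONDS
        )
    except asyncio.TimeoutError:
        # Rather than compromise on response time or quality, hand off to a human.
        return await escalate_to_agent(conversation)
```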
From an LLMOps perspective, their monitoring and improvement pipeline is particularly sophisticated. They combine automated evaluation through LLM Judge with human review of random transcript samples, ensuring continuous calibration between automated and human evaluation. This dual approach helps maintain high quality while scaling the system.
The system successfully handles thousands of support requests daily, but they maintain a realistic view of its limitations. They acknowledge that complex support scenarios still require human expertise and see the system as complementary to human agents rather than a complete replacement.
Looking forward, they're focused on:
* Developing more precise quality assessment methods
* Expanding the system's capabilities to handle increasing complexity
* Continuous data collection and analysis for system improvements
* Enhanced ontology and RAG system development
This case study provides valuable insights into practical LLMOps implementation, particularly in handling the challenges of quality control, monitoring, and continuous improvement in a production environment. Their approach demonstrates how to balance the benefits of LLM automation with the need for reliability and quality in a customer-facing application.