DoorDash implemented a RAG-based chatbot system to improve its Dasher support automation, replacing a traditional flow-based system. The team developed a comprehensive quality control approach combining an LLM Guardrail for real-time response verification, an LLM Judge for quality monitoring, and an iterative improvement pipeline. The system reduced hallucinations by 90% and severe compliance issues by 99%, while handling thousands of support requests daily and allowing human agents to focus on more complex cases.
This case study from DoorDash is a useful example of running LLMs in a production environment with a strong focus on quality control and monitoring. The company needed to improve its support system for delivery contractors (Dashers), who need quick and accurate assistance during their deliveries.
The core system architecture is based on a Retrieval Augmented Generation (RAG) approach that leverages their existing knowledge base articles. What makes this case study particularly interesting from an LLMOps perspective is their comprehensive approach to quality control and monitoring, implementing three main components: the RAG system itself, an LLM Guardrail system, and an LLM Judge for quality evaluation.
The RAG implementation follows a systematic approach:
The team identified several critical challenges in their LLM deployment:
Their LLM Guardrail system is particularly noteworthy as an example of practical quality control in production. They initially tested a sophisticated model-based approach but found it too expensive and slow. Instead, they developed a two-tier system:
This pragmatic approach to guardrails demonstrates a good balance between quality control and system performance. The guardrail system validates responses for groundedness, coherence, and policy compliance. The implementation successfully reduced hallucinations by 90% and severe compliance issues by 99%.
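The cost-saving idea behind a tiered guardrail can be sketched as follows. This is an illustration under stated assumptions, not the system described in the case study: the first tier here is a simple token-overlap score against the retrieved context, the second tier is an arbitrary `llm_check` callable standing in for a model-based evaluator, and the `0.8` threshold is made up.

```python
import re
from typing import Callable


def token_overlap(response: str, context: str) -> float:
    """Fraction of distinct response tokens that also appear in the context."""
    resp = set(re.findall(r"[a-z0-9]+", response.lower()))
    ctx = set(re.findall(r"[a-z0-9]+", context.lower()))
    return len(resp & ctx) / len(resp) if resp else 0.0


def guardrail(
    response: str,
    context: str,
    llm_check: Callable[[str, str], bool],  # hypothetical model-based evaluator
    threshold: float = 0.8,                 # assumed value, tune on real data
) -> bool:
    """Return True if the response may be sent to the user."""
    # Tier 1: cheap lexical check. Responses that closely track the
    # retrieved context pass without any model call.
    if token_overlap(response, context) >= threshold:
        return True
    # Tier 2: only the uncertain cases pay for the slower model-based check.
    return llm_check(response, context)
```

Because most well-grounded responses pass the first tier, the expensive check runs only on the minority of responses that look suspicious.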
The quality monitoring system (LLM Judge) evaluates five key aspects:
Their quality improvement pipeline is comprehensive and includes:
A particularly interesting aspect of their LLMOps approach is their regression prevention strategy. They use an open-source evaluation tool that works much like unit testing in software development, which lets them iterate quickly on prompts while maintaining quality standards. Any prompt change triggers a set of predefined tests, and newly discovered issues are systematically added to the test suites.
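The unit-testing analogy can be sketched in a few lines. This is a hand-rolled illustration, not the open-source tool the team uses: `PromptTest`, `run_suite`, `fake_model`, and the test cases are all hypothetical, and `fake_model` stands in for a real call that combines the prompt under test with a model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PromptTest:
    """One regression case: an input plus simple checks on the output."""
    name: str
    user_input: str
    must_contain: List[str] = field(default_factory=list)
    must_not_contain: List[str] = field(default_factory=list)


def run_suite(generate: Callable[[str], str], tests: List[PromptTest]) -> List[str]:
    """Run every test against `generate` and return the names of failures."""
    failures = []
    for test in tests:
        output = generate(test.user_input).lower()
        has_required = all(s.lower() in output for s in test.must_contain)
        has_forbidden = any(s.lower() in output for s in test.must_not_contain)
        if not has_required or has_forbidden:
            failures.append(test.name)
    return failures


def fake_model(user_input: str) -> str:
    """Hypothetical stand-in for prompt + model; replace with a real call."""
    if "bank" in user_input.lower():
        return "Please contact support to update your bank details."
    return "I can guarantee a $50 bonus."


SUITE = [
    PromptTest("bank-update", "How do I change my bank account?",
               must_contain=["support"], must_not_contain=["guarantee"]),
    # Added after a (hypothetical) incident where the bot promised payouts.
    PromptTest("no-promises", "Will I get a bonus?",
               must_not_contain=["guarantee"]),
]
```

Running `run_suite(fake_model, SUITE)` after each prompt change reports which cases regressed, and each newly found issue becomes another `PromptTest` entry.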
The case study also highlights important operational considerations around latency and fallback strategies. When the guardrail system introduces too much latency, they strategically default to human agents rather than compromise on response time or quality. This demonstrates a practical approach to balancing automation with user experience.
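A latency budget with a human fallback might look like the sketch below. The function names, the `HUMAN_HANDOFF` sentinel, and the timing values are assumptions for illustration; the case study does not describe how the fallback is implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

HUMAN_HANDOFF = "ESCALATE_TO_HUMAN"  # hypothetical sentinel for routing


def respond_with_fallback(
    check: Callable[[str], bool],  # guardrail check, may be slow
    response: str,
    budget_s: float,               # assumed latency budget in seconds
) -> str:
    """Send the response only if the guardrail passes within the budget."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check, response)
    try:
        passed = future.result(timeout=budget_s)
    except FutureTimeout:
        # The check took too long: hand off rather than keep the user waiting.
        return HUMAN_HANDOFF
    finally:
        # Do not block on a check that is still running.
        pool.shutdown(wait=False)
    return response if passed else HUMAN_HANDOFF


def slow_guardrail(response: str) -> bool:
    """Hypothetical guardrail that exceeds a tight latency budget."""
    time.sleep(0.3)
    return True
```

With this shape, both a failed check and a timed-out check route to a human agent, so an unverified response is never sent.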
From an LLMOps perspective, their monitoring and improvement pipeline is particularly sophisticated. They combine automated evaluation through LLM Judge with human review of random transcript samples, ensuring continuous calibration between automated and human evaluation. This dual approach helps maintain high quality while scaling the system.
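One way to track calibration between an automated judge and human reviewers is to compare their verdicts on the same random sample of transcripts. The sketch below assumes binary pass/fail labels and uses raw agreement plus Cohen's kappa, a standard chance-corrected agreement statistic. The case study does not say which metrics the team uses, so the metric choice and every name here are assumptions.

```python
import random
from typing import List, Sequence


def sample_transcripts(ids: Sequence[int], n: int, seed: int = 0) -> List[int]:
    """Pick a reproducible random sample of transcript IDs for human review."""
    rng = random.Random(seed)
    return rng.sample(list(ids), min(n, len(ids)))


def agreement_rate(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Share of transcripts where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    p_observed = agreement_rate(judge, human)
    p_judge = sum(judge) / len(judge)   # judge's pass rate
    p_human = sum(human) / len(human)   # human's pass rate
    p_chance = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if p_chance == 1.0:
        return 1.0  # both raters always give the same single label
    return (p_observed - p_chance) / (1 - p_chance)
```

A falling kappa over successive samples would suggest that the judge prompt or the review guidelines have drifted and need recalibration.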
The system successfully handles thousands of support requests daily, but they maintain a realistic view of its limitations. They acknowledge that complex support scenarios still require human expertise and see the system as complementary to human agents rather than a complete replacement.
Looking forward, they’re focused on:
This case study provides valuable insights into practical LLMOps implementation, particularly in handling the challenges of quality control, monitoring, and continuous improvement in a production environment. Their approach demonstrates how to balance the benefits of LLM automation with the need for reliability and quality in a customer-facing application.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model that treats charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection and improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs. Internally, 8,500 employees use LLM tools daily and 65-70% of engineers use AI coding assistants, with significant productivity gains such as reducing payment method integrations from 2 months to 2 weeks.
Coinbase, a cryptocurrency exchange serving millions of users across 100+ countries, faced challenges scaling customer support amid volatile market conditions, managing complex compliance investigations, and improving developer productivity. They built a comprehensive Gen AI platform integrating multiple LLMs through standardized interfaces (OpenAI API, Model Context Protocol) on AWS Bedrock to address these challenges. Their solution includes AI-powered chatbots handling 65% of customer contacts automatically (saving ~5 million employee hours annually), compliance investigation tools that synthesize data from multiple sources to accelerate case resolution, and developer productivity tools where 40% of daily code is now AI-generated or influenced. The implementation uses a multi-layered agentic architecture with RAG, guardrails, memory systems, and human-in-the-loop workflows, resulting in significant cost savings, faster resolution times, and improved quality across all three domains.