Weights & Biases documented their journey refactoring Wandbot, their LLM-powered documentation assistant, achieving significant improvements in both accuracy (72% to 81%) and latency (84% reduction). The team initially attempted a "refactor-first, evaluate-later" approach but discovered the necessity of systematic evaluation throughout the process. Through methodical testing and iterative improvements, they replaced multiple components, including switching from FAISS to ChromaDB for vector storage, transitioning to LangChain Expression Language (LCEL) for better async operations, and optimizing their RAG pipeline. Their experience highlighted the importance of continuous evaluation in LLM system development, with the team conducting around 50 unique evaluations costing approximately $2,500 to debug and optimize their refactored system.
Weights & Biases developed Wandbot, an LLM-powered documentation assistant designed to help users navigate their technical documentation. This case study documents a significant refactoring effort that aimed to address performance inefficiencies while maintaining or improving accuracy. The team’s journey provides valuable insights into the challenges of maintaining and improving production LLM systems, particularly around the importance of systematic evaluation and the hidden complexities of seemingly straightforward refactoring work.
The Wandbot system is a Retrieval Augmented Generation (RAG) pipeline that ingests documentation, stores it in a vector database, and uses LLM-based response synthesis to answer user queries. The system was deployed across multiple client applications including Slack, Discord, and Zendesk, making performance and reliability critical production concerns.
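The overall shape of such a pipeline can be sketched in a few lines of framework-agnostic Python. This is a minimal illustration, not Wandbot's actual implementation: the retriever and response synthesizer below are stand-in stubs, and the toy documents are invented.

```python
# Minimal sketch of a RAG pipeline: retrieve relevant chunks, then
# synthesize an answer grounded in them. The retriever and "LLM" below
# are illustrative stubs, not Wandbot's actual components.

DOCS = {
    "launch": "Use wandb.init() to start a run and log metrics.",
    "sweeps": "Sweeps automate hyperparameter search across runs.",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy keyword retriever; production systems rank by vector similarity."""
    scored = [(sum(w in text.lower() for w in query.lower().split()), text)
              for text in DOCS.values()]
    scored.sort(reverse=True)
    return [text for score, text in scored[:k] if score > 0]

def synthesize(query: str, contexts: list[str]) -> str:
    """Stand-in for an LLM call that answers from the retrieved context."""
    if not contexts:
        return "I could not find relevant documentation."
    return f"Based on the docs: {contexts[0]}"

answer = synthesize("how do I start a run", retrieve("how do I start a run"))
print(answer)
```

In the real system, each stage (ingestion, retrieval, synthesis) is a separately tunable component, which is what makes the per-component changes described below measurable.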
Before the refactoring effort, the team identified several key inefficiencies in their production system. They initially assumed that refactoring would not significantly impact performance, planning to evaluate at the end of the refactor and address any degradation then. This assumption proved dangerously incorrect: initial evaluations showed accuracy dropping from ~70% to ~23% after the refactor.
One of the most impactful changes was replacing FAISS (Facebook AI Similarity Search) with ChromaDB as the vector store. This migration delivered an approximately 69% reduction in retrieval latency and enabled document-level metadata storage and filtering, which proved particularly valuable for improving retrieval relevance. Interestingly, the team found that their embedding model choice interacted with the vector store selection: text-embedding-3-small worked better with ChromaDB than text-embedding-ada-002, although this wasn't initially apparent and required extensive evaluation to discover.
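The value of metadata filtering is easy to see with a toy in-memory store. This sketch uses invented vectors and metadata; ChromaDB exposes the analogous capability via the `where` argument to `collection.query`.

```python
import math

# Toy in-memory vector store illustrating document-level metadata
# filtering. Vectors, metadata, and texts here are all invented.
store = [
    {"vec": [1.0, 0.0], "meta": {"source": "docs"}, "text": "wandb.init usage"},
    {"vec": [0.9, 0.1], "meta": {"source": "blog"}, "text": "launch announcement"},
    {"vec": [0.0, 1.0], "meta": {"source": "docs"}, "text": "sweep configuration"},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def query(vec, where=None, k=1):
    """Filter on metadata first, then rank the survivors by similarity."""
    candidates = [d for d in store
                  if where is None or all(d["meta"].get(key) == val
                                          for key, val in where.items())]
    candidates.sort(key=lambda d: cosine(vec, d["vec"]), reverse=True)
    return [d["text"] for d in candidates[:k]]

# Restricting to official docs excludes the nearly-as-similar blog post.
print(query([1.0, 0.0], where={"source": "docs"}))
```

Filtering before ranking means an off-topic but geometrically close document can never crowd out an on-topic one, which is one way metadata improves retrieval relevance.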
The team also made substantial improvements to the data ingestion pipeline.
The team split the RAG pipeline into three major components: query enhancement, retrieval, and response synthesis. This modular architecture made it easier to tune each component independently and measure the impact of changes on evaluation metrics. The query enhancement stage was consolidated from multiple sequential LLM calls to a single call, improving both speed and performance.
A notable addition was the sub-query answering step in the response synthesis module. By breaking down complex queries into sub-queries and generating responses for each, then synthesizing a final answer, the system achieved improved completeness and relevance of generated responses.
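The sub-query answering pattern can be sketched as three stages: decompose, answer each piece, synthesize. All three LLM calls below are stubbed with plain functions for illustration; in the real system each stage would be a model invocation (and the sub-answers would draw on retrieval).

```python
# Sketch of sub-query answering: decompose a complex query, answer each
# sub-query, then synthesize a final response. The three functions are
# stand-ins for what would be LLM calls in production.

def decompose(query: str) -> list[str]:
    # Stub: a real implementation would prompt an LLM to split the query.
    return [part.strip() + "?" for part in query.rstrip("?").split(" and ")]

def answer_sub_query(sub_query: str) -> str:
    # Stub: a real implementation would run retrieval + generation here.
    return f"Answer to '{sub_query}'"

def synthesize(query: str, sub_answers: list[str]) -> str:
    # Stub: a real implementation would prompt an LLM with all sub-answers.
    return f"Q: {query}\n" + "\n".join(sub_answers)

query = "How do I log metrics and how do I resume a run?"
sub_queries = decompose(query)
final = synthesize(query, [answer_sub_query(sq) for sq in sub_queries])
print(final)
```

Because each sub-query gets its own retrieval-and-generation pass, no part of a compound question is starved of context, which is where the completeness gains come from.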
The original implementation used a combination of Instructor and llama-index, which created coordination challenges with asynchronous API calls and multiple potential points of failure. The team transitioned to LangChain Expression Language (LCEL), which natively supports asynchronous API calls, optimized parallel execution, retries, and fallbacks.
This transition was not straightforward, as LCEL did not directly replicate all functionality from the previous libraries. For example, Instructor featured Pydantic validators for checking function outputs, with the ability to re-ask the LLM using the validation errors; this functionality is not natively supported by LCEL. The team developed a custom re-ask loop within the LangChain framework to address the gap.
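The core of such a re-ask loop is framework-agnostic: validate the model's structured output and, on failure, re-prompt with the validation error appended. This is a minimal sketch, not Wandbot's actual code; `fake_llm` and `validate` are stand-ins for a real chat-model call and a Pydantic model's validators.

```python
import json

# Sketch of a re-ask loop: parse and validate the model's structured
# output, and on failure re-prompt with the error message included.

def validate(payload: dict) -> dict:
    # Plays the role of a Pydantic model's field validators.
    if "answer" not in payload:
        raise ValueError("field 'answer' is required")
    return payload

def fake_llm(prompt: str) -> str:
    # Stand-in LLM: the first call returns a malformed reply, and the
    # re-ask (whose prompt contains the error text) returns a valid one.
    if "is required" in prompt:
        return json.dumps({"answer": "42"})
    return json.dumps({"result": "42"})

def ask_with_reask(prompt: str, max_retries: int = 3) -> dict:
    last_error = ""
    for _ in range(max_retries):
        raw = fake_llm(prompt + last_error)
        try:
            return validate(json.loads(raw))
        except (ValueError, json.JSONDecodeError) as err:
            # Re-ask: show the model exactly what went wrong.
            last_error = f"\nYour last reply was invalid: {err}. Try again."
    raise RuntimeError("LLM output failed validation after retries")

result = ask_with_reask("Answer as JSON with an 'answer' field.")
print(result)
```

Feeding the error text back, rather than simply retrying the same prompt, is what lets the model correct a specific structural mistake instead of repeating it.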
The team also faced implementation challenges with LCEL primitives like RunnableAssign and RunnableParallel, which were initially applied inconsistently, leading to errors and performance issues. As their understanding of these primitives improved, they were able to correct their approach and optimize performance.
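The semantics of these two primitives are easy to confuse. A rough plain-Python mental model (the actual classes live in `langchain_core.runnables`, and `RunnablePassthrough.assign` is the usual way to construct a RunnableAssign):

```python
# Plain-Python mental model of two LCEL primitives. RunnableParallel
# runs independent branches on the same input and merges their outputs
# into a fresh dict; RunnableAssign takes a dict input and adds keys
# to it while preserving the existing ones.

def runnable_parallel(branches: dict, value):
    """Each branch sees the same input; the result is a new dict."""
    return {name: fn(value) for name, fn in branches.items()}

def runnable_assign(additions: dict, state: dict) -> dict:
    """Input must already be a dict; new keys are layered on top."""
    return {**state, **{name: fn(state) for name, fn in additions.items()}}

state = runnable_parallel(
    {"query": lambda q: q, "keywords": lambda q: q.split()},
    "vector store latency",
)
state = runnable_assign({"n_keywords": lambda s: len(s["keywords"])}, state)
print(state)
```

The common mistake the distinction guards against: RunnableParallel discards the incoming value entirely (only the branch outputs survive), whereas assign requires a dict and extends it, so swapping one for the other silently drops or rejects state.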
When the team first evaluated their refactored branch with LiteLLM (added to make the system more configurable across vendors), they scored only ~23% accuracy compared to the deployed v1.1 system’s ~70% accuracy. Even attempting to reproduce v1.1 results with the refactored branch yielded only ~25% accuracy.
The team adopted a systematic cherry-picking approach, taking individual commits from the refactored branch and evaluating each one. When accuracy dropped, they either reverted the change or experimented with alternatives. This process was described as “tedious and time-consuming” but ultimately successful.
A key discovery during this process was the importance of evaluation speed. Initial evaluations took an average of 2 hours and 17 minutes each, severely limiting iteration speed. By making the evaluation script purely asynchronous, the team reduced evaluation time to under 10 minutes, enabling many more experiments per day.
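The speedup comes from running evaluation samples concurrently rather than one at a time, while capping in-flight requests to respect API rate limits. A minimal sketch with `asyncio` (the judge call is a stub with simulated latency, not a real LLM-as-judge API):

```python
import asyncio
import random

# Sketch of an async evaluation loop: score samples concurrently with
# asyncio.gather, bounded by a semaphore. `judge_one` is a stand-in for
# an LLM-as-judge API call.

async def judge_one(sample: int, sem: asyncio.Semaphore) -> float:
    async with sem:                  # cap concurrent API calls
        await asyncio.sleep(0.01)    # stand-in for network latency
        return random.random()       # stand-in for a judge score

async def evaluate(samples: range, max_concurrency: int = 8) -> float:
    sem = asyncio.Semaphore(max_concurrency)
    scores = await asyncio.gather(*(judge_one(s, sem) for s in samples))
    return sum(scores) / len(scores)

mean_score = asyncio.run(evaluate(range(32)))
print(f"mean judge score: {mean_score:.2f}")
```

With sequential calls the wall-clock time is the sum of per-sample latencies; with this pattern it approaches the slowest batch, which is how a multi-hour run can collapse to minutes.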
The team ran nearly 50 unique evaluations, at a cost of approximately $2,500 in LLM API calls, to debug the refactored system. This investment ultimately paid off with a final accuracy of 81.63%, an improvement of roughly 9 percentage points over the baseline.
The team utilized W&B Weave for tracing and observability, using the lightweight weave.op() decorator to automatically trace functions and class methods. This enabled them to examine complex data transfer in intermediate steps and better debug the LLM-based system. The ability to observe intermediate steps and LLM calls was described as essential for debugging their complex pipeline.
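The shape of this pattern is a lightweight decorator that records each call's inputs, outputs, and latency. The sketch below is a stdlib stand-in for illustration only; the real `weave.op()` decorator additionally logs traces to W&B rather than to an in-memory list, and `enhance_query` is an invented example function.

```python
import functools
import time

# Minimal stand-in for a weave.op()-style tracing decorator: record the
# name, inputs, output, and latency of every decorated call.
TRACE: list[dict] = []

def op(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE.append({
            "name": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "seconds": time.perf_counter() - start,
        })
        return result
    return wrapper

@op
def enhance_query(q: str) -> str:
    # Invented example of an intermediate pipeline step worth tracing.
    return q.strip().lower()

enhance_query("  How do I resume a run?  ")
print(TRACE[0]["name"], "->", TRACE[0]["output"])
```

Because the decorator wraps functions non-invasively, every intermediate step of a multi-stage pipeline becomes observable without restructuring the pipeline itself, which is what makes this style of tracing cheap to adopt.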
The final results of the refactoring effort were impressive: accuracy improved from roughly 72% to 81.63%, and latency dropped by 84%. The team deployed the system on Replit, and the improvements enabled practical use across their integration channels (Slack, Discord, Zendesk).
The team’s experience yielded several important LLMOps lessons:
Evaluation as a Core Practice: The team learned the hard way that assuming refactoring won’t impact performance is dangerous. They strongly recommend making evaluation central to the development process and ensuring changes lead to measurable enhancements.
Evaluation Pipeline Performance: A slow evaluation pipeline becomes a bottleneck for experimentation. Investing time in optimizing the evaluation infrastructure paid significant dividends in iteration speed.
Non-Deterministic Evaluation: When evaluating LLM-based systems with an LLM as a judge, scores are not deterministic. The team recommends averaging across multiple evaluations while considering the costs of doing so.
Component Interactions: Changes to one component (like embedding models or LLM versions) can have non-linear and unexpected interactions with other components. What doesn’t work initially might work well later in combination with other changes, and vice versa.
Retaining Critical Components: During refactoring, it’s easy to accidentally remove critical components like few-shot prompts. Documenting and carefully tracking such configurations is essential to avoid unintentional performance degradation.
Iterative, Systematic Approach: The team recommends approaching refactoring iteratively with systematic testing and evaluation at each step to identify and address issues promptly rather than discovering major problems at the end.
This case study serves as a cautionary tale about the hidden complexity of LLM system refactoring while also demonstrating that systematic, evaluation-driven approaches can yield significant improvements in both accuracy and performance.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.
This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.