Weights & Biases documented their journey refactoring Wandbot, their LLM-powered documentation assistant, achieving significant improvements in both accuracy (72% to 81%) and latency (84% reduction). The team initially attempted a "refactor-first, evaluate-later" approach but discovered the necessity of systematic evaluation throughout the process. Through methodical testing and iterative improvements, they replaced multiple components, including switching from FAISS to ChromaDB for vector storage, transitioning to LangChain Expression Language (LCEL) for better async operations, and optimizing their RAG pipeline. Their experience highlighted the importance of continuous evaluation in LLM system development, with the team conducting around 50 unique evaluations costing approximately $2,500 to debug and optimize their refactored system.
Weights & Biases developed Wandbot, an LLM-powered documentation assistant designed to help users navigate their technical documentation. This case study documents a significant refactoring effort that aimed to address performance inefficiencies while maintaining or improving accuracy. The team’s journey provides valuable insights into the challenges of maintaining and improving production LLM systems, particularly around the importance of systematic evaluation and the hidden complexities of seemingly straightforward refactoring work.
The Wandbot system is a Retrieval Augmented Generation (RAG) pipeline that ingests documentation, stores it in a vector database, and uses LLM-based response synthesis to answer user queries. The system was deployed across multiple client applications including Slack, Discord, and Zendesk, making performance and reliability critical production concerns.
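The overall shape of such a pipeline can be sketched in a few lines of framework-agnostic Python. This is a minimal illustration, not Wandbot's actual implementation: the retriever and response synthesizer below are stand-in stubs, and the toy documents are invented.

```python
# Minimal sketch of a RAG pipeline: retrieve relevant chunks, then
# synthesize an answer grounded in them. The retriever and "LLM" below
# are illustrative stubs, not Wandbot's actual components.

DOCS = {
    "launch": "Use wandb.init() to start a run and log metrics.",
    "sweeps": "Sweeps automate hyperparameter search across runs.",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy keyword retriever; production systems rank by vector similarity."""
    scored = [(sum(w in text.lower() for w in query.lower().split()), text)
              for text in DOCS.values()]
    scored.sort(reverse=True)
    return [text for score, text in scored[:k] if score > 0]

def synthesize(query: str, contexts: list[str]) -> str:
    """Stand-in for an LLM call that answers from the retrieved context."""
    if not contexts:
        return "I could not find relevant documentation."
    return f"Based on the docs: {contexts[0]}"

answer = synthesize("how do I start a run", retrieve("how do I start a run"))
print(answer)
```

In the real system, each stage (ingestion, retrieval, synthesis) is a separately tunable component, which is what makes the per-component changes described below measurable.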
Before the refactoring effort, the team identified several key inefficiencies in their production system. They initially assumed that refactoring would not significantly impact performance, planning to evaluate at the end of the refactor and address any degradation then. This assumption proved dangerously incorrect: initial evaluations showed accuracy dropping from ~70% to ~23% after the refactor.
One of the most impactful changes was replacing FAISS (Facebook AI Similarity Search) with ChromaDB as the vector store. This migration delivered an approximately 69% reduction in retrieval latency and enabled document-level metadata storage and filtering, which proved particularly valuable for improving retrieval relevance. Interestingly, the team found that their embedding model choice interacted with the vector store selection: text-embedding-3-small worked better with ChromaDB than text-embedding-ada-002, although this wasn't initially apparent and required extensive evaluation to discover.
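The value of metadata filtering is easy to see with a toy in-memory store. This sketch uses invented vectors and metadata; ChromaDB exposes the analogous capability via the `where` argument to `collection.query`.

```python
import math

# Toy in-memory vector store illustrating document-level metadata
# filtering. Vectors, metadata, and texts here are all invented.
store = [
    {"vec": [1.0, 0.0], "meta": {"source": "docs"}, "text": "wandb.init usage"},
    {"vec": [0.9, 0.1], "meta": {"source": "blog"}, "text": "launch announcement"},
    {"vec": [0.0, 1.0], "meta": {"source": "docs"}, "text": "sweep configuration"},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def query(vec, where=None, k=1):
    """Filter on metadata first, then rank the survivors by similarity."""
    candidates = [d for d in store
                  if where is None or all(d["meta"].get(key) == val
                                          for key, val in where.items())]
    candidates.sort(key=lambda d: cosine(vec, d["vec"]), reverse=True)
    return [d["text"] for d in candidates[:k]]

# Restricting to official docs excludes the nearly-as-similar blog post.
print(query([1.0, 0.0], where={"source": "docs"}))
```

Filtering before ranking means an off-topic but geometrically close document can never crowd out an on-topic one, which is one way metadata improves retrieval relevance.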
The team also made substantial improvements to the data ingestion pipeline.
The team split the RAG pipeline into three major components: query enhancement, retrieval, and response synthesis. This modular architecture made it easier to tune each component independently and measure the impact of changes on evaluation metrics. The query enhancement stage was consolidated from multiple sequential LLM calls to a single call, improving both speed and performance.
A notable addition was the sub-query answering step in the response synthesis module. By breaking down complex queries into sub-queries and generating responses for each, then synthesizing a final answer, the system achieved improved completeness and relevance of generated responses.
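The sub-query answering pattern can be sketched as three stages: decompose, answer each piece, synthesize. All three LLM calls below are stubbed with plain functions for illustration; in the real system each stage would be a model invocation (and the sub-answers would draw on retrieval).

```python
# Sketch of sub-query answering: decompose a complex query, answer each
# sub-query, then synthesize a final response. The three functions are
# stand-ins for what would be LLM calls in production.

def decompose(query: str) -> list[str]:
    # Stub: a real implementation would prompt an LLM to split the query.
    return [part.strip() + "?" for part in query.rstrip("?").split(" and ")]

def answer_sub_query(sub_query: str) -> str:
    # Stub: a real implementation would run retrieval + generation here.
    return f"Answer to '{sub_query}'"

def synthesize(query: str, sub_answers: list[str]) -> str:
    # Stub: a real implementation would prompt an LLM with all sub-answers.
    return f"Q: {query}\n" + "\n".join(sub_answers)

query = "How do I log metrics and how do I resume a run?"
sub_queries = decompose(query)
final = synthesize(query, [answer_sub_query(sq) for sq in sub_queries])
print(final)
```

Because each sub-query gets its own retrieval-and-generation pass, no part of a compound question is starved of context, which is where the completeness gains come from.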
The original implementation used a combination of Instructor and llama-index, which created coordination challenges with asynchronous API calls and multiple potential points of failure. The team transitioned to LangChain Expression Language (LCEL), which natively supports asynchronous API calls, optimized parallel execution, retries, and fallbacks.
This transition was not straightforward, as LCEL did not directly replicate all functionality from the previous libraries. For example, Instructor featured Pydantic validators for checking function outputs, with the ability to re-ask the LLM using the validation errors; this functionality is not natively supported by LCEL. The team developed a custom re-ask loop within the LangChain framework to address the gap.
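The core of such a re-ask loop is framework-agnostic: validate the model's structured output and, on failure, re-prompt with the validation error appended. This is a minimal sketch, not Wandbot's actual code; `fake_llm` and `validate` are stand-ins for a real chat-model call and a Pydantic model's validators.

```python
import json

# Sketch of a re-ask loop: parse and validate the model's structured
# output, and on failure re-prompt with the error message included.

def validate(payload: dict) -> dict:
    # Plays the role of a Pydantic model's field validators.
    if "answer" not in payload:
        raise ValueError("field 'answer' is required")
    return payload

def fake_llm(prompt: str) -> str:
    # Stand-in LLM: the first call returns a malformed reply, and the
    # re-ask (whose prompt contains the error text) returns a valid one.
    if "is required" in prompt:
        return json.dumps({"answer": "42"})
    return json.dumps({"result": "42"})

def ask_with_reask(prompt: str, max_retries: int = 3) -> dict:
    last_error = ""
    for _ in range(max_retries):
        raw = fake_llm(prompt + last_error)
        try:
            return validate(json.loads(raw))
        except (ValueError, json.JSONDecodeError) as err:
            # Re-ask: show the model exactly what went wrong.
            last_error = f"\nYour last reply was invalid: {err}. Try again."
    raise RuntimeError("LLM output failed validation after retries")

result = ask_with_reask("Answer as JSON with an 'answer' field.")
print(result)
```

Feeding the error text back, rather than simply retrying the same prompt, is what lets the model correct a specific structural mistake instead of repeating it.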
The team also faced implementation challenges with LCEL primitives like RunnableAssign and RunnableParallel, which were initially applied inconsistently, leading to errors and performance issues. As their understanding of these primitives improved, they were able to correct their approach and optimize performance.
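The semantics of these two primitives are easy to confuse. A rough plain-Python mental model (the actual classes live in `langchain_core.runnables`, and `RunnablePassthrough.assign` is the usual way to construct a RunnableAssign):

```python
# Plain-Python mental model of two LCEL primitives. RunnableParallel
# runs independent branches on the same input and merges their outputs
# into a fresh dict; RunnableAssign takes a dict input and adds keys
# to it while preserving the existing ones.

def runnable_parallel(branches: dict, value):
    """Each branch sees the same input; the result is a new dict."""
    return {name: fn(value) for name, fn in branches.items()}

def runnable_assign(additions: dict, state: dict) -> dict:
    """Input must already be a dict; new keys are layered on top."""
    return {**state, **{name: fn(state) for name, fn in additions.items()}}

state = runnable_parallel(
    {"query": lambda q: q, "keywords": lambda q: q.split()},
    "vector store latency",
)
state = runnable_assign({"n_keywords": lambda s: len(s["keywords"])}, state)
print(state)
```

The common mistake the distinction guards against: RunnableParallel discards the incoming value entirely (only the branch outputs survive), whereas assign requires a dict and extends it, so swapping one for the other silently drops or rejects state.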
When the team first evaluated their refactored branch with LiteLLM (added to make the system more configurable across vendors), they scored only ~23% accuracy compared to the deployed v1.1 system’s ~70% accuracy. Even attempting to reproduce v1.1 results with the refactored branch yielded only ~25% accuracy.
The team adopted a systematic cherry-picking approach, taking individual commits from the refactored branch and evaluating each one. When accuracy dropped, they either reverted the change or experimented with alternatives. This process was described as “tedious and time-consuming” but ultimately successful.
A key discovery during this process was the importance of evaluation speed. Initial evaluations took an average of 2 hours and 17 minutes each, severely limiting iteration speed. By making the evaluation script purely asynchronous, the team reduced evaluation time to under 10 minutes, enabling many more experiments per day.
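The speedup comes from running evaluation samples concurrently rather than one at a time, while capping in-flight requests to respect API rate limits. A minimal sketch with `asyncio` (the judge call is a stub with simulated latency, not a real LLM-as-judge API):

```python
import asyncio
import random

# Sketch of an async evaluation loop: score samples concurrently with
# asyncio.gather, bounded by a semaphore. `judge_one` is a stand-in for
# an LLM-as-judge API call.

async def judge_one(sample: int, sem: asyncio.Semaphore) -> float:
    async with sem:                  # cap concurrent API calls
        await asyncio.sleep(0.01)    # stand-in for network latency
        return random.random()       # stand-in for a judge score

async def evaluate(samples: range, max_concurrency: int = 8) -> float:
    sem = asyncio.Semaphore(max_concurrency)
    scores = await asyncio.gather(*(judge_one(s, sem) for s in samples))
    return sum(scores) / len(scores)

mean_score = asyncio.run(evaluate(range(32)))
print(f"mean judge score: {mean_score:.2f}")
```

With sequential calls the wall-clock time is the sum of per-sample latencies; with this pattern it approaches the slowest batch, which is how a multi-hour run can collapse to minutes.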
The team ran nearly 50 unique evaluations, at a cost of approximately $2,500 in LLM API calls, to debug the refactored system. This investment ultimately paid off with a final accuracy of 81.63%, an improvement of roughly 9 percentage points over the baseline.
The team utilized W&B Weave for tracing and observability, using the lightweight weave.op() decorator to automatically trace functions and class methods. This enabled them to examine complex data transfer in intermediate steps and better debug the LLM-based system. The ability to observe intermediate steps and LLM calls was described as essential for debugging their complex pipeline.
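The shape of this pattern is a lightweight decorator that records each call's inputs, outputs, and latency. The sketch below is a stdlib stand-in for illustration only; the real `weave.op()` decorator additionally logs traces to W&B rather than to an in-memory list, and `enhance_query` is an invented example function.

```python
import functools
import time

# Minimal stand-in for a weave.op()-style tracing decorator: record the
# name, inputs, output, and latency of every decorated call.
TRACE: list[dict] = []

def op(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE.append({
            "name": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "seconds": time.perf_counter() - start,
        })
        return result
    return wrapper

@op
def enhance_query(q: str) -> str:
    # Invented example of an intermediate pipeline step worth tracing.
    return q.strip().lower()

enhance_query("  How do I resume a run?  ")
print(TRACE[0]["name"], "->", TRACE[0]["output"])
```

Because the decorator wraps functions non-invasively, every intermediate step of a multi-stage pipeline becomes observable without restructuring the pipeline itself, which is what makes this style of tracing cheap to adopt.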
The final results of the refactoring effort were impressive: accuracy improved from roughly 72% to 81.63%, and latency dropped by 84%. The team deployed the system on Replit, and the improvements enabled practical use across their integration channels (Slack, Discord, Zendesk).
The team’s experience yielded several important LLMOps lessons:
Evaluation as a Core Practice: The team learned the hard way that assuming refactoring won’t impact performance is dangerous. They strongly recommend making evaluation central to the development process and ensuring changes lead to measurable enhancements.
Evaluation Pipeline Performance: A slow evaluation pipeline becomes a bottleneck for experimentation. Investing time in optimizing the evaluation infrastructure paid significant dividends in iteration speed.
Non-Deterministic Evaluation: When evaluating LLM-based systems with an LLM as a judge, scores are not deterministic. The team recommends averaging across multiple evaluations while considering the costs of doing so.
Component Interactions: Changes to one component (like embedding models or LLM versions) can have non-linear and unexpected interactions with other components. What doesn’t work initially might work well later in combination with other changes, and vice versa.
Retaining Critical Components: During refactoring, it’s easy to accidentally remove critical components like few-shot prompts. Documenting and carefully tracking such configurations is essential to avoid unintentional performance degradation.
Iterative, Systematic Approach: The team recommends approaching refactoring iteratively with systematic testing and evaluation at each step to identify and address issues promptly rather than discovering major problems at the end.
This case study serves as a cautionary tale about the hidden complexity of LLM system refactoring while also demonstrating that systematic, evaluation-driven approaches can yield significant improvements in both accuracy and performance.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.
This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.