Weights & Biases details its evaluation-driven development approach to upgrading Wandbot to version 1.1, showcasing how systematic evaluation can guide improvements to an LLM application. The case study describes a sophisticated auto-evaluation framework aligned with human annotations, implementing comprehensive metrics covering both response quality and context assessment. Key improvements include enhanced data ingestion with better MarkdownX parsing, a query enhancement system that uses Cohere for language detection and intent classification, and a hybrid retrieval system combining FAISS, BM25, and web knowledge integration. The new version demonstrated significant gains across multiple metrics, with the GPT-4-1106-preview-based v1.1 configuration outperforming previous versions in answer correctness, relevancy, and context recall.
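
To make the hybrid retrieval idea concrete, the following is a minimal sketch of combining a dense FAISS retriever with a sparse BM25 retriever, in the spirit of the approach described above. It is an illustrative LangChain-based example rather than Wandbot's actual implementation; the sample corpus, ensemble weights, embedding model, and `k` values are placeholders chosen for demonstration.

```python
# Illustrative hybrid retrieval sketch (not Wandbot's actual code):
# dense vector search via FAISS fused with sparse BM25 keyword matching.
from langchain_community.vectorstores import FAISS
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_openai import OpenAIEmbeddings  # assumes OPENAI_API_KEY is set

# Placeholder documentation chunks standing in for ingested W&B docs.
docs = [
    "wandb.init() starts a new run for experiment tracking.",
    "Use wandb.log() to record metrics during training.",
    "Artifacts version datasets and models across runs.",
]

# Dense retriever: embed the chunks and index them with FAISS.
dense = FAISS.from_texts(docs, OpenAIEmbeddings()).as_retriever(
    search_kwargs={"k": 2}
)

# Sparse retriever: BM25 keyword matching over the same chunks.
sparse = BM25Retriever.from_texts(docs)
sparse.k = 2

# Fuse both result lists; the weights control each retriever's contribution.
hybrid = EnsembleRetriever(retrievers=[dense, sparse], weights=[0.5, 0.5])

for doc in hybrid.invoke("How do I log metrics?"):
    print(doc.page_content)
```

The design intuition is that dense retrieval captures semantic paraphrases of a user question while BM25 preserves exact keyword matches (e.g. API names like `wandb.log`), so fusing the two tends to improve recall over either alone.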