This case study details Weights & Biases' comprehensive evaluation of Wandbot, their production LLM system, which achieved a baseline accuracy of 66.67% in manual evaluation. The study offers valuable insights into LLMOps practices, demonstrating the importance of systematic evaluation, clear metrics, and expert annotation in production LLM systems. It highlights key challenges in areas such as language handling, retrieval accuracy, and hallucination prevention, and showcases practical solutions using tools like Argilla.io for annotation management. The findings emphasize the need for continuous improvement cycles and the critical role of high-quality documentation in LLM system performance, providing a practical template for other organizations deploying LLMs in production.
# Executive Summary
Weights & Biases (W&B) conducted a detailed evaluation of their production LLM system "Wandbot", a technical documentation assistant deployed across Discord, Slack, Zendesk, and ChatGPT. The study presents valuable insights into establishing quality metrics, conducting systematic evaluations, and improving production LLM systems through manual annotation and analysis.
## Use Case Overview
- Wandbot is W&B's technical support bot that helps users with documentation queries
- Deployed across multiple platforms (Discord, Slack, Zendesk, ChatGPT)
- Uses a RAG (Retrieval-Augmented Generation) architecture
- Retrieves relevant documentation chunks and uses an LLM to generate responses (a generic flow is sketched after this list)
- Primary goal: Provide accurate technical assistance to W&B users
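As a rough illustration of this flow, the sketch below shows a generic RAG loop; `embed`, `vector_store`, and `llm_complete` are hypothetical placeholders, not Wandbot's actual components.

```python
def answer_query(query: str, vector_store, embed, llm_complete, k: int = 5) -> str:
    """Generic RAG flow: embed the query, retrieve top-k chunks, generate an answer."""
    query_vector = embed(query)                      # dense embedding of the user query
    chunks = vector_store.search(query_vector, k=k)  # top-k documentation chunks

    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the user's question using only the documentation below.\n"
        "If the answer is not in the documentation, say so.\n\n"
        f"Documentation:\n{context}\n\nQuestion: {query}"
    )
    return llm_complete(prompt)                      # LLM produces the final response
```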
## Key Metrics & Results
- Manual evaluation accuracy score: 66.67% (88 out of 132 responses correct)
- Adjusted accuracy (excluding unsure cases): 73.3%
- Link hallucination rate: 10.6%
- Query relevancy rate: 88.6% (the computation of these metrics is sketched below)
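The sketch below shows how these rates could be derived from per-response annotation labels; the label names are illustrative rather than the study's exact schema.

```python
def evaluation_metrics(annotations: list[dict]) -> dict:
    """Aggregate per-response annotation labels into the reported rates."""
    total = len(annotations)  # 132 responses in the study
    correct = sum(a["accuracy"] == "correct" for a in annotations)
    unsure = sum(a["accuracy"] == "unsure" for a in annotations)
    hallucinated = sum(a["link_hallucinated"] for a in annotations)
    relevant = sum(a["query_relevant"] for a in annotations)

    return {
        "accuracy": correct / total,                      # 88 / 132 ≈ 66.67%
        "adjusted_accuracy": correct / (total - unsure),  # excludes "unsure" cases
        "link_hallucination_rate": hallucinated / total,
        "query_relevancy_rate": relevant / total,
    }
```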
## Evaluation Methodology
- Created gold-standard evaluation dataset with 132 real user queries
- Used [Argilla.io](http://argilla.io/) as the manual annotation platform (setup sketched after this list)
- Employed in-house Machine Learning Engineers as expert annotators
- Distributed the annotation load across annotators
  - Zero overlap between annotators, relying on their domain expertise
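A minimal setup sketch is shown below, assuming Argilla's 1.x `FeedbackDataset` API; the dataset, field, and question names are illustrative, and the three questions simply mirror the annotation criteria listed in the next section.

```python
import argilla as rg

# Assumed local Argilla deployment; replace with the real URL and API key.
rg.init(api_url="http://localhost:6900", api_key="admin.apikey")

# Placeholder pairs; the study used 132 real (query, response) pairs from production.
eval_pairs = [("How do I log an artifact?", "You can call wandb.log_artifact(...) ...")]

dataset = rg.FeedbackDataset(
    guidelines="Judge Wandbot's response to each real user query.",
    fields=[
        rg.TextField(name="query"),
        rg.TextField(name="response", use_markdown=True),
    ],
    questions=[
        rg.LabelQuestion(name="accuracy", labels=["correct", "incorrect", "unsure"]),
        rg.LabelQuestion(name="link_validity", labels=["valid", "hallucinated", "no_links"]),
        rg.LabelQuestion(name="query_relevance", labels=["relevant", "irrelevant"]),
        rg.TextQuestion(name="notes", required=False),
    ],
)

dataset.add_records(
    [rg.FeedbackRecord(fields={"query": q, "response": r}) for q, r in eval_pairs]
)
dataset.push_to_argilla(name="wandbot-eval", workspace="admin")
```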
## Annotation Criteria
- Response Accuracy: whether the answer was correct, incorrect, or the annotator was unsure
- Link Validity: whether links in the response point to real documentation or are hallucinated
- Query Relevance: whether the user query was relevant and in scope for the bot
## Major Issues Identified
- Wrong language issues
- Documentation limitations
- Query processing
- Out-of-scope handling
- Hallucination problems, including hallucinated links (a simple link-grounding check is sketched after this list)
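One simple heuristic for surfacing hallucinated links, sketched below, is to flag any URL in a generated response that does not appear in the retrieved source chunks; this is an illustration, not the check used in the study.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def ungrounded_links(response: str, retrieved_chunks: list[str]) -> list[str]:
    """Return links in the response that never appear in the retrieved sources."""
    source_urls = {url.rstrip(").,") for chunk in retrieved_chunks
                   for url in URL_PATTERN.findall(chunk)}
    response_urls = {url.rstrip(").,") for url in URL_PATTERN.findall(response)}
    return sorted(response_urls - source_urls)
```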
## Technical Implementation Details
- Used [Argilla.io](http://argilla.io/) for the annotation infrastructure (a sketch of reading completed annotations back out follows this list)
- Annotation interface features
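Completed annotations can then be pulled back out of Argilla to feed the metric computation shown earlier; the sketch below follows the 1.x `FeedbackDataset` API, and the exact response-access pattern may differ in newer Argilla versions.

```python
import argilla as rg  # assumes rg.init(...) from the setup sketch has already run

remote = rg.FeedbackDataset.from_argilla(name="wandbot-eval", workspace="admin")

labels = []
for record in remote.records:
    if not record.responses:              # skip records nobody has annotated yet
        continue
    values = record.responses[0].values   # one response per record (zero annotator overlap)
    labels.append({
        "accuracy": values["accuracy"].value,
        "link_hallucinated": values["link_validity"].value == "hallucinated",
        "query_relevant": values["query_relevance"].value == "relevant",
    })

# `labels` now feeds a metric function like the one in the Key Metrics section.
```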
## Improvement Areas Identified
- Retriever enhancement
- System prompting (an illustrative prompt revision is sketched after this list)
- Documentation
- Architecture
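As an illustration of the system-prompting direction, the sketch below shows the kind of instructions that target the issues identified above (hallucinated links, wrong-language replies, out-of-scope queries); the wording is hypothetical, not Wandbot's actual prompt.

```python
SYSTEM_PROMPT = """\
You are Wandbot, a technical support assistant for Weights & Biases.

- Answer only from the documentation excerpts provided in the context.
- Only include links that appear verbatim in the context; never invent URLs.
- Reply in the same language the user wrote their question in.
- If the question is unrelated to Weights & Biases, or the answer is not in the
  context, say you don't know and suggest contacting W&B support.
"""
```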
## Best Practices & Recommendations
- Use domain experts for annotation when possible
- Establish clear evaluation criteria before starting
- Run trial annotation sessions
- Keep annotation interface simple and intuitive
- Track both primary and meta metrics
- Document annotator notes for future improvements
- Use systematic approach to categorize issues
- Plan for continuous evaluation and improvement
## Tools & Infrastructure
- [Argilla.io](http://argilla.io/) chosen as the annotation platform
- Alternative tools considered
- Infrastructure decisions
## Future Directions
- Testing different embedding models
- Exploring alternative retrieval methods
- Refining annotation guidelines
- Implementing automated evaluation metrics (one candidate relevancy proxy is sketched after this list)
- Continuous documentation improvement
- Regular re-evaluation cycles
- Enhanced error categorization system
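One way the automated-metrics direction could be prototyped is an embedding-similarity proxy for query relevancy, which also makes it easy to compare embedding models; the sketch below is not a metric from the study, and the model name and threshold are assumptions.

```python
from sentence_transformers import SentenceTransformer

def relevancy_rate(queries: list[str], reference_docs: list[str],
                   model_name: str = "all-MiniLM-L6-v2",
                   threshold: float = 0.3) -> float:
    """Fraction of queries whose best match against the docs clears a similarity bar."""
    model = SentenceTransformer(model_name)
    q = model.encode(queries, normalize_embeddings=True)
    d = model.encode(reference_docs, normalize_embeddings=True)
    # Rows are queries, columns are docs; entries are cosine similarities.
    max_sim = (q @ d.T).max(axis=1)
    return float((max_sim >= threshold).mean())
```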
## Lessons for LLMOps Teams
- Importance of systematic evaluation
- Value of domain expert annotators
- Need for clear evaluation criteria
- Benefits of structured annotation process
- Significance of meta metrics
- Role of continuous improvement
- Balance between automation and human evaluation
- Documentation quality impact on LLM performance