Company
Dosu
Title
Evaluation Driven Development for LLM Reliability at Scale
Industry
Tech
Year
2024
Summary (short)
Dosu, a company providing an AI teammate for software development and maintenance, implemented Evaluation Driven Development (EDD) to ensure the reliability of their LLM-based product. As their system scaled to thousands of repositories, they integrated LangSmith for monitoring and evaluation, enabling them to identify failure modes, maintain quality, and continuously improve their AI assistant's performance through systematic testing and iteration.
This case study presents a detailed examination of how Dosu, a company that developed an AI teammate for software development and maintenance, implemented and scaled their LLMOps practices with a focus on reliability and continuous improvement. Dosu's core product is an AI assistant designed to help software maintainers and developers by handling support tasks, issue triage, and general development assistance. The company arose from a common challenge in software development: maintainers spend excessive time on support rather than feature development, with industry statistics suggesting that up to 85% of developers' time is consumed by non-coding tasks. The case study primarily focuses on their journey to ensure reliability at scale through Evaluation Driven Development (EDD), a systematic approach to LLM application development and deployment. Here's a detailed breakdown of their LLMOps journey:

Early Development Phase:
Initially, Dosu's approach to quality assurance was manual and labor-intensive, relying on basic tools like grep and print statements to monitor the system's performance. While this approach was only manageable at low volume, it helped them establish a fundamental understanding of:
* User interaction patterns
* Areas of strong performance
* Common failure points
* The impact of prompt modifications on system behavior

This initial phase revealed a critical challenge in LLM application development: the non-linear relationship between changes and outcomes. They discovered that minor prompt adjustments could lead to improvements in some areas while causing regressions in others.

Evaluation Driven Development Implementation:
To address these challenges, Dosu developed an EDD framework consisting of:
* Creation of initial evaluation benchmarks for new features
* Controlled release to users
* Production monitoring for failure modes
* Integration of discovered failure cases into offline evaluations
* Iterative improvement based on expanded test cases
* Controlled redeployment

Scaling Challenges:
As Dosu grew to serve thousands of repositories, their initial monitoring approaches became inadequate. This led them to seek more sophisticated monitoring solutions with specific requirements:
* Version control integration for prompts
* Code-level tracing capabilities
* Data export functionality
* Customization and extensibility options

LangSmith Integration:
After evaluating various options, Dosu implemented LangSmith as their monitoring and evaluation platform. Key aspects of the implementation included:
* Simple integration through decorator patterns (@traceable), as sketched below
* Combined function and LLM call tracing
* Comprehensive visibility into system activity
* Advanced search capabilities for failure detection

Failure Mode Detection:
Dosu developed a multi-faceted approach to identifying system failures, looking at:
* Direct user feedback (thumbs up/down)
* User sentiment analysis in GitHub interactions
* Technical error monitoring
* Response time analysis
* Custom metadata tracking

This setup helped identify unexpected failure modes, such as performance issues with large log files and embedding data, as well as occasional off-topic responses (like the amusing case of the system discussing concert plans instead of handling pull request labeling).
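The decorator-based tracing pattern described above can be illustrated with a short, hedged sketch using the langsmith Python SDK. The project name, function names, and triage logic below are hypothetical placeholders rather than Dosu's actual code, and the exact environment variable names can vary slightly across SDK versions.

```python
import os

from langsmith import traceable

# Tracing is configured via environment variables; values here are placeholders.
os.environ.setdefault("LANGCHAIN_TRACING_V2", "true")
os.environ.setdefault("LANGCHAIN_API_KEY", "<your-langsmith-api-key>")
os.environ.setdefault("LANGCHAIN_PROJECT", "issue-triage")  # hypothetical project name


@traceable(run_type="llm", name="classify_with_llm")
def classify_with_llm(title: str, body: str) -> str:
    # Placeholder for the real model call (e.g., an OpenAI or Anthropic client);
    # because it is decorated, it appears nested inside the parent trace.
    return "bug" if "error" in body.lower() else "question"


@traceable(name="triage_issue")
def triage_issue(issue_title: str, issue_body: str) -> str:
    """Top-level entrypoint: plain Python logic and LLM calls land in one trace."""
    return classify_with_llm(issue_title, issue_body)


if __name__ == "__main__":
    print(triage_issue("App crashes on start", "The log shows a null pointer error"))
    # User feedback (e.g., a GitHub thumbs-down) can later be attached to the
    # corresponding run so failures become searchable, for example:
    #   langsmith.Client().create_feedback(run_id=<run id>, key="user_score", score=0)
```

Wrapping both the top-level function and the inner model call gives the combined function and LLM call tracing mentioned above, and feedback keys such as the hypothetical "user_score" make negatively rated runs easy to filter for later.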
Continuous Improvement Process:
Their current LLMOps workflow involves:
* Using LangSmith to identify problematic patterns
* Collecting similar examples through advanced search
* Expanding evaluation datasets (see the dataset and evaluation sketch below)
* Iterating on improvements
* Controlled deployment of updates

Future Developments:
Dosu is working on advancing their LLMOps practices by:
* Automating evaluation dataset collection from production traffic
* Developing custom dataset curation tools
* Creating segment-specific evaluation criteria
* Building feedback loops between development and deployment

The case study demonstrates the evolution from manual monitoring to sophisticated LLMOps practices, highlighting the importance of systematic evaluation and monitoring in building reliable LLM applications. It shows how proper tooling and processes can help manage the complexity of LLM-based systems at scale while maintaining quality and enabling continuous improvement. A particularly noteworthy aspect is their treatment of prompts as code, requiring version control and systematic testing, which represents a mature approach to LLMOps. The case also illustrates the importance of having multiple feedback mechanisms and the value of automated monitoring tools in scaling LLM applications effectively.
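To make the identify-collect-evaluate loop above concrete, here is a hedged sketch of how flagged production runs could be folded into an offline evaluation dataset with the langsmith SDK. The project name, dataset name, filter string, evaluator, and `triage_issue` stand-in are illustrative assumptions, not Dosu's internal tooling, and the exact API surface may differ by SDK version.

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()  # assumes a LangSmith API key is configured in the environment


def triage_issue(issue_title: str, issue_body: str) -> str:
    """Stand-in for the application entrypoint from the earlier tracing sketch."""
    return "bug" if "error" in issue_body.lower() else "question"


# 1. Pull production runs that users flagged with negative feedback.
#    The filter string follows LangSmith's run-query syntax and is illustrative.
flagged_runs = client.list_runs(
    project_name="issue-triage",
    filter='and(eq(feedback_key, "user_score"), eq(feedback_score, 0))',
)

# 2. Fold the failures into an offline evaluation dataset
#    (created once here; appended to on later iterations).
dataset = client.create_dataset("triage-regressions")
for run in flagged_runs:
    client.create_example(
        inputs=run.inputs,
        outputs=run.outputs,  # ideally replaced with a reviewed reference answer
        dataset_id=dataset.id,
    )


# 3. Re-run the updated application against the dataset before redeploying.
def label_match(run, example) -> dict:
    """Toy evaluator: does the new version reproduce the reference label?"""
    predicted = (run.outputs or {}).get("output")
    expected = (example.outputs or {}).get("output")
    return {"key": "label_match", "score": int(predicted == expected)}


evaluate(
    lambda inputs: {"output": triage_issue(**inputs)},
    data="triage-regressions",
    evaluators=[label_match],
    experiment_prefix="triage-prompt-v2",
)
```

Each pass through this loop grows the offline dataset with real failure cases, which is the same mechanism Dosu describes for catching regressions before a controlled redeployment.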
