Company
Coda
Title
Building a Systematic LLM Evaluation Framework from Scratch
Industry
Tech
Year
2023
Summary (short)
Coda's journey in developing a robust LLM evaluation framework, evolving from manual playground testing to a comprehensive automated system. The team faced challenges with model upgrades changing prompt behavior, which led them to build a systematic approach combining automated checks with human oversight. They progressed through multiple phases using different tools (OpenAI Playground, Coda itself, Vellum, and Braintrust), ultimately achieving scalable evaluation that runs 500+ automated evaluation jobs weekly, up from 25 manual evaluations initially.
This case study explores Coda's journey in developing and scaling their LLM evaluation framework throughout 2023, offering valuable insights into the challenges and solutions of running LLMs in production. Coda is a workspace platform that combines the flexibility of documents with the structure of spreadsheets and the capabilities of applications, enhanced with AI features. Their AI implementation includes content-generation assistants and a RAG-based system for querying documents, and the case study focuses on how they evolved robust evaluation systems for these features.

The journey began when Coda hit a significant problem during the transition from OpenAI's Davinci model to GPT-3.5 Turbo in March 2023: their existing prompts no longer behaved as expected, with responses becoming conversational rather than instructional. This unexpected behavior change highlighted the need for a systematic approach to evaluating LLMs in production.

Their evaluation framework developed through four distinct phases:

Phase 1 - OpenAI Playground: Initially, they used OpenAI's playground for experimentation and iteration. While this approach was accessible to non-technical team members, it did not scale beyond about 10 data points because of manual copy-paste work, and it lacked proper tracking capabilities.

Phase 2 - Coda-Based Solution: They built an internal evaluation system on Coda's own platform, using tables for benchmarks and automated response fetching. This eliminated the manual work but became difficult to maintain as conversation-based models emerged and required more complex prompt structures.

Phase 3 - Vellum Integration: To address the maintenance burden, they adopted Vellum, which provided better dataset and prompt management and made it easier to collaborate on and share evaluations across the team. As their evaluation needs grew, however, they ran into limitations, particularly the disconnect between playground testing and actual application behavior.

Phase 4 - Braintrust Implementation: The final phase involved adopting Braintrust for evaluation at scale, providing robust APIs and visualization capabilities without requiring SQL expertise. This let them track AI quality trends over time and investigate specific issues efficiently.

Key Technical Lessons:

1. Code Integration (see the sketch after this list):
* Evaluation systems should closely mirror production environments
* Testing should happen in the same programming language as the application
* Playground testing alone is insufficient for production quality assurance

2. Human-in-the-Loop Necessity:
* Automated checks are crucial but cannot completely replace human oversight
* Human reviewers help identify unexpected edge cases and failure modes
* They implemented a feedback loop in which human observations inform the development of new automated checks

3. Benchmark Dataset Management:
* Started with 25 examples and grew to over 5,000 data points
* Sources include PM and designer input for core use cases, alpha/beta user feedback, customer-facing team insights, and internal usage data from development environments
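To make the code-integration lesson concrete, here is a minimal sketch of what a check harness that reuses production prompt code could look like. It is illustrative only: the `build_assistant_messages` helper, the JSONL benchmark format, and the specific checks are assumptions for this sketch, not Coda's actual implementation.

```python
# Minimal sketch (not Coda's actual code) of an automated check harness that reuses the
# application's own prompt-construction logic, so evaluation exercises the same code path
# that production traffic does.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_assistant_messages(user_input: str) -> list[dict]:
    # In a real setup this function would be imported from the production codebase
    # (the "evaluate in the same language and code as the app" lesson),
    # not re-implemented inside the eval harness.
    return [
        {"role": "system", "content": "Rewrite the user's text as concise bullet points."},
        {"role": "user", "content": user_input},
    ]


def not_conversational(output: str) -> bool:
    # Guards against the regression seen in the Davinci -> GPT-3.5 Turbo move:
    # replies that chat back ("Sure, I'd be happy to...") instead of just doing the task.
    return not output.strip().lower().startswith(("sure", "of course", "i'd be happy"))


def non_empty(output: str) -> bool:
    return bool(output.strip())


CHECKS = {"non_empty": non_empty, "not_conversational": not_conversational}


def run_benchmark(path: str, model: str = "gpt-3.5-turbo") -> list[dict]:
    """Run every benchmark example (one JSON object per line) through the model
    and record which automated checks pass."""
    results = []
    with open(path) as f:
        examples = [json.loads(line) for line in f]
    for example in examples:
        messages = build_assistant_messages(example["input"])
        response = client.chat.completions.create(model=model, messages=messages)
        text = response.choices[0].message.content
        results.append({
            "id": example["id"],
            "output": text,
            "checks": {name: check(text) for name, check in CHECKS.items()},
        })
    return results
```

The design point the lessons emphasize is the import boundary: the harness should call the same prompt-construction code the application ships, rather than re-creating prompts in a playground.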
Technical Implementation Details:
* Integration with Snowflake for data storage
* Mode for initial reporting capabilities
* The Braintrust API for result visualization and tracking
* Automated checks running across 500+ evaluation jobs weekly
* A combination of automated and manual evaluation processes

Results and Metrics:
* Scaled from 1 to 5+ engineers working on AI capabilities
* Grew from 25 to 500+ weekly evaluation jobs
* Implemented 50+ automated checks
* Support for 15+ AI-powered features
* A significant reduction in manual evaluation work

The case study emphasizes the importance of starting evaluation early, even with a small dataset, and gradually building a more comprehensive testing framework. The team recommends keeping evaluation close to production code, maintaining human oversight, and investing in quality benchmark datasets.

An interesting technical insight is their approach to developing automated checks: comparing outputs from different models and vendors and using the observed preferences and differences to establish evaluation criteria. This practical approach helps teams identify which aspects of the output should be automatically verified (a sketch of this workflow follows below).

The framework they developed allows them to deploy model updates and prompt changes with confidence while maintaining consistent output quality. Their experience shows that successful LLM operations require a combination of automated testing, human oversight, and comprehensive benchmark data, all integrated closely with the production environment.
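As a rough illustration of that comparison-driven workflow, the sketch below runs the same benchmark inputs through two models and produces a side-by-side sheet for a human reviewer; the recurring reasons reviewers give for preferring one output over another become candidates for new automated checks. The model names, file formats, and helper functions are assumptions for illustration, not details taken from the case study.

```python
# Hypothetical sketch of mining automated-check ideas by comparing models/vendors.
import csv
import json
from openai import OpenAI

client = OpenAI()
MODELS = ["gpt-3.5-turbo", "gpt-4"]  # could just as well span different vendors


def generate(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def build_comparison_sheet(benchmark_path: str, out_path: str) -> None:
    """Write a CSV for human review; the filled-in 'reason' column is the raw
    material for deciding what to verify automatically (tone, format, length, ...)."""
    with open(benchmark_path) as f:
        examples = [json.loads(line) for line in f]
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "prompt", *MODELS, "preferred_model", "reason"])
        for example in examples:
            outputs = [generate(model, example["input"]) for model in MODELS]
            writer.writerow([example["id"], example["input"], *outputs, "", ""])
```

Once a reason shows up repeatedly in review (for example, "too conversational"), it can be promoted into a check in an automated harness like the one sketched earlier.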
