This case study explores how Notion, a leading workspace collaboration platform, developed and evolved their AI feature evaluation system to support their growing suite of AI-powered capabilities. The company's journey into AI began early, with experiments starting after GPT-2's release in 2019, leading to the successful launch of several AI features including their writing assistant, AI Autofill, and Q&A functionality.
At its core is Notion's transformation of its AI evaluation process, which proved crucial for maintaining high-quality AI features at scale. That transformation marks a significant evolution in the company's LLMOps practice and offers useful lessons for organizations looking to build robust AI evaluation systems of their own.
Initially, Notion faced several challenges with their evaluation workflow:
* Their test datasets were managed as JSONL files in git repositories, which proved unwieldy for collaboration and versioning (a sketch of that setup follows this list)
* Their evaluation process relied heavily on manual human review, creating a slow and expensive feedback loop
* The complexity of their AI features, particularly the Q&A system which needed to handle unstructured input and output, demanded more sophisticated evaluation methods
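For context, the git-based setup amounted to little more than the following. The file path and field names are illustrative assumptions, not Notion's actual schema.

```python
import json
from pathlib import Path

# Hypothetical layout: one JSON object per line, checked into the same
# repository as the prompt and application code.
EVAL_FILE = Path("evals/qa_cases.jsonl")  # illustrative path

def load_cases(path: Path) -> list[dict]:
    """Parse a JSONL file of test cases shaped like {"input": ..., "expected": ...}."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]

cases = load_cases(EVAL_FILE)
print(f"Loaded {len(cases)} eval cases from a git-tracked JSONL file")
```

Every change to such a file means a commit and a review of long single-line JSON diffs, which is exactly where the collaboration and versioning friction came from.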
The company partnered with Braintrust to develop a more scalable and efficient evaluation system. The new workflow demonstrates several key LLMOps best practices:
Data Management and Versioning:
* Instead of managing test data through raw JSONL files, they implemented a structured system for dataset curation and versioning
* They maintain hundreds of evaluation datasets, continuously growing their test coverage
* Test cases come from both real-world usage logs and manually crafted examples, ensuring comprehensive coverage of edge cases and common scenarios (the curation sketch below shows the general shape)
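A minimal sketch of what that curation can look like, following the pattern of Braintrust's Python dataset API; the project and dataset names, record fields, and example content are illustrative assumptions rather than Notion's actual setup.

```python
from braintrust import init_dataset

# Hypothetical project and dataset names.
dataset = init_dataset(project="notion-qa-evals", name="qa-golden-set")

# A hand-crafted edge case targeting multi-document retrieval.
dataset.insert(
    input={"question": "What did we decide about Q3 pricing?", "workspace": "acme"},
    expected={"answer_mentions": ["pricing review", "Q3"]},
    metadata={"source": "manual", "tag": "multi-doc-retrieval"},
)

# An example promoted from production logs (assumed already scrubbed of
# anything sensitive before it is added to the dataset).
dataset.insert(
    input={"question": "Who owns the launch checklist?", "workspace": "acme"},
    expected={"answer_mentions": ["launch checklist"]},
    metadata={"source": "production-log", "tag": "ownership-question"},
)
```

Because every record is stored and versioned by the platform rather than hand-edited in a JSONL file, the datasets can grow continuously without the merge-conflict overhead of the old workflow.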
Evaluation Framework:
* They implemented a multi-faceted evaluation approach combining heuristic scorers, LLM-as-judge evaluations, and human review
* Custom scoring functions were developed to assess aspects such as tool usage, factual accuracy, hallucination detection, and recall (a sketch of such scorers follows this list)
* The system allows for immediate assessment of performance changes and regression analysis
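A hedged sketch of how a heuristic scorer and an LLM-as-judge scorer can be combined in a single evaluation, using the pattern from Braintrust's open-source SDK and the autoevals library. The project name, dataset, and task function are stand-ins, not Notion's production pipeline, and Factuality needs an LLM provider key to run.

```python
from braintrust import Eval
from autoevals import Factuality  # LLM-as-judge scorer

def answer_qa(question: str) -> str:
    """Stand-in for the real Q&A pipeline (retrieval + generation)."""
    return "Placeholder answer for: " + question

def non_empty_answer(input, output, expected) -> float:
    """Heuristic scorer: fail any output that comes back empty."""
    return 1.0 if output and output.strip() else 0.0

Eval(
    "notion-qa-evals",  # hypothetical project name
    data=lambda: [
        {
            "input": "Where are the 2024 OKRs documented?",
            "expected": "They are on the 'Company OKRs 2024' page.",
        },
    ],
    task=answer_qa,
    scores=[non_empty_answer, Factuality],  # heuristic + LLM-as-judge
)
```

Human review then concentrates on the cases these automated scorers flag, rather than on every output.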
Continuous Improvement Cycle:
* The team implemented a rapid iteration cycle for improvements
* Each improvement cycle begins with a clear objective, whether adding new features or addressing user feedback
* The system enables quick comparison of different experiments through side-by-side output diffing (a minimal diff sketch follows this list)
* Results can be analyzed both at a high level for overall performance and drilled down for specific improvements or regressions
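The evaluation platform handles this comparison natively in its UI; purely to illustrate the idea, here is a framework-agnostic sketch that diffs per-case scores from two experiment runs, with invented numbers.

```python
# Hypothetical per-case scores keyed by test-case id; in practice these would
# be pulled from the evaluation tool's API or exported results.
baseline  = {"case-001": 0.92, "case-002": 0.40, "case-003": 1.00}
candidate = {"case-001": 0.95, "case-002": 0.85, "case-003": 0.70}

def diff_experiments(old: dict, new: dict, tolerance: float = 0.05):
    """Return (improved, regressed) cases, ignoring changes within tolerance."""
    improved, regressed = [], []
    for case_id in old.keys() & new.keys():
        delta = new[case_id] - old[case_id]
        if delta > tolerance:
            improved.append((case_id, round(delta, 2)))
        elif delta < -tolerance:
            regressed.append((case_id, round(delta, 2)))
    return improved, regressed

improved, regressed = diff_experiments(baseline, candidate)
print("improved:", improved)    # case-002 got much better
print("regressed:", regressed)  # case-003 needs a closer look
```

The same aggregate-versus-drill-down view is what lets the team confirm an overall win while still catching the individual cases that got worse.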
Results and Metrics:
* The new system dramatically improved their issue resolution capacity from 3 to 30 issues per day
* This increased efficiency enabled faster deployment of new AI features
* The improved evaluation system supported the successful launch of multiple AI products, including their Q&A feature and workspace search
Technical Implementation Details:
* The system integrates with their production environment to automatically log real-world examples (an outline of that loop appears after this list)
* It supports flexible definition of custom scoring functions for different types of evaluations
* The infrastructure allows for parallel evaluation of multiple experiments
* Results are stored and versioned, enabling historical comparison and trend analysis
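In outline, the production-to-dataset loop might look like the following; every name here is a hypothetical placeholder for whatever logging and curation hooks are actually in place, and a local JSONL file stands in for a real logging service.

```python
import json
import time
import uuid

def log_interaction(question: str, answer: str, feedback: str | None) -> None:
    """Append a production Q&A interaction to a log that curators review later.

    A real system would send this to a logging backend and scrub sensitive
    content; a local file keeps the sketch self-contained.
    """
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "input": {"question": question},
        "output": {"answer": answer},
        "user_feedback": feedback,  # e.g. a thumbs-down raises curation priority
    }
    with open("production_qa_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

# A separate curation job filters this log (for example, on negative feedback)
# and promotes selected records into the versioned evaluation datasets.
log_interaction("Where is the onboarding doc?", "See the 'New Hire Onboarding' page.", None)
```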
Risk Management and Quality Assurance:
* The system includes safeguards against regressions through comprehensive testing (a simple release gate is sketched after this list)
* It enables systematic investigation of failures and edge cases
* The evaluation process helps identify potential issues before they reach production
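One straightforward way to enforce such a safeguard is a release gate that compares aggregate scores against the current baseline. This is a generic sketch, not Notion's actual mechanism, and the metrics and threshold are arbitrary illustrations.

```python
import sys

def regression_gate(baseline: dict[str, float],
                    candidate: dict[str, float],
                    max_drop: float = 0.02) -> bool:
    """Return False if any tracked metric drops by more than max_drop."""
    ok = True
    for metric, base_score in baseline.items():
        cand_score = candidate.get(metric, 0.0)
        if base_score - cand_score > max_drop:
            print(f"REGRESSION: {metric} fell from {base_score:.2f} to {cand_score:.2f}")
            ok = False
    return ok

# Illustrative numbers only.
baseline_scores  = {"factuality": 0.91, "recall": 0.84, "hallucination_free": 0.97}
candidate_scores = {"factuality": 0.92, "recall": 0.79, "hallucination_free": 0.97}

if not regression_gate(baseline_scores, candidate_scores):
    sys.exit(1)  # block the release until the regression is investigated
```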
One particularly noteworthy aspect of Notion's approach is their emphasis on narrowly defined, focused evaluations. Rather than trying to create all-encompassing tests, they break down their evaluation criteria into specific, well-defined components. This approach allows for more precise identification of issues and more targeted improvements.
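In code, that philosophy tends to look like many small, single-purpose scorers rather than one catch-all quality score. The checks below are invented examples of the kinds of behaviors such focused scorers could test, not Notion's actual criteria.

```python
# Each scorer answers exactly one question about the output.

def cites_a_source(input, output, expected) -> float:
    """Did the answer reference at least one workspace page?"""
    return 1.0 if "[[" in output else 0.0  # assumes a [[Page]] citation syntax

def stays_on_topic(input, output, expected) -> float:
    """Crude lexical overlap between the question and the answer."""
    shared = set(input.lower().split()) & set(output.lower().split())
    return 1.0 if shared else 0.0

def refuses_when_unanswerable(input, output, expected) -> float:
    """If the expected behavior is a refusal, did the model actually refuse?"""
    if expected != "REFUSE":
        return 1.0  # not applicable to this case
    return 1.0 if "don't have enough information" in output.lower() else 0.0
```

A failing score then points at one specific behavior instead of a vague drop in an all-encompassing quality metric.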
Looking forward, Notion's evaluation system appears well positioned to support future AI feature development. The scalability and flexibility of the new workflow suggest it can adapt to new kinds of features and to the growing complexity of Notion's AI offerings.
This case study demonstrates the critical importance of robust evaluation systems in AI product development. It shows how investment in LLMOps infrastructure, particularly in testing and evaluation, can dramatically improve development velocity and product quality. The transformation from a manual, limited evaluation process to a sophisticated, automated system represents a mature approach to AI development that other organizations can learn from.