Software Engineering

The Evaluation Playbook: Making LLMs Production-Ready

Alex Strick van Linschoten
Dec 14, 2024
7 mins

As large language models (LLMs) increasingly power production applications, robust evaluation and quality assurance practices have never been more critical. LLM evaluation, a cornerstone of LLMOps, presents unique challenges compared to traditional machine learning evaluation due to the inherent subjectivity of quality judgments, the frequent lack of ground truth, and the non-deterministic nature of LLM outputs.

In this blog post, we dive deep into the practical lessons learned from real-world case studies in our LLMOps Database, showcasing how industry leaders are tackling the complexities of LLM evaluation head-on. From defining clear metrics to combining automated and human evaluation strategies, these insights provide a roadmap for ensuring the reliability and performance of LLM applications at scale.

All the posts in this series include NotebookLM podcast 'summaries' that capture the main themes of each post's focus. Today's blog is about evaluation in production, so this podcast focuses on some of the core case studies and how specific companies developed and deployed applications where evaluation was a core concern.

To learn more about the database and how it was constructed, read the launch blog. If you're interested in an overview of the key themes that emerge from the database as a whole, read this post. To see all the other posts in the series, click here. What follows is a slice of the database focused on how evaluation shows up in its production applications.

The Shift from Traditional ML to LLM Evaluation

The transition from traditional machine learning evaluation to LLM evaluation is not just a technical shift, but a philosophical one. As highlighted in the "From MVP to Production" panel discussion, this transition brings a host of new challenges:

  • Ambiguous objective functions and lack of clear datasets
  • Difficulty measuring the quality of generated content
  • Need for domain-specific evaluation criteria

Ellipsis's 15-month journey building LLM agents further emphasizes the unique challenges of LLM evaluation, such as dealing with prompt brittleness, managing complex agent compositions, and handling error cases.

Defining Clear Evaluation Metrics 📏

A recurring theme across the case studies is the importance of establishing clear, measurable evaluation metrics aligned with business goals. Canva's systematic LLM evaluation framework is a prime example of this principle in action. By codifying success criteria into quantitative metrics across dimensions like information preservation, intent alignment, and format consistency, Canva created a robust foundation for assessing the performance of their Magic Switch feature.

Figure: Two types of LLM evaluators, rule-based for simple metrics and LLM-based for complex criteria.
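To make the idea of codifying success criteria into quantitative metrics concrete, here is a minimal sketch of one way to structure weighted, per-dimension scores. The dimension names mirror those mentioned above, but the data structure, weights, and values are purely illustrative, not Canva's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class CriterionScore:
    name: str
    score: float       # normalized to [0, 1] by whichever evaluator produced it
    weight: float = 1.0

def overall_quality(scores: list[CriterionScore]) -> float:
    """Weighted average across evaluation dimensions."""
    total_weight = sum(s.weight for s in scores)
    return sum(s.score * s.weight for s in scores) / total_weight

# Illustrative scores for a single generated output
scores = [
    CriterionScore("information_preservation", 0.9, weight=2.0),
    CriterionScore("intent_alignment", 0.8),
    CriterionScore("format_consistency", 1.0),
]
print(f"overall quality: {overall_quality(scores):.2f}")  # 0.90
```

Scores like these can be produced by rule-based checks or by an LLM judge, the two evaluator types the figure above distinguishes.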

Other notable examples of metric-driven evaluation include:

  • Nextdoor and OLX's use of business-focused metrics like conversion rates and customer satisfaction
  • Weights & Biases' holistic evaluation of Wandbot across LLM-specific metrics including accuracy, fluency, and hallucination rate
  • Microsoft's incident management evaluation, which paired reference-based metrics like BLEU-4 and ROUGE-L with direct feedback from on-call engineers on the usefulness of the LLM's recommendations in real-world incident scenarios
  • Perplexity AI's composite metrics for its search engine, and Assembled's implicit composite approach, which balanced the time saved by developers against the quality and comprehensiveness of the generated tests, aiming for greater engineering velocity without compromising code quality

Tools like Weights & Biases play a crucial role in enabling the tracking and analysis of these diverse evaluation metrics. For teams evaluating different tooling options in this space, our detailed comparison "LLM Evaluation & Prompt Tracking Showdown" provides valuable insights into the current landscape.

Consider Nextdoor's innovative use of AI-generated email subject lines. They didn't just rely on abstract quality metrics; they directly measured the impact on user engagement, tracking sessions, weekly active users, and even ad revenue. This direct connection to business outcomes ensured that their evaluation efforts were focused on what truly mattered for the product. Similarly, OLX, when evaluating their LLM-powered job role extraction system, focused on "successful events" metrics, demonstrating a clear alignment between LLM performance and business value.

While traditional metrics like accuracy are still relevant, LLMOps often requires more nuanced evaluation measures. Weights & Biases, in their evaluation of Wandbot, tracked not only answer correctness but also relevancy, factfulness, and similarity to ensure their documentation chatbot provided helpful and accurate information. Microsoft, in their work on cloud incident management, used a range of metrics including BLEU-4, ROUGE-L, and even human evaluation scores from on-call engineers to assess the quality and usefulness of LLM-generated recommendations. Choosing the right LLM-specific metrics is crucial for gaining a holistic understanding of model performance.
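For reference-based metrics like the ones Microsoft reports, off-the-shelf libraries are usually enough. The snippet below is a minimal sketch using the standard `nltk` and `rouge-score` packages; it is not tied to Microsoft's actual pipeline, and the example strings are invented.

```python
# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The outage was caused by an expired TLS certificate on the gateway."
candidate = "An expired TLS certificate on the gateway caused the outage."

# BLEU-4: modified n-gram precision up to 4-grams, smoothed for short texts
bleu4 = sentence_bleu(
    [reference.split()],
    candidate.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: longest-common-subsequence overlap with the reference
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU-4: {bleu4:.3f}  ROUGE-L F1: {rouge_l:.3f}")
```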

Automated Evaluation Techniques 🤖

Automated evaluation techniques offer scalability and consistency, but they come with limitations of their own. LLM-based evaluation, where one model assesses the output of another, has emerged as a particularly powerful approach. SumUp's financial crime report evaluation, Canva's content generation assessment, and Perplexity's answer ranking all leverage this "LLM as judge" paradigm.

Figure: Workflow of an LLM-driven evaluation system for narrative text generation in financial crime reporting. The generated text and a reference text are passed to an LLM evaluator that runs six benchmark checks (topic coverage, customer profile data validation, supporting facts verification, fact authenticity verification, text structure analysis, and conclusion assessment), each producing a 0-5 score and an explanation that feed into a final evaluation report.

While many companies leverage traditional NLP techniques, Alaska Airlines demonstrates a more cutting-edge approach, utilizing Google Cloud's Gemini LLM for semantic search in their natural language flight search. This allowed them to handle complex, multi-constraint queries that traditional keyword-based systems struggled with.

For more complex LLM agents, unit and integration testing become essential. Ellipsis, Rexera (using LangSmith), and Replit showcase the importance of comprehensive testing frameworks to ensure the reliability of agent-based systems.
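A minimal pytest-style sketch of what such tests can look like is below. `run_agent` is a hypothetical, deterministic stand-in for a real agent so the example stays self-contained, and the tool names are invented.

```python
# test_agent.py
import pytest

def run_agent(task: str) -> dict:
    """Hypothetical stand-in for the agent under test; replace with a real call."""
    tool = "create_pull_request" if "PR" in task else "fetch_ci_logs"
    return {"tool_calls": [{"name": tool}], "final_answer": f"Done: {task}"}

@pytest.mark.parametrize("task,expected_tool", [
    ("open a PR that bumps the dependency", "create_pull_request"),
    ("summarize the failing CI logs", "fetch_ci_logs"),
])
def test_agent_selects_expected_tool(task, expected_tool):
    # Unit-level check: does the agent pick a sensible first tool for the task?
    result = run_agent(task)
    assert result["tool_calls"][0]["name"] == expected_tool

def test_agent_output_is_well_formed():
    # Integration-level check: the agent returns the contract downstream code expects.
    result = run_agent("summarize the failing CI logs")
    assert {"tool_calls", "final_answer"} <= set(result)
    assert result["final_answer"].strip()
```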

SumUp's approach to evaluating financial crime reports highlights the ingenuity of LLM-based evaluation. They used a separate LLM as a 'judge,' providing it with a set of benchmark checks and a scoring rubric. This allowed them to automate the evaluation of complex, unstructured text outputs, significantly reducing manual review effort. Similarly, Canva leveraged LLM evaluators to assess the quality of their AI-generated content, focusing on aspects like intent alignment, coherence, and tone. This approach allowed them to scale their evaluation process and ensure consistent quality across a large volume of generated content.
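As a rough sketch of the "LLM as judge" pattern described above: a grading prompt encodes the benchmark checks and scoring rubric, and the judge returns structured scores. The check names are borrowed from the SumUp description, while `call_llm` is a hypothetical placeholder for whatever chat model serves as the judge.

```python
import json

RUBRIC = """You are grading a generated report against a reference report.
Score each check from 0 (fails) to 5 (fully satisfies) and answer with a single
JSON object: {"topic_coverage": int, "supporting_facts": int,
"text_structure": int, "conclusion": int, "explanation": str}"""

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a judge-model call (hosted API or open weights)."""
    raise NotImplementedError

def judge(generated: str, reference: str) -> dict:
    prompt = f"{RUBRIC}\n\nREFERENCE:\n{reference}\n\nGENERATED:\n{generated}"
    # Expect the judge to reply with the JSON object described in the rubric.
    return json.loads(call_llm(prompt))

# Usage: scores = judge(generated_report, reference_report)
# then aggregate or threshold the per-check scores, and spot-check the explanations.
```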

While LLMs can be powerful evaluators, simpler heuristics can also be effective, especially for specific tasks. Stripe, for example, developed a 'match rate' metric to quickly assess the similarity between LLM-generated support responses and actual agent replies. This provided a cost-effective way to monitor the system's performance in real-time without relying on expensive LLM calls for every evaluation.
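Stripe has not published the exact definition of its match rate, but a cheap lexical version is easy to sketch: compare each generated response to the human agent's actual reply and count the fraction that clear a similarity threshold. The threshold and example pairs below are invented.

```python
from difflib import SequenceMatcher

def is_match(generated: str, agent_reply: str, threshold: float = 0.8) -> bool:
    """Cheap lexical similarity; embedding-based similarity is a natural upgrade."""
    ratio = SequenceMatcher(None, generated.lower(), agent_reply.lower()).ratio()
    return ratio >= threshold

def match_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of generated responses that closely match the agent's actual reply."""
    if not pairs:
        return 0.0
    return sum(is_match(gen, ref) for gen, ref in pairs) / len(pairs)

pairs = [
    ("You can update your card in the dashboard under Billing.",
     "You can update your card from the Billing tab in the dashboard."),
    ("Refunds take 5-10 business days.",
     "Refunds usually post within 5-10 business days, depending on your bank."),
]
print(f"match rate: {match_rate(pairs):.0%}")
```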

Human Evaluation Strategies 👥

While automated evaluation offers scale, human judgment remains irreplaceable in assessing the nuanced aspects of LLM outputs. As emphasized in the "LLM in Production Round Table" discussion, human evaluation takes different forms depending on the use case and domain.

Expert annotation, involving domain specialists, is crucial for high-stakes applications. Digits' question generation system, built using TensorFlow Extended (TFX), and Grammarly's CoEdit, which leverages expert linguists, highlight the role of subject matter expertise in evaluation.

Crowd-sourced evaluation, as mentioned in the "Building Product Copilots" case study, offers a more scalable approach to human assessment. By leveraging large, diverse pools of annotators, crowd-sourcing enables more comprehensive evaluation of LLM outputs.

User feedback, the most direct form of human evaluation, is central to systems like Instacart's AI Assistant and Holiday Extras' ChatGPT Enterprise deployment. By continuously incorporating real-world user input, these systems can adapt and improve over time.

Tools like Argilla, used by Weights & Biases, streamline the management of human evaluation workflows, enabling more efficient annotation and analysis.

Digits exemplifies the value of expert annotation in their question generation system. Instead of relying on large, potentially noisy datasets, they used a smaller, carefully curated set of examples labeled by experienced accountants. This ensured high-quality training data and improved the accuracy of their LLM. Grammarly’s CoEdit, a specialized LLM for text editing, similarly leveraged expert linguists for evaluation, recognizing the nuanced nature of language and the need for specialized knowledge in assessing text quality.

Integrating user feedback directly into the evaluation process is crucial for building user-centric LLM applications. Instacart's internal AI assistant, Ava, includes features like thumbs up/down ratings and a prompt exchange system, allowing users to directly contribute to the improvement of the system. Holiday Extras, in their company-wide ChatGPT deployment, prioritized measurable business outcomes. They tracked user adoption rates and observed a significant increase in their Net Promoter Score (NPS), rising from 60% to 70%. This suggests that the improved efficiency and customer service enabled by ChatGPT had a positive impact on customer loyalty.
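Capturing this kind of in-product feedback does not require heavy machinery; a minimal, hypothetical event schema like the one below is often enough to start charting a thumbs-up rate alongside adoption and NPS. None of the field names here come from Instacart's or Holiday Extras' systems.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FeedbackEvent:
    response_id: str
    rating: int                      # +1 for thumbs up, -1 for thumbs down
    comment: str = ""
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

events = [
    FeedbackEvent("resp_001", +1),
    FeedbackEvent("resp_002", -1, "cited the wrong cancellation policy"),
]

# A simple health signal to track over time and slice by feature or prompt version.
thumbs_up_rate = sum(e.rating > 0 for e in events) / len(events)
print(f"thumbs-up rate: {thumbs_up_rate:.0%}")
```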

Combining Automated and Human Evaluation

The most effective LLM evaluation strategies strike a balance between automated and human assessment, leveraging the strengths of both approaches.

Zillow's Fair Housing Guardrails system is a prime example of this combined approach, using a combination of prompt engineering, stop lists, and a custom classifier to ensure compliance with legal requirements. The classifier provides automated detection of potentially problematic content, while human experts verify the system's outputs.

Figure: Comparison of three fair housing evaluation strategies. Prompt engineering offers high recall and simple implementation but is non-deterministic and behaves inconsistently; a stop list executes fast and is 100% deterministic but is context-blind with limited coverage; a classifier model provides tunable precision/recall and context-awareness but requires training data and is computationally intensive.

Harvard Business School's ChatLTV assistant also employs a hybrid evaluation strategy, combining manual testing with OpenAI's automated scoring capabilities to assess the quality of generated responses.

Microsoft's Sales Content Recommendation engine takes a similar approach, using automated relevance scoring alongside human judgments to ensure the appropriateness of recommended content.

The "Banking GenAI Implementation" case study highlights the use of technical writers and agent feedback to refine LLM outputs in the financial domain, while Echo.ai's partnership with Log10 demonstrates the power of combining automated evaluation with human expertise.

The Echo.ai and Log10 partnership provides a compelling example of how automated feedback loops can enhance human evaluation. Log10's platform not only provided automated evaluation tools but also enabled Echo.ai to generate synthetic training data and fine-tune their models based on human feedback. This closed-loop system resulted in a remarkable 20-point increase in their F1 score, demonstrating how combining the strengths of automated and human evaluation can lead to substantial performance gains.

Zillow’s Fair Housing Guardrails system provides a powerful example of the importance of human oversight in high-stakes applications. While their system uses a BERT-based classifier to automatically detect potentially discriminatory language, human experts review and validate the classifier’s outputs to ensure compliance with Fair Housing regulations. This combination of automated detection and human review provides a crucial safety net, mitigating the risks associated with relying solely on LLM outputs.
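The layering is straightforward to sketch: a deterministic stop list runs first, a classifier handles context-dependent cases, and anything the classifier flags goes to a human reviewer. The terms, threshold, and `classifier_score` placeholder below are illustrative only; Zillow's actual lists and model are not public.

```python
import re

# Illustrative stop list of exact phrases to block outright.
STOP_TERMS = re.compile(r"\b(adults only|no children)\b", re.IGNORECASE)

def classifier_score(text: str) -> float:
    """Hypothetical trained classifier (e.g. BERT-based) returning the
    probability that the text violates fair-housing rules."""
    raise NotImplementedError

def check_listing_text(text: str, threshold: float = 0.5) -> str:
    # Layer 1: stop list -- fast and fully deterministic, but context-blind.
    if STOP_TERMS.search(text):
        return "blocked"
    # Layer 2: classifier -- context-aware, with tunable precision/recall.
    if classifier_score(text) >= threshold:
        return "needs_human_review"
    return "allowed"
```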

Continuous Evaluation and Improvement 🔄

LLM development is an inherently iterative process, requiring continuous monitoring, evaluation, and refinement. The "MLops Maturity Levels" discussion emphasizes this cyclical nature of LLM improvement, stressing the importance of incremental progress over radical transformation.

Honeycomb's Query Assistant and Salesforce's Slack Summarization system exemplify the critical role of continuous monitoring and feedback loops in maintaining LLM performance. By constantly assessing model outputs and incorporating user feedback, these systems can adapt to evolving requirements and catch potential issues early.

eBay's three-track approach to LLM deployment provides a compelling example of iterative improvement driven by diverse feedback mechanisms. For GitHub Copilot, they relied on telemetry data, including code acceptance rates. With eBayCoder, their internally developed LLM, they leveraged feedback from their engineering team on code quality and maintainability. Finally, their internal knowledge base GPT incorporated user feedback through an RLHF (Reinforcement Learning from Human Feedback) system, allowing them to continuously refine the system's responses. This multi-faceted approach to feedback collection demonstrates how organizations can tailor their evaluation strategies to different LLM applications.
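For telemetry-driven signals like code acceptance rates, the computation itself is simple; the real work is in instrumenting the events. A brief hypothetical sketch follows (field names invented, not GitHub Copilot's or eBay's actual schema).

```python
# Hypothetical suggestion-level telemetry events.
events = [
    {"tool": "copilot", "suggestion_id": "s1", "accepted": True},
    {"tool": "copilot", "suggestion_id": "s2", "accepted": False},
    {"tool": "ebaycoder", "suggestion_id": "s3", "accepted": True},
]

def acceptance_rate(events: list[dict], tool: str) -> float:
    """Share of a tool's suggestions that developers actually accepted."""
    relevant = [e for e in events if e["tool"] == tool]
    if not relevant:
        return 0.0
    return sum(e["accepted"] for e in relevant) / len(relevant)

for tool in ("copilot", "ebaycoder"):
    print(f"{tool}: {acceptance_rate(events, tool):.0%} of suggestions accepted")
```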

Conclusion 🎯

The insights from our LLMOps Database paint a clear picture: effective LLM evaluation requires a multifaceted, iterative approach grounded in clear metrics, diverse assessment strategies, and a commitment to continuous improvement.

By combining automated techniques like LLM-based scoring and heuristic evaluation with human judgment from experts, crowd-workers, and end-users, organizations can create robust evaluation frameworks that ensure the quality and reliability of their LLM applications.

The case studies featured in this post provide a wealth of practical guidance and inspiration for teams looking to uplevel their LLM evaluation practices. By adapting these strategies to their unique contexts and use cases, practitioners can navigate the challenges of LLM evaluation with greater confidence and effectiveness.

As the LLMOps landscape continues to evolve, one thing remains constant: the importance of rigorous, ongoing evaluation in ensuring the success of LLM applications. By embracing the lessons from these pioneering organizations and committing to a culture of continuous assessment and refinement, the MLOps community can unlock the full potential of LLMs in production settings.

Looking to Get Ahead in MLOps & LLMOps?

Subscribe to the ZenML newsletter and receive regular product updates, tutorials, examples, and more articles like this one.