Building Robust Evaluation Systems for Auto-Generated Video Titles

Loom 2025

Loom developed a systematic approach to evaluating and improving their AI-powered video title generation feature. They created a comprehensive evaluation framework combining code-based scorers and LLM-based judges, focusing on specific quality criteria like relevance, conciseness, and engagement. This methodical approach to LLMOps enabled them to ship AI features faster and more confidently while ensuring consistent quality in production.

Industry: Tech

Overview

Loom is a video communication platform used by millions of professionals globally. With the emergence of generative AI capabilities, Loom’s engineering team identified an opportunity to enhance user experience through LLM-powered features, specifically automatic video title generation. This case study focuses on how Loom approached the LLMOps challenge of evaluating and ensuring quality for AI-generated content in a production environment.

The core challenge Loom faced was not just building the AI feature itself, but establishing a robust methodology for evaluating whether the auto-generated titles were actually good. This is a fundamental LLMOps concern: how do you systematically measure and improve the quality of LLM outputs at scale? The team’s solution involved developing custom scoring functions within the Braintrust evaluation platform, following a structured approach to evaluation design.

The Evaluation Philosophy

Loom’s approach to LLM evaluation centers on a user-centric philosophy rather than a model-centric one. Instead of asking “Is the model following my instructions correctly?”, they reframe the question as “Is the output ideal for our users, regardless of how it was generated?” This subtle but important distinction allows for higher-level improvements to features and keeps the focus on real-world user value rather than technical compliance.

This philosophy acknowledges that prompt engineering and model fine-tuning are means to an end—the end being user satisfaction. By evaluating outputs against user expectations rather than instruction-following metrics, teams can more easily identify when fundamental changes (like switching models, adjusting prompts, or restructuring the entire approach) might be needed.

The Four-Step Evaluation Process

Loom developed a structured four-step process for creating evaluation scorers for any AI feature:

Step 1: Identifying Traits of Great Outputs

The process begins by examining the inputs (what data or prompt the model receives) and the expected outputs (what the model should generate). For the auto-generated video titles feature, the team identified several key traits of a great title: it should be relevant to the video's main topic, concise, engaging, clear, and written in the correct language.

This upfront trait identification provides a clear roadmap for building scorers before any actual evaluation code is written.

Step 2: Checking for Common Quality Measures

The team maintains awareness of quality measures that apply broadly across LLM use cases. These include relevance/faithfulness (accuracy to source material), readability, structure/formatting compliance, factuality (especially for RAG-based systems), bias/toxicity/brand safety considerations, and correct language output. While not every measure applies to every use case, doing a quick assessment helps ensure nothing important is overlooked.

Importantly, Loom customizes these general measures for each specific feature rather than applying generic scorers. By tailoring scorers to the task at hand, they reduce noise in prompts and achieve more reliable, actionable results.

Step 3: Implementing Objective Measures with Code

A key LLMOps best practice highlighted in this case study is the use of deterministic, code-based scorers whenever possible. For objective checks—such as “Does the output contain exactly one emoji at the end?” or “Does a JSON response contain all required keys?”—code-based validation is preferred over LLM-based evaluation.
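
As an illustration, here is a minimal sketch of such deterministic checks in Python. The function names, the emoji heuristic, and the required-key set are assumptions for illustration, not Loom's actual scorers:

```python
import json
import re

# Rough emoji range; a production check would use a fuller Unicode property list.
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def ends_with_single_emoji(title: str) -> float:
    """1.0 if the title contains exactly one emoji and it is the last character."""
    if not title:
        return 0.0
    emojis = EMOJI.findall(title)
    return 1.0 if len(emojis) == 1 and EMOJI.fullmatch(title[-1]) else 0.0

def has_required_keys(raw: str, required=frozenset({"title", "language"})) -> float:
    """1.0 if the response parses as JSON and contains every required key."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(payload, dict) and required <= payload.keys() else 0.0

print(ends_with_single_emoji("Quarterly Roadmap Review 🚀"))              # 1.0
print(has_required_keys('{"title": "Roadmap Review", "language": "en"}'))  # 1.0
```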

This approach offers several advantages in production environments: deterministic checks are reproducible, run faster and cost far less than LLM calls, and are straightforward to debug when a score looks wrong.

This hybrid approach—using code where possible and LLMs only where necessary—represents a mature LLMOps practice that balances capability with reliability and cost.

Step 4: Creating and Iterating on Scorers

With feature-specific criteria and common measures identified, the team sets up initial scorers in Braintrust. For the video title feature, these included scorers for relevance (focused on the main video topic), conciseness, engagement potential, clarity, and correct language.

The iteration process involves feeding approximately 10-15 test examples through the evaluation pipeline, inspecting results carefully, and refining scorers based on observations. Once the team is satisfied that the scorers are properly calibrated, they scale up to larger datasets via online evaluations.
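
A rough sketch of how one such scorer might be wired into an eval run with Braintrust's Python SDK follows; the project name, calibration examples, and title-generation stub are hypothetical, and scorer parameter conventions may differ across SDK versions:

```python
from braintrust import Eval

def conciseness(input, expected, output) -> float:
    # Deterministic stand-in for a conciseness scorer: titles under ~60 chars pass.
    return 1.0 if len(output) <= 60 else 0.0

def generate_title(transcript: str) -> str:
    # Placeholder for the real prompt/model call.
    return transcript.split(".")[0][:60]

Eval(
    "video-title-generation",  # hypothetical project name
    data=lambda: [
        {"input": "We walked through the Q3 roadmap. First up was ...",
         "expected": "Q3 Roadmap Walkthrough"},
        # ... roughly 10-15 calibration examples in total ...
    ],
    task=generate_title,
    scores=[conciseness],  # plus LLM-as-judge scorers for relevance, engagement, etc.
)
```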

Best Practices and Technical Details

Several specific best practices emerged from Loom’s experience:

Chain-of-thought for LLM-as-a-judge scorers: When using LLMs to evaluate other LLM outputs, enabling chain-of-thought reasoning is described as crucial. This produces a “rationale” explaining why a particular score was given, which is invaluable for calibrating and debugging scorers. Without this visibility, it would be difficult to understand whether a scorer is actually measuring what it’s intended to measure.
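
A sketch of what a chain-of-thought judge for a single aspect (relevance) might look like, using the OpenAI Python SDK; the prompt wording, model choice, and JSON schema are assumptions rather than Loom's actual judge:

```python
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an auto-generated title for a video.

Transcript (excerpt):
{transcript}

Candidate title:
{title}

First reason step by step about whether the title reflects the video's main
topic. Then respond with JSON: {{"rationale": "<your reasoning>", "score": 0 or 1}}."""

def judge_relevance(transcript: str, title: str) -> dict:
    """Single-aspect LLM judge that returns a score plus a debuggable rationale."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model works here
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(transcript=transcript, title=title)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

The returned rationale can then be logged alongside the score, which is what makes calibrating and debugging the judge tractable.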

Single-aspect focus per scorer: Each scorer should evaluate one distinct aspect of the output (e.g., factual correctness, style, or emoji presence) rather than attempting to assess multiple qualities at once. This separation makes it easier to identify specific areas needing improvement and provides clearer signals for optimization.

Weighted scoring for prioritization: When certain aspects matter more than others (factual accuracy over stylistic preferences, for example), weighted averages can combine individual scores into meaningful aggregate metrics. This acknowledges that not all quality dimensions are equally important in production.
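
For example, a weighted average along these lines could fold per-aspect scores into one aggregate; the aspect names and weights here are illustrative:

```python
def weighted_overall(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-aspect scores (each in [0, 1]) into a single weighted metric."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total

overall = weighted_overall(
    scores={"relevance": 1.0, "conciseness": 0.5, "engagement": 1.0},
    weights={"relevance": 3.0, "conciseness": 1.0, "engagement": 1.0},  # accuracy outweighs style
)
print(round(overall, 2))  # 0.9
```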

Appropriate scoring granularity: Sometimes binary (yes/no) scoring is sufficient, while other situations benefit from 3-point or 5-point scales that capture more nuance. Choosing the right granularity depends on the specific quality being measured and how actionable the resulting data needs to be.
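
When a coarser scale is chosen, the judge's verdict can be normalized into the same 0-1 range the other scorers use; a minimal sketch, with assumed category labels:

```python
# Map a 3-point judge verdict onto the 0.0-1.0 range shared by all scorers.
SCALE_3PT = {"poor": 0.0, "acceptable": 0.5, "great": 1.0}

def normalize_3pt(verdict: str) -> float:
    return SCALE_3PT[verdict.strip().lower()]

print(normalize_3pt("Acceptable"))  # 0.5
```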

The Evaluation Cycle

Loom describes their overall approach as a cycle: define → implement → evaluate → refine. This iterative process combines prompt tuning, scorer adjustment, and dataset refinement until the team is confident that the evaluation system captures all essential qualities of the feature.

By integrating both objective (code-based) and subjective (LLM-based) measures, the team can quickly identify both technical issues and quality concerns. The emphasis on starting small and iterating quickly helps avoid “analysis paralysis”—a common trap when teams try to build perfect evaluation systems before shipping anything.

Results and Business Impact

The article claims that Loom has established a “repeatable, reliable system for shipping features faster and more confidently.” The specific benefits mentioned include the ability to run large-scale evaluations more quickly, ensuring features reliably meet user needs, and shipping improvements with confidence that they’ve been thoroughly tested.

It’s worth noting that this case study is presented through Braintrust’s blog, so there is inherent bias toward presenting the methodology (and the evaluation platform) favorably. The article does not provide specific quantitative results—such as improvement percentages, user satisfaction metrics, or comparisons with previous approaches—which would help validate the claimed benefits more concretely.

Critical Assessment

While the methodology described is sound and represents genuine LLMOps best practices, the caveats already noted apply: the account comes from the evaluation vendor's own blog, and no quantitative evidence is offered to substantiate the claimed benefits.

That said, the four-step scorer creation process and the emphasis on combining deterministic code-based checks with LLM-based evaluation represent a pragmatic, cost-effective approach to LLM evaluation that would be applicable across many use cases. The principle of evaluating from a user perspective rather than a model-compliance perspective is particularly valuable for teams building user-facing AI features.

Applicability to Other Use Cases

The methodology described is explicitly positioned as applicable beyond video title generation. The article suggests it would work for chat-based assistants, summarization, content generation, and other LLM applications. The structured approach to identifying quality traits, implementing objective checks in code, and iteratively refining LLM-based scorers provides a template that other engineering teams could adapt for their own features.
