ZenML

LLMOps Best Practices and Success Patterns Across Multiple Companies

HumanLoop

A comprehensive analysis of successful LLM implementations across multiple companies including Duolingo, GitHub, Fathom, and others, highlighting key patterns in team composition, evaluation strategies, and tooling requirements. The study emphasizes the importance of domain experts in LLMOps, proper evaluation frameworks, and the need for comprehensive logging and debugging tools, showcasing concrete examples of companies achieving significant ROI through proper LLMOps implementation.

Industry: Tech
Overview

This case study is derived from a conference talk by a representative of HumanLoop, which describes itself as “probably the first LLMOps platform.” The talk synthesizes lessons learned from working with hundreds of companies—both startups and enterprises—to help them deploy LLM-based applications in production. Rather than focusing on a single deployment, the speaker draws patterns from multiple successful (and unsuccessful) implementations to identify what separates teams that succeed from those that fail.

The overarching thesis is that we have moved past the experimentation phase of LLM adoption. Real revenue and cost savings are being generated now, not in some hypothetical future. The speaker cites Filevine, a legal tech company and HumanLoop customer, as a concrete example: they launched six AI products in a year and roughly doubled their revenue—a significant achievement for a late-stage, fast-growing startup operating in a regulated industry.

LLM Application Architecture Philosophy

The speaker presents a simplified view of LLM application architecture, arguing that most applications consist of just four key components chained together in various ways.

The speaker emphasizes that what makes LLM applications difficult is not the architectural complexity but rather making each of these components actually good. This is where most of the work lies. Interestingly, the speaker predicts that as models improve, systems will become simpler rather than more complex—much of the current chaining and complexity is a workaround for model limitations in areas like tool selection.

GitHub Copilot is cited as an example of this architecture in action: it uses a fine-tuned base model (for latency), a data selection strategy that looks at the previous code behind the cursor and the most similar code from the last 10 files touched, and rigorous evaluation—all following this same fundamental structure.
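To make the data selection idea concrete, here is an illustrative sketch (not GitHub's actual implementation) of picking extra context by similarity: score snippets from recently touched files against the text just before the cursor using token-set overlap, and keep the best match. The function names and the use of Jaccard similarity are assumptions for demonstration.

```python
def jaccard(a: set, b: set) -> float:
    """Token-set overlap between two snippets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def select_context(cursor_prefix: str, recent_files: dict[str, str], window: int = 10):
    """Pick the snippet from the last `window` files most similar to the prefix."""
    prefix_tokens = set(cursor_prefix.split())
    best_path, best_score = None, -1.0
    for path, text in list(recent_files.items())[:window]:
        score = jaccard(prefix_tokens, set(text.split()))
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score

best, score = select_context(
    "def add a b",
    {"x.py": "def add a b return", "y.py": "class Foo"},
)
```

The point of the sketch is that "data selection" here is cheap lexical retrieval over a small recency window, not a heavyweight ML system.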

Team Composition: The “People, Ideas, Machines” Framework

Borrowing from Colonel John Boyd's Pentagon adage of "people, ideas, machines—in that order," the speaker argues that team composition is the first and most critical factor in LLMOps success.

Less ML Expertise Than Expected

The teams that succeed tend to be staffed more by generalist full-stack product engineers rather than traditional machine learning specialists. The term “AI engineer” is beginning to capture this shift—these are people who care about products, understand prompting, and know about models, but they are not fundamentally focused on model training. The expertise needed is more about understanding the API layer and above, not the model internals below.

Domain Experts Are Critical

The most underappreciated insight is how important domain experts are to success. Traditionally in software, product managers or domain experts produce specifications that engineers implement. LLMs have fundamentally changed this dynamic by enabling domain experts to contribute directly to the building of applications—they can help create prompts, define evaluations, and provide feedback that directly shapes the product.

Several examples from HumanLoop's customer base illustrate this point.

The Right Mix

The ideal team composition appears to be: lots of generalist engineers, lots of subject matter experts, and a smaller amount of machine learning expertise. The ML expertise is still valuable—someone needs to understand concepts like building representative test sets and thinking about evaluation—but they don’t need to be doing hardcore model training. A good data science background is sufficient; PhDs with extensive training experience are not necessary.

Evaluation as the Core Discipline

The speaker argues that evaluation must be central to LLMOps practice from day one. Without good evaluation, teams spin their wheels making changes and eyeballing outputs, never trusting results enough to put them in production. Critically, defining evaluation criteria is essentially defining the spec—you’re articulating what “good” looks like.

Evaluation at Every Stage

The best teams incorporate evaluation throughout the entire development lifecycle: lightweight checks while prototyping, monitoring in production, and regression testing before shipping changes.

End User Feedback is Invaluable

The ultimate ground truth for evaluation is user feedback, especially for subjective tasks like summarization or question answering. The speaker emphasizes that end user feedback is “priceless.”

GitHub Copilot exemplifies sophisticated feedback collection: they track not just whether suggestions are accepted, but whether the suggested code stays in the codebase and for how long, at various intervals. This creates a rich signal about actual value delivered.
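The Copilot-style retention signal can be sketched as a small feedback record: log whether a suggestion was accepted, then check at fixed intervals whether the suggested code is still present in the file. The class name, interval values, and substring check are illustrative assumptions, not Copilot's actual mechanism.

```python
from dataclasses import dataclass, field

CHECK_INTERVALS_S = [30, 120, 300, 600]  # hypothetical check-back intervals

@dataclass
class SuggestionFeedback:
    suggestion_id: str
    accepted: bool
    # Maps interval (seconds after acceptance) -> is the code still in the file?
    retention: dict[int, bool] = field(default_factory=dict)

    def record_retention(self, interval_s: int, file_text: str, suggested_code: str):
        """Naive presence check; a real system would diff more carefully."""
        self.retention[interval_s] = suggested_code in file_text

fb = SuggestionFeedback("s1", accepted=True)
fb.record_retention(30, "x = 1\ny = add(a, b)\n", "y = add(a, b)")
fb.record_retention(120, "x = 1\n", "y = add(a, b)")
```

Aggregated over many suggestions, retention-over-time gives a much richer value signal than a raw acceptance rate.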

Common feedback mechanisms include explicit signals, such as ratings, and implicit signals, such as whether users accept, edit, or retain generated output.

The challenge is that end user feedback tends to be lower volume than desired and isn’t available during development, so it can’t be the only evaluation approach.

Building Evaluation Scorecards

Successful teams build scorecards with multiple evaluator types. The key differentiator between high-performing teams and others is the extent to which they break down subjective criteria into small, independently testable components.

LLM-as-judge can work well or poorly depending on how it’s used. Asking a model “is this good writing?” produces noisy, ambiguous results. But asking specific questions like “is the tone appropriate for a child?” or “does this text contain these five required points?” works much better.
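The "specific binary questions" pattern can be sketched as follows: each criterion is a narrow yes/no prompt scored independently. The judge call is pluggable (stubbed here with a lambda; in practice it would call your LLM provider), and the criteria strings are taken from the talk's examples.

```python
CRITERIA = [
    "Is the tone appropriate for a child? Answer yes or no.",
    "Does the text contain all five required points? Answer yes or no.",
]

def judge_output(output: str, criteria: list[str], ask_llm) -> dict[str, bool]:
    """Score one output against each binary criterion independently."""
    results = {}
    for criterion in criteria:
        prompt = f"{criterion}\n\nText:\n{output}"
        answer = ask_llm(prompt).strip().lower()
        results[criterion] = answer.startswith("yes")
    return results

# Stub judge for demonstration; a real one would call an LLM API.
scores = judge_output("Once upon a time...", CRITERIA, ask_llm=lambda p: "Yes")
```

Keeping each criterion binary and independent is what makes the aggregate scorecard debuggable: a failure points at one specific question rather than a vague overall grade.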

Teams should expect to use a mix of evaluator types: deterministic code-based checks, LLM-as-judge evaluators, and human review.

The speaker notes that you’re optimizing on a Pareto frontier rather than a single metric. Unlike traditional ML where you might optimize a single number, product experience is multifaceted—one system might be more expensive but significantly better in helpfulness, and that trade-off is a product decision.
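A minimal sketch of the Pareto-frontier view, under assumed metric names: treat each system variant as a point in (helpfulness, cost) space and keep every variant not dominated by another. Which frontier point to ship is then the product decision the speaker describes.

```python
def pareto_frontier(variants: dict[str, tuple[float, float]]) -> set[str]:
    """variants maps name -> (helpfulness, cost); return non-dominated names.

    A variant is dominated if another is at least as helpful and at least
    as cheap, and strictly better on one of the two.
    """
    frontier = set()
    for name, (h, c) in variants.items():
        dominated = any(
            (h2 >= h and c2 <= c) and (h2 > h or c2 < c)
            for other, (h2, c2) in variants.items()
            if other != name
        )
        if not dominated:
            frontier.add(name)
    return frontier

variants = {
    "large-model": (0.9, 10.0),  # more helpful but pricier
    "small-model": (0.7, 1.0),   # cheaper, weaker
    "bad-combo":   (0.6, 5.0),   # dominated: worse and more expensive than small-model
}
```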

Hex is cited as an example: their head of AI described breaking down evaluation criteria into small, essentially binary pieces that can be scored independently and then aggregated. He explicitly warned against seeking a “single god metric.”

Vanta operates in a regulated space and relies on a mixture of automated evaluation and substantial human feedback because the stakes are too high to rely solely on automation.

Tooling and Infrastructure

Once team composition and evaluation strategy are in place, teams need to think about tooling. Three requirements emerge as critical: optimizing for team collaboration, supporting evaluation at every stage, and comprehensive logging.

Optimize for Team Collaboration

Prompts are natural language artifacts that act like code, but if you store them in a codebase and treat them as normal code, you alienate the domain experts who should be deeply involved. Systems should be designed so domain experts can participate in both prompt engineering and evaluation. They may not drive the technical process of building test sets, but they know what good looks like.

Evaluation at Every Stage

Tooling should support lightweight evaluation during prototyping, production monitoring, and regression testing—not just one of these stages.

Comprehensive Logging

Ideally, teams should capture inputs and outputs at every stage with the ability to replay runs and promote data points from production logs into test sets of edge cases. This creates a virtuous cycle where production issues become regression tests.

The Ironclad/Rivet Example

The speaker shares a compelling story about Ironclad, which built an open-source library called Rivet. Their CTO reportedly said they almost gave up on agents before having proper tooling. They started building agents with function calls—the system worked with one call, and with two, but once a third and fourth were added it began failing catastrophically.

An engineer built logging and rerun infrastructure as a “secret weekend project.” Only after having the ability to debug traces did they realize they could achieve production-grade performance. Now, for their biggest customers, roughly 50% of contracts are auto-negotiated—a capability that wouldn’t exist without that debugging infrastructure.

Notion’s Approach

The speaker references Linus Lee from Notion, who gave a separate talk about their logging practices—particularly the ability to find any AI run from production and replay it with modifications.

Key Takeaways and Caveats

It’s worth noting that this talk comes from a vendor (HumanLoop) selling LLMOps tooling, so the emphasis on tooling should be taken with appropriate skepticism. That said, the examples cited are from real companies, some of which built tooling themselves (like Ironclad’s Rivet, which is open source), suggesting the lessons transcend any particular product.

The central message—that LLM applications are now generating real ROI—is supported by specific claims (Filevine doubling revenue, Ironclad auto-negotiating 50% of contracts) but these should be understood as self-reported outcomes from HumanLoop customers, not independently verified results.

The framework of “people, ideas, machines” provides a useful mental model: get team composition right first (center domain experts, don’t over-hire ML specialists), then focus on evaluation criteria and feedback loops, and finally invest in tooling that supports collaboration and debugging. Teams that succeed appear to follow this sequence, while teams that fail often jump straight to tooling or over-invest in ML expertise at the expense of domain knowledge.
