Echo AI, leveraging Log10's platform, developed a system for analyzing customer support interactions at scale using LLMs. They faced the challenge of maintaining accuracy and trust while processing high volumes of customer conversations. The solution combined Echo AI's conversation analysis capabilities with Log10's automated feedback and evaluation system, resulting in a 20-point F1 score improvement in accuracy and the ability to automatically evaluate LLM outputs across various customer-specific use cases.
This case study, presented as a joint talk between Echo AI and Log10, demonstrates a real-world production LLM deployment focused on customer conversation analytics at enterprise scale. Echo AI is a platform that connects to various customer communication channels (support tickets, chat, phone calls) and uses generative AI to extract insights, categorize conversations, and surface actionable information for customer-facing teams. The partnership with Log10 addresses one of the most critical challenges in LLMOps: maintaining accuracy and trust when deploying LLMs at scale.
Echo AI serves enterprises dealing with exceptionally high volumes of customer interactions. The core insight motivating the platform is that most companies manually review only a small sample (around 5%) of customer conversations, leaving the vast majority unanalyzed.
Traditional review approaches are fundamentally reactive: they happen only after problems have already occurred. As one speaker noted, “everything is after fires had formed, you have no sense of where the smoke is.”
The promise of generative AI is 100% coverage: analyzing every conversation rather than a sample, and surfacing insights the system was never explicitly programmed to look for. However, this introduces significant LLMOps challenges around accuracy, trust, and ongoing quality management.
Echo AI’s system follows a pipeline architecture that is common in production LLM applications:
Data Ingestion and Normalization: The platform connects to various contact systems and ticket systems, pulling in customer conversations. This data is normalized, cleaned, and compressed to be passed efficiently into LLM prompts. While described as “non-AI boring stuff,” this ETL layer is critical infrastructure for any production LLM system.
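The normalization-and-compression step might look like the following minimal sketch. The `Turn` shape, field names, and character budget are illustrative assumptions, not Echo AI's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str   # e.g. "customer" or "agent"
    text: str

def normalize(raw_messages, speaker_key="from", text_key="body"):
    """Map channel-specific message dicts onto a common Turn shape."""
    turns = []
    for m in raw_messages:
        text = " ".join(m.get(text_key, "").split())  # collapse stray whitespace
        if text:
            turns.append(Turn(speaker=m.get(speaker_key, "unknown"), text=text))
    return turns

def compress(turns, max_chars=2000):
    """Render turns as prompt-ready lines, dropping the oldest turns
    until the rendered transcript fits the prompt budget."""
    lines = [f"{t.speaker}: {t.text}" for t in turns]
    while lines and sum(len(line) + 1 for line in lines) > max_chars:
        lines.pop(0)  # truncate oldest-first
    return "\n".join(lines)
```

Oldest-first truncation is only one of several budget strategies; summarizing or chunking older context are common alternatives.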
Multiple Analysis Pipelines: Echo AI runs dozens of pipelines that assess conversations in different ways. These are configurable by users/customers, who can specify what they care about and what they’re looking for. Notably, customers work with Echo AI to write prompts, and eventually take ownership of prompt management over time. This represents a mature approach to prompt engineering in production—treating prompts as configurable, customer-specific assets rather than fixed system components.
Extracted Insights: From a single customer message, the system extracts multiple dimensions of insight.
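One common way to implement this kind of multi-dimensional extraction is a single structured-output call. The dimension names below are illustrative guesses, since the talk does not enumerate Echo AI's actual fields:

```python
import json

# Illustrative dimensions only; not Echo AI's actual schema.
DIMENSIONS = ["intent", "sentiment", "product_mentioned", "escalation_risk"]

def build_extraction_prompt(message: str) -> str:
    keys = ", ".join(f'"{d}"' for d in DIMENSIONS)
    return (
        f"Return a JSON object with the keys {keys} "
        "describing this customer message.\n\n"
        f"Message: {message}"
    )

def extract_dimensions(message: str, llm) -> dict:
    """llm is any callable prompt -> str; parse and validate its JSON reply."""
    reply = llm(build_extraction_prompt(message))
    data = json.loads(reply)
    return {d: data.get(d) for d in DIMENSIONS}
```

Extracting all dimensions in one call keeps prompt volume manageable at high throughput, at the cost of a more complex output schema to validate.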
Self-Hosted Models: Due to the immense volume of prompts and throughput requirements, Echo AI does “quite a bit of self-hosting” and is “constantly training new models to better handle different domains of our customer base.” This highlights a key production consideration—when scaling LLM applications, self-hosting and fine-tuning can become necessary for cost and latency reasons.
The presentation emphasizes that enterprise customers are primarily concerned with accuracy. There is significant hesitation in the market around whether generative AI insights can be trusted—whether they’re better than what human business analysts, CX leaders, or sales VPs could produce.
Echo AI’s approach to building trust centers on demonstrating measurable accuracy to customers. This commercial pressure around accuracy is what makes the Log10 partnership critical.
Log10 provides what they describe as an “infrastructure layer to improve LLM accuracy.” Their vision is building self-improving systems where LLM applications can improve prompts and models themselves. While acknowledging they’re “not there as a field yet,” they’ve made progress with their Auto Feedback system.
The Problem with LLM-as-Judge: The presentation cites several well-known issues with using LLMs to evaluate the outputs of other LLMs.
Auto Feedback Research: Log10 conducted research on building auto feedback models using three approaches. Key findings include a reported 45% improvement in evaluation accuracy.
The Log10 platform integrates via a “seamless one-line integration” that sits between the LLM application and the LLM SDK. It supports OpenAI, Anthropic, Gemini, and open-source models, plus framework integrations.
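The exact Log10 API isn't shown in the presentation, but the "sits between the application and the LLM SDK" pattern can be sketched generically as a wrapper that records every call for later evaluation (all names here are hypothetical):

```python
import functools
import time

def with_logging(create_fn, sink):
    """Wrap an SDK completion function so every call is recorded to sink.
    Mimics the interception pattern described in the talk; the real Log10
    integration is a library import, not this code."""
    @functools.wraps(create_fn)
    def wrapped(*args, **kwargs):
        start = time.time()
        response = create_fn(*args, **kwargs)
        sink.append({
            "kwargs": kwargs,
            "response": response,
            "latency_s": time.time() - start,
        })
        return response
    return wrapped
```

Because the wrapper is transparent to callers, it can be dropped in front of any provider's completion function without touching application code, which is what makes a "one-line integration" plausible.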
For Echo AI specifically, the integration enables:
Engineer Debugging: When outputs are problematic (the demo showed a summarization failure that just reiterated system prompt instructions), engineers can quickly investigate by examining the generated prompts and understanding failure modes.
Solution Engineer Workflow: Solution engineers working directly with customers can view auto-generated feedback scores and provide human overrides when needed. The interface allows changing point values and accepting corrections. This creates an “effortless” way to collect high-fidelity human feedback at scale.
Monitoring Use Cases: Beyond debugging individual failures, the auto feedback system supports ongoing monitoring of output quality in production.
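A monitoring loop built on auto-feedback scores can be sketched as a triage function; the threshold and sampling cadence below are invented for illustration:

```python
def triage(records, score_key="auto_score", threshold=0.7, review_every=20):
    """Split scored generations into pass / human-review queues.
    Low scorers always go to review; every Nth passing record is also
    sampled so the auto-judge itself stays calibrated against humans."""
    passed, review = [], []
    for i, r in enumerate(records):
        if r[score_key] < threshold:
            review.append(r)
        elif i % review_every == 0:
            review.append(r)   # spot-check a slice of "good" outputs
        else:
            passed.append(r)
    return passed, review
```

Routing a slice of high-scoring outputs to humans is what lets the human overrides described above keep correcting the auto-feedback model over time.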
Several production-specific challenges are addressed in this case study:
Prompt Diversity: Because every customer is different and each brings different requirements, there’s an “immense number of prompts” that must be managed. This creates unique challenges for quality assurance—you can’t just evaluate a single system prompt, you need tooling that scales across customer-specific configurations.
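Managing an "immense number" of customer-specific prompts typically calls for some form of registry with per-customer overrides; a minimal sketch, with hypothetical names:

```python
class PromptRegistry:
    """Resolve prompt templates per (customer, pipeline), falling back to a
    shared default, so customer-specific overrides can be versioned and
    evaluated independently of the base configuration."""

    def __init__(self):
        self._prompts = {}

    def set(self, pipeline, template, customer=None):
        # customer=None registers the shared default for that pipeline
        self._prompts[(customer, pipeline)] = template

    def get(self, pipeline, customer):
        return (self._prompts.get((customer, pipeline))
                or self._prompts[(None, pipeline)])
```

Keying evaluations by the same (customer, pipeline) pair is what lets quality tooling scale across configurations instead of assuming a single system prompt.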
Summarization as Critical Infrastructure: Echo AI relies on summarization not just as a user-facing feature but as input for “a variety of different downstream analysis.” This cascade dependency makes summarization accuracy particularly important—errors propagate through the system.
Trust and Maintenance: The system requires ongoing maintenance to “achieve the utmost trust” with customers. This isn’t a deploy-and-forget situation; there’s continuous work to monitor quality and improve models.
The partnership claims a 20 F1-point improvement in accuracy for specific use cases. While the exact baseline and methodology aren’t detailed in the presentation, this represents a significant claimed improvement in production accuracy.
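For context on the metric: F1 is the harmonic mean of precision and recall, so a 20-point gain is substantial. The numbers below are purely illustrative, since the baseline is not disclosed:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative only: the talk does not disclose Echo AI's actual baseline.
baseline = f1(0.65, 0.65)   # equal precision/recall gives F1 = 0.65
improved = f1(0.85, 0.85)   # equal precision/recall gives F1 = 0.85
gain_in_points = round((improved - baseline) * 100)
```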
A concrete customer example mentioned was Wine Enthusiasts, a company selling high-end wine refrigerators. Echo AI’s real-time analysis surfaced a manufacturing defect that could have “gone on for weeks and weeks and weeks” before detection through traditional methods.
This case study represents a joint presentation from a vendor (Log10) and customer (Echo AI), so claims should be considered in that context. The accuracy improvement metrics (20 F1 points, 45% evaluation accuracy improvement) are presented without detailed methodology. The “95% accuracy” target for Echo AI is acknowledged as involving “a lot of sampling and figuring out,” suggesting the measurement itself is challenging.
That said, the case study presents a realistic picture of LLMOps challenges at scale: the need for customer-specific prompt configuration, the importance of automated evaluation to supplement limited human review capacity, the challenge of maintaining quality across model updates, and the commercial pressure to build and maintain customer trust in AI-generated insights. The emphasis on human-in-the-loop feedback collection and the acknowledgment that the field isn’t yet at truly “self-improving systems” reflects a mature understanding of current LLM limitations.
The technical approaches discussed—fine-tuning evaluation models, bootstrap data generation, open-source model deployment—represent practical production strategies rather than theoretical frameworks, making this a useful reference for teams building similar systems.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Predibase, a fine-tuning and model serving platform, announced its acquisition by Rubrik, a data security and governance company, with the goal of combining Predibase's generative AI capabilities with Rubrik's secure data infrastructure. The integration aims to address the critical challenge that over 50% of AI pilots never reach production due to issues with security, model quality, latency, and cost. By combining Predibase's post-training and inference capabilities with Rubrik's data security posture management, the merged platform seeks to provide an end-to-end solution that enables enterprises to deploy generative AI applications securely and efficiently at scale.
Stripe, which processes approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to transformer-based foundation models for payments that score every transaction in under 100ms. The company built a domain-specific foundation model that treats charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection and improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs. Internally, AI adoption has reached 8,500 employees using LLM tools daily, with 65-70% of engineers using AI coding assistants and significant productivity gains such as reducing payment method integrations from two months to two weeks.