Honeycomb implemented a natural language querying interface for their observability product and faced challenges in maintaining and improving it post-launch. They solved this by implementing comprehensive observability practices, capturing everything from user inputs to LLM responses using distributed tracing. This approach enabled them to monitor the entire user experience, isolate issues, and establish a continuous improvement flywheel, resulting in higher product retention and conversion rates.
Honeycomb, an observability company, developed a natural language querying interface for their product that allows users to query their observability data using natural language rather than learning the company’s query interface. This case study, presented by Phillip Carter (a Honeycomb engineer and OpenTelemetry maintainer), provides an honest and detailed look at the challenges of making LLM-powered features production-ready beyond an initial beta release.
The core insight from this case study is that while it’s relatively easy to ship an LLM-powered feature quickly (the speaker suggests a product engineering team with a month and OpenAI API keys can produce something useful), the real challenge lies in iterating and improving the feature over time without introducing regressions. The speaker emphasizes that “pretty good is not production ready” and that after the initial marketing splash, users will expect much more from the product.
The presentation highlights several fundamental challenges with making LLMs reliable in production settings:
Traditional software engineering tools fall short when working with LLMs. Unit tests, IDE debugging, and regression tests are described as “exceptionally difficult with LLMs if not impossible depending on what you’re doing.” This is a significant departure from standard software development practices and requires new approaches.
Prompt engineering introduces fragility. The speaker notes that “very very subtle changes in your prompt can massively influence the way that the LLM is going to behave.” This makes it easy to introduce regressions, sometimes for behaviors that the team didn’t even realize were working in the first place. This unpredictability is a core challenge of LLMOps.
RAG pipelines add complexity. As the system pulls in additional context and data for each request, each component becomes another variable that can affect LLM behavior. The Honeycomb system has approximately 40 steps in their RAG pipeline, and tweaking any one of these can dramatically change outputs.
Honeycomb’s approach was to apply observability practices more extensively than they had for any other product feature. The key was capturing literally everything about each request:
What they capture: user inputs, the full prompts sent to the model, every decision made in the RAG pipeline, LLM responses, parsing and validation outcomes, and user feedback.
This comprehensive data capture allows the team to isolate specific problems cleanly. The speaker shows how they can filter to view every instance where they failed to produce a query for a user, grouped by user input and LLM response. This makes it immediately clear what inputs are causing problems and what the LLM is (or isn’t) producing.
A key architectural decision was treating the LLM system as a distributed system, which it fundamentally is—involving Honeycomb’s application, OpenAI’s API, and their querying engine as separate components. They use distributed tracing to capture the full user experience.
The traces are connected to their entire application, not just the LLM call in isolation. Each trace dealing with the LLM portion is approximately 48 spans in length, with the RAG pipeline making up the majority. One collapsed section alone contains over 20 spans, representing the decisions made before any request to OpenAI is sent.
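The idea of one span per pipeline step can be sketched with a minimal, stdlib-only trace recorder. This is not Honeycomb's actual instrumentation (they use OpenTelemetry with an observability backend); the span names, attributes, and pipeline steps below are illustrative.

```python
import time
import uuid
from contextlib import contextmanager

# Collected span records for the process; in production these would be
# OpenTelemetry spans exported to an observability backend.
spans = []

@contextmanager
def span(name, trace_id, **attributes):
    """Record a timed span with arbitrary attributes (illustrative only)."""
    start = time.perf_counter()
    record = {"name": name, "trace_id": trace_id, **attributes}
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        spans.append(record)

def handle_request(user_input):
    """Handle one natural-language query, emitting a span per step."""
    trace_id = uuid.uuid4().hex
    with span("nlq.request", trace_id, user_input=user_input):
        # Each RAG pipeline step becomes its own span, so a failure can be
        # isolated to the exact step that caused it.
        with span("rag.fetch_schema", trace_id):
            columns = ["duration_ms", "status_code"]
        with span("rag.build_prompt", trace_id, column_count=len(columns)):
            prompt = f"Columns: {columns}\nUser: {user_input}"
        with span("llm.call", trace_id, prompt=prompt) as s:
            s["response"] = '{"calculation": "COUNT"}'  # stand-in for the LLM reply
    return trace_id

handle_request("count errors in the last hour")
```

Because every span carries the same trace ID, the whole request can be reassembled end to end, including the decisions made before the LLM is ever called.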
This level of detail allows the team to pinpoint which pipeline step a failure originated in, inspect the decisions made before any request reaches OpenAI, and reconstruct the full user experience for any individual request.
The monitoring approach focuses on the full end-to-end user experience rather than just technical metrics like latency and errors. Every one of the 48 spans counts as a bucket of events that they monitor. When failures occur, they can examine all dimensions of the data to isolate specific error patterns.
For example, if they see an error like “LLM response does not contain valid JSON,” they can group by that field, query for specific instances of that problem, and dig into individual requests to understand root causes.
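That group-by-error workflow can be approximated in a few lines; the event fields and error strings below are invented for illustration (Honeycomb runs these queries in their own product, not in application code).

```python
from collections import Counter

# Illustrative captured events, one per natural-language query attempt.
events = [
    {"user_input": "p99 latency by endpoint", "error": None},
    {"user_input": "show me the slow stuff", "error": "LLM response does not contain valid JSON"},
    {"user_input": "errors by service", "error": None},
    {"user_input": "wat", "error": "LLM response does not contain valid JSON"},
    {"user_input": "trace waterfall", "error": "query failed validation"},
]

# Group failures by error message, mirroring a "group by error" query.
failure_counts = Counter(e["error"] for e in events if e["error"])

# Drill into individual requests for the most common failure.
top_error, _ = failure_counts.most_common(1)[0]
offending_inputs = [e["user_input"] for e in events if e["error"] == top_error]
```

The same pattern works for any captured dimension: group, rank, then drill into the individual requests behind the top bucket.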
Perhaps the most valuable insight from this case study is the establishment of a continuous improvement flywheel:
The team identifies problems through observability data, makes a fix in their prompt engineering, deploys that fix, and then examines the past 24 hours of data to determine whether their rate of success or failure improved. They deploy daily and continuously look at what happened over the previous day. Over time, this iterative process is how they made their LLMs reliable.
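The 24-hour check at the heart of the flywheel is a small computation over captured events. The timestamps, outcomes, and deploy cutoff below are invented for illustration.

```python
from datetime import datetime, timedelta

deploy_time = datetime(2023, 6, 1, 9, 0)  # hypothetical daily deploy

# Illustrative per-request outcomes; "ok" means a valid query was produced.
events = [
    (deploy_time - timedelta(hours=20), "failed"),
    (deploy_time - timedelta(hours=12), "ok"),
    (deploy_time - timedelta(hours=3), "failed"),
    (deploy_time + timedelta(hours=2), "ok"),
    (deploy_time + timedelta(hours=8), "ok"),
    (deploy_time + timedelta(hours=15), "failed"),
    (deploy_time + timedelta(hours=21), "ok"),
]

def success_rate(subset):
    """Fraction of requests that produced a valid query."""
    outcomes = [status for _, status in subset]
    return sum(s == "ok" for s in outcomes) / len(outcomes)

before = success_rate([e for e in events if e[0] < deploy_time])
after = success_rate([e for e in events if e[0] >= deploy_time])
improved = after > before
```

If `improved` is false, the prompt change is revisited the next day; the loop repeats with each deploy.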
This approach treats LLM improvement as an empirical, data-driven process rather than relying on intuition or limited testing.
The feature launched around May 3rd, and the initial release solved approximately 80% of use cases—described as “pretty awesome” for a first release. However, the remaining 20% represented a long tail of important edge cases that paying customers cared about.
After implementing their observability-driven improvement process, they achieved measurably higher product retention and conversion rates.
The sales team could simply direct prospects to type what they want in a text box and hit “get query,” significantly streamlining the onboarding experience.
The speaker emphasizes that this approach is accessible today using existing tools.
The data collected serves multiple purposes: it’s the source for prompt engineering improvements, fine-tuning decisions, and any evaluations being performed.
Looking forward, the speaker (who is an OpenTelemetry project maintainer) notes that the community is developing semantic conventions for AI and vector database instrumentation, automatic instrumentation libraries, and best practices for manual instrumentation. Within approximately six months from the presentation, getting started with this level of observability should become significantly easier.
This case study offers several important lessons for teams building with LLMs in production:
The initial launch is the easy part. The hard work begins after users have expectations and you need to iterate without breaking things that already work. Teams should plan for this phase from the beginning.
Traditional software testing approaches have limited applicability to LLM systems. New approaches centered on observability and real-world behavior analysis are necessary.
Treat LLM systems as distributed systems from the start. Use distributed tracing to capture the full context of each request across all components.
Capture everything, not just the obvious metrics. User inputs, full prompts, every RAG pipeline decision, LLM responses, parsing results, validation outcomes, and user feedback all provide valuable signal for improvement.
Establish a daily improvement cycle. Deploy changes, observe results over 24 hours, and use that data to guide the next iteration. This empirical approach is more effective than trying to anticipate all edge cases in advance.
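"Capture everything" amounts to emitting one wide event per request. The sketch below shows the kinds of fields involved; the schema and field names are illustrative, not Honeycomb's actual event format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class QueryRequestEvent:
    """One wide event per natural-language query attempt (illustrative schema)."""
    user_input: str
    full_prompt: str
    rag_steps: list = field(default_factory=list)  # decisions made while assembling context
    llm_response: str = ""
    parsed_ok: bool = False
    validation_error: Optional[str] = None
    user_feedback: Optional[str] = None            # e.g. thumbs up / thumbs down

event = QueryRequestEvent(
    user_input="count errors in the last hour",
    full_prompt="Schema: ...\nUser: count errors in the last hour",
    rag_steps=["fetched schema", "selected relevant columns"],
    llm_response='{"calculation": "COUNT"}',
    parsed_ok=True,
)
```

Every field is a dimension that can later be filtered or grouped on, which is what makes the daily improvement cycle possible.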
The case study is notable for its honesty about the challenges involved—the speaker acknowledges that their blog posts about “the hard stuff nobody’s talking about” struck a nerve with the LLM development community, suggesting that many teams struggle with similar issues but don’t publicly discuss them.
This case study captures insights from Lance Martin, an ML engineer at LangChain, discussing the evolution from traditional ML to LLM-based systems and the emerging engineering discipline of building production GenAI applications. The discussion covers key challenges including the shift from model training to model orchestration, the need to continuously rearchitect systems as foundation models rapidly improve, and the critical importance of context engineering to manage token usage and prevent context degradation. Solutions explored include workflow versus agent architectures, the three-part context engineering playbook (reduce, offload, isolate), and evaluation strategies that emphasize user feedback and tracing over static benchmarks. Results show that teams like Manus have rearchitected their systems five times since March 2025, and that simpler approaches with proper observability often outperform complex architectures, with the understanding that today's solutions must be rebuilt as models improve.
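The "reduce" part of the reduce/offload/isolate playbook can be sketched as trimming conversation history to a token budget. This is a minimal illustration, not LangChain's implementation; word count stands in for a real tokenizer.

```python
def reduce_context(messages, budget):
    """Keep the system message plus the most recent messages that fit the
    budget, dropping older history first (illustrative only)."""
    def cost(msg):
        return len(msg["content"].split())  # word count as a token stand-in

    system, rest = messages[0], messages[1:]
    kept, used = [], cost(system)
    for msg in reversed(rest):  # walk newest-first
        if used + cost(msg) > budget:
            break
        kept.append(msg)
        used += cost(msg)
    return [system] + list(reversed(kept))

messages = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "open the repo and list files"},
    {"role": "assistant", "content": "listed forty files in src"},
    {"role": "user", "content": "now run the tests"},
]
trimmed = reduce_context(messages, budget=15)
```

"Offload" and "isolate" follow the same spirit: move bulky context to external storage or a sub-agent rather than carrying it in every prompt.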
OpenAI's applied evaluation team presented best practices for implementing LLMs in production through two case studies: Morgan Stanley's internal document search system for financial advisors and Grab's computer vision system for Southeast Asian mapping. Both companies started with simple evaluation frameworks using just 5 initial test cases, then progressively scaled their evaluation systems while maintaining CI/CD integration. Morgan Stanley improved their RAG system's document recall from 20% to 80% through iterative evaluation and optimization, while Grab developed sophisticated vision fine-tuning capabilities for recognizing road signs and lane counts in Southeast Asian contexts. The key insight was that effective evaluation systems enable rapid iteration cycles and clear communication between teams and external partners like OpenAI for model improvement.
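Starting with a handful of test cases and gating CI on a pass rate can be sketched as follows; the model stub, the five cases, and the threshold are all invented for illustration.

```python
def model(question):
    """Stand-in for the system under test (e.g. a RAG pipeline call)."""
    answers = {
        "capital of France": "Paris",
        "2 + 2": "4",
        "largest planet": "Jupiter",
        "HTTP status for not found": "404",
        "author of Hamlet": "Shakespeare",
    }
    return answers.get(question, "I don't know")

# Five seed cases, as in the talk; real suites grow from here.
cases = [
    ("capital of France", "Paris"),
    ("2 + 2", "4"),
    ("largest planet", "Jupiter"),
    ("HTTP status for not found", "404"),
    ("author of Hamlet", "Shakespeare"),
]

results = [model(q) == expected for q, expected in cases]
pass_rate = sum(results) / len(results)

# A CI job would fail the build when the pass rate regresses.
assert pass_rate >= 0.8, f"eval regression: pass rate {pass_rate:.0%}"
```

The value is less in the five cases themselves than in the habit: every regression found in production becomes a new case, and the suite compounds.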
This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.