
Natural Language Query Interface with Production LLM Integration

Honeycomb 2023

Honeycomb implemented a natural language query interface for their observability platform to help users more easily analyze their production data. Rather than creating a chatbot, they focused on a targeted query translation feature using GPT-3.5, achieving a 94% success rate in query generation. The feature led to significant improvements in user activation metrics, with teams using the query assistant being 2-3x more likely to create complex queries and save them to boards.

Industry

Tech


Overview

Honeycomb is an observability company that provides data analysis tools for understanding applications running in production. Their platform is essentially an evolution of traditional APM (Application Performance Monitoring) tools, allowing users to query telemetry data generated by their applications. Philip Carter, a Principal Product Manager at Honeycomb and a maintainer on the OpenTelemetry project, shared their journey of building and deploying a natural language query assistant powered by LLMs.

The core challenge Honeycomb faced was user activation. While their product had excellent product-market fit with SREs and platform engineers, average software developers and product managers struggled with the query interface. Every major pillar of their product involves querying in some form, but the unfamiliar tooling caused many users to simply leave without successfully analyzing their data. This was the largest drop-off point in their product-led growth funnel.

The Solution Architecture

When ChatGPT launched and OpenAI subsequently reduced API prices by two orders of magnitude, Honeycomb saw an opportunity to address this activation problem through natural language querying. They set an ambitious one-month deadline to ship a production feature to all users.

The system works by taking natural language input and producing a JSON specification that matches their query engine’s format. Unlike SQL translation, Honeycomb queries are represented as JSON objects with specific rules and constraints—essentially a visual programming language serialized as structured data. The system also incorporates dataset definitions as context, including canonical representations of common columns like error fields. This allows the model to understand, for example, whether an error column is a Boolean flag or a string containing error messages, and select the appropriate one based on user intent.
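
A minimal sketch of the target shape is shown below; the field names and values are illustrative, not Honeycomb's actual query schema:

```python
# Hypothetical translation of a natural language question into a query spec.
# All keys and values below are illustrative, not Honeycomb's real schema.
natural_language = "what is my error rate, broken down by service?"

generated_query = {
    "calculations": [{"op": "COUNT"}],
    # The model picks the boolean error flag here rather than a string
    # error-message column, based on the dataset definitions in the prompt.
    "filters": [{"column": "error", "op": "=", "value": True}],
    "breakdowns": ["service.name"],
    "time_range": 7200,  # seconds
}
```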

One interesting architectural decision was the use of embeddings to reduce prompt size. Rather than passing entire schemas to the model, they use embeddings to pull out a subset of relevant fields (around the top 50), capturing metadata about this selection process for observability purposes.
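
A minimal sketch of this kind of embedding-based column selection, using a stand-in embedding function purely for illustration (a real system would call an actual embedding model), could look like the following:

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in embedding (hashing trick) for illustration only;
    in practice this would call a real embedding model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def select_relevant_columns(user_input: str, columns: list[str], k: int = 50) -> list[str]:
    """Rank schema columns by cosine similarity to the user's question and
    keep only the top-k, shrinking the prompt sent to the LLM."""
    query_vec = embed(user_input)
    scored = [(float(np.dot(query_vec, embed(col))), col) for col in columns]
    scored.sort(reverse=True)
    return [col for _, col in scored[:k]]

# Hypothetical usage: only the most relevant fields end up in the prompt.
columns = ["duration_ms", "error", "http.status_code", "db.statement", "service.name"]
print(select_relevant_columns("why are my requests slow?", columns, k=3))
```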

Iterative Improvement Through Production Data

A key insight from their experience was that working with real production data was far faster than hypothetical prompt engineering. Once in production, they could directly examine user inputs, LLM outputs, parsing results, and validation errors. This data-driven approach helped them escape the trap of optimizing for hypothetical scenarios.

They instrumented the system thoroughly, capturing around two dozen fields per request as span attributes within traces, including user input, LLM output, OpenAI errors, parsing/validation errors, embedding metadata, and user feedback (yes/no/unsure). This allowed them to group by error types and identify patterns. For example, they discovered that many users asking "what is my error rate" triggered a common parsing error; fixing it through prompt engineering improved the success rate by 6% within a week.
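
A rough sketch of this instrumentation pattern with the OpenTelemetry Python API is shown below; the attribute names and helper functions are hypothetical stand-ins rather than Honeycomb's actual code:

```python
import json
from opentelemetry import trace

tracer = trace.get_tracer("query-assistant")

def call_llm(user_input: str) -> str:
    """Placeholder for the real OpenAI call."""
    return '{"calculations": [{"op": "COUNT"}]}'

def parse_and_validate(raw: str) -> dict:
    """Placeholder for parsing and validating the model output."""
    return json.loads(raw)

def run_query_assistant(user_input: str) -> dict:
    # The case study mentions roughly two dozen attributes per request;
    # only a few illustrative ones are captured here.
    with tracer.start_as_current_span("llm.generate_query") as span:
        span.set_attribute("app.user_input", user_input)
        try:
            raw = call_llm(user_input)
            span.set_attribute("app.llm_output", raw)
            return parse_and_validate(raw)
        except Exception as exc:
            span.set_attribute("app.parse_error", str(exc))
            raise
```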

Their service level objective (SLO) approach was particularly interesting. They initially set a 75% success rate target over seven days, essentially accepting that a quarter of inputs would fail. The actual initial rate was slightly better at 76-77%. Through iterative improvements combining prompt engineering and manual corrections (statically fixing outputs that were “basically correct”), they improved to approximately 94% success rate.
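
The underlying check is simple: the fraction of requests over a rolling seven-day window that produced a valid query, compared against the target. A minimal sketch with illustrative data:

```python
def slo_compliance(events: list[bool], target: float = 0.75) -> tuple[float, bool]:
    """events: one bool per query-assistant request in the last seven days,
    True if the request produced a valid, runnable query."""
    rate = sum(events) / len(events) if events else 0.0
    return rate, rate >= target

# Illustrative data roughly matching the improved ~94% success rate.
rate, ok = slo_compliance([True] * 94 + [False] * 6)
print(f"7-day success rate: {rate:.0%}, SLO met: {ok}")
```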

Cost and ROI Analysis

The cost analysis was refreshingly practical. Carter projected costs based on the volume of manual queries (about a million per month) and estimated approximately $100,000 per year in OpenAI costs. This was framed as less than the cost of a single engineer and comparable to conference sponsorship costs. Importantly, this was only viable because they used GPT-3.5; GPT-4 would have cost tens of millions per year, making it economically unfeasible.
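
As a back-of-envelope check on those figures, the stated volume and budget imply an average cost of less than a cent per generated query:

```python
# Derived from the figures above: ~1M queries/month and ~$100k/year.
queries_per_year = 1_000_000 * 12
annual_budget_usd = 100_000
print(f"~${annual_budget_usd / queries_per_year:.4f} per query")  # ~$0.0083
```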

Rate limiting was noted as an indirect cost concern—OpenAI’s tokens-per-minute limits per API key would eventually become a constraint if the feature became successful enough. This motivated work to reduce prompt sizes and input volumes.

The ROI was measured through two channels. First, qualitative feedback from the sales team indicated the feature was reducing the hand-holding required during the sales process. Second, and more quantifiably, they tracked activation metrics: the percentage of new teams creating “complex queries” (with group-bys and filters) jumped from 15% to 36% for query assistant users, and the percentage adding queries to boards jumped from 6% to nearly 17%. These metrics were identified as leading indicators that correlate strongly with conversion to paying customers.

UI Design Decisions

An important product decision was deliberately not building a chatbot. Early prototypes included chat capabilities that could explain concepts, suggest hypothetical queries for missing data, and engage in multi-turn conversations. They scoped this out for several reasons.

First, sales team feedback indicated users wanted to get a Honeycomb query as fast as possible, not chat with something. Second, they observed that once the query UI was filled out by the assistant, users preferred to click and modify fields directly rather than returning to text input. Third, and critically, a chatbot is “an end user reprogrammable system” that enables rapid iteration by bad actors attempting prompt injection attacks.

By constraining the interface to single-shot query generation rather than conversational interaction, they made attacks significantly more annoying to execute while still providing the core value proposition.

Security and Prompt Injection Mitigations

Honeycomb took prompt injection seriously given that they handle sensitive customer data. They noted that their production data already showed attack patterns—unusual values containing script tags appeared in telemetry when grouped by least frequent unique values. Rather than relying on any single safeguard, they layered multiple mitigations.

The philosophy was making attacks “too difficult to get really juicy info out of” rather than claiming complete security—acknowledging there are easier targets for bad actors.

Challenges with Function Calling

When OpenAI introduced function calling, they tested it but found it unsuited for their use case. Their system needed to handle highly ambiguous inputs—users pasting error messages, trace IDs (16-byte hex values), or even expressions from their derived column DSL. The current prompting approach could generally produce something from these inputs, but function calling more frequently produced nothing because it couldn’t conform the output to the required shape. This highlighted how prompting techniques are not universally transferable between different OpenAI features.

Perspective on Open Source Models

At the time of the discussion, Honeycomb was not considering open source models as a replacement for OpenAI. The reasoning was pragmatic: fine-tuning and deploying open source models wasn’t yet easy enough to justify the investment, especially when OpenAI regularly releases improved models. They acknowledged motivations for self-hosting (control over model updates, latency improvements) but felt the ecosystem wasn’t mature enough for their needs. They expressed interest in a hypothetical marketplace of task-specific models with easy fine-tuning workflows, but this didn’t exist at the time.

Agents and Chaining

Carter explicitly stated they would never use agents or chaining for the query assistant feature. The reasoning was that accuracy compounds negatively with each step—a 90% accuracy rate becomes much worse across multiple chained calls. However, they did see potential for agents in other parts of their product where the trade-off is different: “compile time” features where latency of several minutes is acceptable in exchange for higher quality results, versus the “runtime” concern of query generation where speed matters and users can easily correct minor issues.
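
The arithmetic behind that concern is straightforward: assuming roughly independent failures, per-step accuracy multiplies across a chain, so even a strong per-call success rate degrades quickly end-to-end.

```python
# Illustration of compounding error across chained LLM calls
# (assumes failures are roughly independent between steps).
per_step_accuracy = 0.90
for steps in (1, 2, 3, 5):
    print(f"{steps} chained call(s): {per_step_accuracy ** steps:.0%} end-to-end")
# 1 -> 90%, 2 -> 81%, 3 -> 73%, 5 -> 59%
```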

Organizational Dynamics

An interesting meta-observation was about the convergence of skills needed. Carter advocated for ML engineers becoming more product-minded—doing user interviews, identifying problems worth solving, and understanding business metrics—while product managers should become more data-literate, understanding embeddings, LLM limitations, and data pipelines. The ease of calling an LLM API has shifted the complexity from model training to data quality, instrumentation, and understanding when to snapshot production data for evaluation systems.
