Honeycomb implemented a Query Assistant powered by LLMs to help users better understand and utilize their observability platform's querying capabilities. The feature was developed rapidly with a "ship to learn" mindset, using GPT-3.5 Turbo and text embeddings. While the initial adoption varied across pricing tiers (82% Enterprise/Pro, 75% Self-Serve, 39% Free) and some metrics didn't meet expectations, it achieved significant successes: teams using Query Assistant showed 26.5% retention in manual querying vs 4.5% for non-users, higher complex query creation (33% vs 15.7%), and increased board creation (11% vs 3.6%). Notably, the implementation proved extremely cost-effective at around $30/month in OpenAI costs, demonstrated strong integration with existing workflows, and revealed unexpected user behaviors like handling DSL expressions and trace IDs. The project validated Honeycomb's approach to AI integration while providing valuable insights for future AI features.
Honeycomb, an observability platform company, developed Query Assistant, an LLM-powered feature that translates natural language into structured Honeycomb queries. This case study provides an unusually transparent look at the entire lifecycle of shipping an AI product feature, from initial development through production iteration, measuring real business impact, and managing operational costs. The case study is notable for its honest assessment of both successes and areas where the feature fell short of expectations.
Honeycomb’s core business value proposition depends on users actively querying their data. However, the platform has a notable learning curve, particularly for users without prior experience with observability or monitoring tools. Users often struggle to map their mental model of their data and questions into Honeycomb’s query interface. This learning curve directly impacts business metrics, as active querying correlates with users upgrading to paid pricing tiers and instrumenting more services.
The Query Assistant translates natural language inputs into Honeycomb Query objects. The technical stack centers on GPT-3.5 Turbo for query generation and OpenAI text embeddings, with all inference served through OpenAI's hosted API.
The team invested significant effort in prompt engineering to reduce token usage. Each GPT-3.5 request uses approximately 1,800 input tokens and 100 response tokens, while embedding requests use at most 100 tokens.
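The case study does not include Honeycomb's actual prompt or code, but a minimal sketch of the request shape might look like the following, assuming a system prompt that embeds the dataset's column names and asks the model to return a Honeycomb query as JSON. The prompt text, function name, and key list here are illustrative, not Honeycomb's real implementation:

```python
# Illustrative sketch only: Honeycomb's real prompt and schema-selection logic are not public.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = """\
You translate natural language questions into Honeycomb query JSON.
Respond with a single JSON object using only these keys:
calculations, filters, filter_combination, breakdowns, orders, time_range.
Only use columns from the provided schema."""

def generate_query(user_input: str, schema_columns: list[str]) -> dict:
    """Turn a natural-language question into a Honeycomb-style query object."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep output as repeatable as the model allows
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Schema columns: {', '.join(schema_columns)}\n"
                                        f"Question: {user_input}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Keeping the schema context tight is what holds the input side to roughly 1,800 tokens per request.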
One of the most valuable insights from this case study is Honeycomb’s approach to LLM development. The team explicitly rejects the notion that traditional software development practices transfer cleanly to LLM-powered features.
Their solution was to adopt a “ship to learn” mindset, deploying rapidly and iterating based on production data; at times they shipped updates daily. This approach required instrumenting the feature heavily and monitoring its behavior in production, most notably with SLOs.
The use of SLOs is particularly interesting. Because conventional regression tests assume deterministic outputs, SLOs serve as a proxy for ensuring that improvements don’t degrade previously working behavior. This represents a shift from deterministic pass/fail testing to probabilistic monitoring of system behavior over time.
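The write-up does not show how the SLO is defined, but conceptually it is a success-ratio target computed over production events rather than a fixed test suite. A rough sketch, assuming each Query Assistant invocation is recorded as an event with a boolean indicating whether it produced a runnable query:

```python
from dataclasses import dataclass

@dataclass
class QueryAssistantEvent:
    produced_runnable_query: bool  # did the LLM output parse into a valid Honeycomb query?

def slo_compliance(events: list[QueryAssistantEvent], target: float = 0.95) -> tuple[float, bool]:
    """Return the observed success ratio over a rolling window and whether it meets the target."""
    if not events:
        return 1.0, True
    ratio = sum(e.produced_runnable_query for e in events) / len(events)
    return ratio, ratio >= target

# A prompt change "passes" not by rerunning fixed test cases, but by keeping
# this ratio above target as real traffic flows through the new prompt.
```

The 95% target is a placeholder; the point is that regressions surface as SLO burn rather than as failed unit tests.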
The case study provides remarkably detailed metrics on Query Assistant’s effectiveness:
Adoption by Pricing Tier: 82% in the Enterprise/Pro tier, 75% in Self-Serve, and 39% in the Free tier.
Manual Query Retention (Week 6): 26.5% for teams that used Query Assistant versus 4.5% for teams that did not.
This 6x difference in retention is one of the strongest signals reported and suggests the feature successfully “graduates” users to manual querying rather than creating dependency.
Complex Query Creation: 33% for Query Assistant users versus 15.7% for non-users.
The team intentionally designed Query Assistant to emit more complex queries with multiple WHERE and GROUP BY clauses to demonstrate the interface’s flexibility (see the illustrative query object following these metrics).
Board Creation (Strong Activation Signal): 11% for Query Assistant users versus 3.6% for non-users.
Trigger Creation (Strongest Activation Signal): the correlation here was notably weaker and inconsistent across measurement windows, suggesting Query Assistant doesn’t significantly influence alerting decisions.
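For reference, a query of the kind described above, with multiple WHERE (filter) clauses and GROUP BY (breakdown) clauses, might look like the following as a Honeycomb query object. The column names are invented and the structure follows Honeycomb's publicly documented query specification; it is not an example from the case study itself:

```python
# Hypothetical example of a "complex" generated query: several filters plus breakdowns.
complex_query = {
    "calculations": [
        {"op": "COUNT"},
        {"op": "P99", "column": "duration_ms"},
    ],
    "filters": [
        {"column": "service.name", "op": "=", "value": "checkout"},
        {"column": "http.status_code", "op": ">=", "value": 500},
        {"column": "duration_ms", "op": ">", "value": 250},
    ],
    "filter_combination": "AND",                 # WHERE clauses combined with AND
    "breakdowns": ["http.route", "error.type"],  # GROUP BY these columns
    "orders": [{"op": "P99", "column": "duration_ms", "order": "descending"}],
    "time_range": 7200,                          # last two hours, in seconds
}
```

Queries like this exercise far more of the interface than the single-clause queries new users typically write by hand.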
The operational costs are remarkably low, which is a key finding for organizations considering LLM integration: total OpenAI spend runs around $30 per month.
The low cost is attributed to several factors, chiefly the aggressive prompt engineering that keeps per-request token counts small and the decision to serve production traffic with GPT-3.5-turbo rather than GPT-4.
The team provides practical advice: use GPT-4 for prototyping but invest in prompt engineering to make GPT-3.5-turbo work reliably for production.
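A back-of-envelope calculation shows why the bill stays so small. The token counts below come from the case study; the per-token prices are the mid-2023 OpenAI list prices (which may have changed since), and the monthly request volume is a hypothetical placeholder, not a figure Honeycomb reported:

```python
# Back-of-envelope cost check using the token counts from the case study.
INPUT_TOKENS_PER_REQUEST = 1_800
OUTPUT_TOKENS_PER_REQUEST = 100
EMBEDDING_TOKENS_PER_REQUEST = 100

GPT35_INPUT_PRICE = 0.0015 / 1_000   # USD per input token (mid-2023 list price)
GPT35_OUTPUT_PRICE = 0.002 / 1_000   # USD per output token
EMBEDDING_PRICE = 0.0001 / 1_000     # USD per embedding token (ada-002)

def monthly_cost(requests_per_month: int) -> float:
    per_request = (
        INPUT_TOKENS_PER_REQUEST * GPT35_INPUT_PRICE
        + OUTPUT_TOKENS_PER_REQUEST * GPT35_OUTPUT_PRICE
        + EMBEDDING_TOKENS_PER_REQUEST * EMBEDDING_PRICE
    )
    return per_request * requests_per_month

# A hypothetical 10,000 requests/month works out to roughly $29,
# consistent with the ~$30/month figure reported.
print(f"${monthly_cost(10_000):.2f}")
```

At well under a cent per request, volume would have to grow enormously before cost became a meaningful constraint.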
At launch in May 2023, latency was a significant problem.
By October 2023, OpenAI had substantially improved their infrastructure, and Query Assistant’s response times improved along with it.
This highlights a dependency risk for LLM-powered features: performance depends partly on the model provider’s infrastructure improvements.
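Honeycomb naturally observes the feature with its own telemetry; the exact instrumentation isn't shown in the case study, but a sketch of wrapping the LLM call in an OpenTelemetry span (attribute names are illustrative, and `generate_query` is the hypothetical helper from the earlier sketch) shows how a provider-side latency regression would surface directly in their own dashboards and SLOs:

```python
from opentelemetry import trace

tracer = trace.get_tracer("query-assistant")

def timed_llm_call(user_input: str, schema_columns: list[str]) -> dict:
    """Wrap the OpenAI request in a span so provider-side latency shows up in Honeycomb."""
    with tracer.start_as_current_span("query_assistant.generate") as span:
        span.set_attribute("llm.model", "gpt-3.5-turbo")    # illustrative attribute names
        span.set_attribute("llm.input_chars", len(user_input))
        result = generate_query(user_input, schema_columns)  # from the earlier sketch
        span.set_attribute("llm.output_keys", ",".join(result.keys()))
        return result
```

The span duration captures OpenAI's end-to-end response time on every request, so improvements (or regressions) on the provider's side are visible without any code changes.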
The case study documents unexpected user behaviors that the team never anticipated or tested for:
DSL Expression Parsing: Users pasted Derived Column expressions (a completely different DSL from another part of the product) into Query Assistant, and it successfully generated runnable queries. Users even marked results as helpful. This demonstrates GPT-3.5-turbo’s ability to generalize beyond the specific use case it was prompted for.
Trace ID Recognition: Users pasted 16-byte hex-encoded trace IDs with no other context, and Query Assistant correctly inferred they wanted to filter by that trace ID. The team believes this works because GPT-3.5-turbo’s training data includes enough tracing context to recognize the pattern.
Query Modification: Users frequently use Query Assistant to modify existing queries rather than building from scratch. The team includes the existing query as context in the prompt, and the model reliably distinguishes between modification requests and new query requests. This feature was added within 30 minutes of launch based on immediate user feedback.
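The case study doesn't show how the existing query is injected, but a minimal sketch of the prompt assembly might look like the following, reusing the hypothetical `SYSTEM_PROMPT` from the earlier sketch; the wording is invented, not Honeycomb's:

```python
# Sketch of passing the currently viewed query as context so the model can
# either modify it or start fresh. Prompt wording is illustrative.
import json

def build_messages(user_input: str, schema_columns: list[str], existing_query: dict | None):
    context = f"Schema columns: {', '.join(schema_columns)}\n"
    if existing_query is not None:
        context += (
            "The user is currently viewing this query:\n"
            f"{json.dumps(existing_query)}\n"
            "If their request refers to it, modify it; otherwise create a new query.\n"
        )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # from the earlier sketch
        {"role": "user", "content": context + f"Request: {user_input}"},
    ]
```

Because the existing query rides along in every request, "make that a P99 instead" and "show me slow requests by endpoint" can be handled by the same prompt without any explicit intent classification.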
The team fed customer feedback directly into their iteration process; Intercom in particular provided granular feedback about query types and where Query Assistant fell short. This feedback directly influenced a feature that lets teams define Suggested Queries to guide the model toward better accuracy for schemas with custom field names.
Sales team feedback indicated Query Assistant helps shorten the introductory phase of enterprise sales cycles by quickly demonstrating “time to value,” even though it doesn’t automatically close deals.
The case study is notably honest about where the feature underperformed: adoption in the Free tier lagged well behind the paid tiers, and several metrics fell short of the team’s expectations.
The case study offers several actionable insights: ship early and iterate against production data rather than waiting for a perfect launch, use SLOs instead of deterministic tests to guard against regressions, prototype with GPT-4 but invest in prompt engineering so GPT-3.5-turbo can carry production traffic, and expect operational costs far lower than most teams anticipate.