
Natural Language Query Interface with Production LLM Integration

Honeycomb 2023

Honeycomb implemented a natural language query interface for their observability platform to help users more easily analyze their production data. Rather than creating a chatbot, they focused on a targeted query translation feature using GPT-3.5, achieving a 94% success rate in query generation. The feature led to significant improvements in user activation metrics, with teams using the query assistant being 2-3x more likely to create complex queries and save them to boards.

Industry

Tech


Overview

Honeycomb is an observability company that provides data analysis tools for understanding applications running in production. Their platform is essentially an evolution of traditional APM (Application Performance Monitoring) tools, allowing users to query telemetry data generated by their applications. Philip Carter, a Principal Product Manager at Honeycomb and a maintainer on the OpenTelemetry project, shared their journey of building and deploying a natural language query assistant powered by LLMs.

The core challenge Honeycomb faced was user activation. While their product had excellent product-market fit with SREs and platform engineers, average software developers and product managers struggled with the query interface. Every major pillar of their product involves querying in some form, but the unfamiliar tooling caused many users to simply leave without successfully analyzing their data. This was the largest drop-off point in their product-led growth funnel.

The Solution Architecture

When ChatGPT launched and OpenAI subsequently reduced API prices by two orders of magnitude, Honeycomb saw an opportunity to address this activation problem through natural language querying. They set an ambitious one-month deadline to ship a production feature to all users.

The system works by taking natural language input and producing a JSON specification that matches their query engine’s format. Unlike SQL translation, Honeycomb queries are represented as JSON objects with specific rules and constraints—essentially a visual programming language serialized as structured data. The system also incorporates dataset definitions as context, including canonical representations of common columns like error fields. This allows the model to understand, for example, whether an error column is a Boolean flag or a string containing error messages, and select the appropriate one based on user intent.
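
A minimal sketch of the target shape is shown below; the field names and values are illustrative, not Honeycomb's actual query schema:

```python
# Hypothetical translation of a natural language question into a query spec.
# All keys and values below are illustrative, not Honeycomb's real schema.
natural_language = "what is my error rate, broken down by service?"

generated_query = {
    "calculations": [{"op": "COUNT"}],
    # The model picks the boolean error flag here rather than a string
    # error-message column, based on the dataset definitions in the prompt.
    "filters": [{"column": "error", "op": "=", "value": True}],
    "breakdowns": ["service.name"],
    "time_range": 7200,  # seconds
}
```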

One interesting architectural decision was the use of embeddings to reduce prompt size. Rather than passing entire schemas to the model, they use embeddings to pull out a subset of relevant fields (around the top 50), capturing metadata about this selection process for observability purposes.
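
A minimal sketch of this kind of embedding-based column selection, using a stand-in embedding function purely for illustration (a real system would call an actual embedding model), could look like the following:

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in embedding (hashing trick) for illustration only;
    in practice this would call a real embedding model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def select_relevant_columns(user_input: str, columns: list[str], k: int = 50) -> list[str]:
    """Rank schema columns by cosine similarity to the user's question and
    keep only the top-k, shrinking the prompt sent to the LLM."""
    query_vec = embed(user_input)
    scored = [(float(np.dot(query_vec, embed(col))), col) for col in columns]
    scored.sort(reverse=True)
    return [col for _, col in scored[:k]]

# Hypothetical usage: only the most relevant fields end up in the prompt.
columns = ["duration_ms", "error", "http.status_code", "db.statement", "service.name"]
print(select_relevant_columns("why are my requests slow?", columns, k=3))
```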

Iterative Improvement Through Production Data

A key insight from their experience was that working with real production data was far faster than hypothetical prompt engineering. Once in production, they could directly examine user inputs, LLM outputs, parsing results, and validation errors. This data-driven approach helped them escape the trap of optimizing for hypothetical scenarios.

They instrumented the system thoroughly, capturing around two dozen fields per request as span attributes within traces, including user input, LLM output, OpenAI errors, parsing/validation errors, embedding metadata, and user feedback (yes/no/unsure). This allowed them to group by error types and identify patterns. For example, they discovered that many users asking "what is my error rate" triggered a common parsing error; fixing it through prompt engineering improved the success rate by 6% within a week.
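
A rough sketch of this instrumentation pattern with the OpenTelemetry Python API is shown below; the attribute names and helper functions are hypothetical stand-ins rather than Honeycomb's actual code:

```python
import json
from opentelemetry import trace

tracer = trace.get_tracer("query-assistant")

def call_llm(user_input: str) -> str:
    """Placeholder for the real OpenAI call."""
    return '{"calculations": [{"op": "COUNT"}]}'

def parse_and_validate(raw: str) -> dict:
    """Placeholder for parsing and validating the model output."""
    return json.loads(raw)

def run_query_assistant(user_input: str) -> dict:
    # The case study mentions roughly two dozen attributes per request;
    # only a few illustrative ones are captured here.
    with tracer.start_as_current_span("llm.generate_query") as span:
        span.set_attribute("app.user_input", user_input)
        try:
            raw = call_llm(user_input)
            span.set_attribute("app.llm_output", raw)
            return parse_and_validate(raw)
        except Exception as exc:
            span.set_attribute("app.parse_error", str(exc))
            raise
```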

Their service level objective (SLO) approach was particularly interesting. They initially set a 75% success rate target over seven days, essentially accepting that a quarter of inputs would fail. The actual initial rate was slightly better at 76-77%. Through iterative improvements combining prompt engineering and manual corrections (statically fixing outputs that were “basically correct”), they improved to approximately 94% success rate.
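
The underlying check is simple: the fraction of requests over a rolling seven-day window that produced a valid query, compared against the target. A minimal sketch with illustrative data:

```python
def slo_compliance(events: list[bool], target: float = 0.75) -> tuple[float, bool]:
    """events: one bool per query-assistant request in the last seven days,
    True if the request produced a valid, runnable query."""
    rate = sum(events) / len(events) if events else 0.0
    return rate, rate >= target

# Illustrative data roughly matching the improved ~94% success rate.
rate, ok = slo_compliance([True] * 94 + [False] * 6)
print(f"7-day success rate: {rate:.0%}, SLO met: {ok}")
```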

Cost and ROI Analysis

The cost analysis was refreshingly practical. Carter projected costs based on the volume of manual queries (about a million per month) and estimated approximately $100,000 per year in OpenAI costs. This was framed as less than the cost of a single engineer and comparable to conference sponsorship costs. Importantly, this was only viable because they used GPT-3.5; GPT-4 would have cost tens of millions per year, making it economically unfeasible.
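
As a back-of-envelope check on those figures, the stated volume and budget imply an average cost of less than a cent per generated query:

```python
# Derived from the figures above: ~1M queries/month and ~$100k/year.
queries_per_year = 1_000_000 * 12
annual_budget_usd = 100_000
print(f"~${annual_budget_usd / queries_per_year:.4f} per query")  # ~$0.0083
```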

Rate limiting was noted as an indirect cost concern—OpenAI’s tokens-per-minute limits per API key would eventually become a constraint if the feature became successful enough. This motivated work to reduce prompt sizes and input volumes.

The ROI was measured through two channels. First, qualitative feedback from the sales team indicated the feature was reducing the hand-holding required during the sales process. Second, and more quantifiably, they tracked activation metrics: the percentage of new teams creating “complex queries” (with group-bys and filters) jumped from 15% to 36% for query assistant users, and the percentage adding queries to boards jumped from 6% to nearly 17%. These metrics were identified as leading indicators that correlate strongly with conversion to paying customers.

UI Design Decisions

An important product decision was deliberately not building a chatbot. Early prototypes included chat capabilities that could explain concepts, suggest hypothetical queries for missing data, and engage in multi-turn conversations. They scoped this out for several reasons.

First, sales team feedback indicated users wanted to get a Honeycomb query as fast as possible, not chat with something. Second, they observed that once the query UI was filled out by the assistant, users preferred to click and modify fields directly rather than returning to text input. Third, and critically, a chatbot is “an end user reprogrammable system” that enables rapid iteration by bad actors attempting prompt injection attacks.

By constraining the interface to single-shot query generation rather than conversational interaction, they made attacks significantly more annoying to execute while still providing the core value proposition.

Security and Prompt Injection Mitigations

Honeycomb took prompt injection seriously given that they handle sensitive customer data. They noted that their production data already showed attack patterns—unusual values containing script tags appeared in telemetry when grouped by least frequent unique values. Rather than relying on any single safeguard, they layered multiple mitigations.

The philosophy was making attacks “too difficult to get really juicy info out of” rather than claiming complete security—acknowledging there are easier targets for bad actors.

Challenges with Function Calling

When OpenAI introduced function calling, they tested it but found it unsuited for their use case. Their system needed to handle highly ambiguous inputs—users pasting error messages, trace IDs (16-byte hex values), or even expressions from their derived column DSL. The current prompting approach could generally produce something from these inputs, but function calling more frequently produced nothing because it couldn’t conform the output to the required shape. This highlighted how prompting techniques are not universally transferable between different OpenAI features.

Perspective on Open Source Models

At the time of the discussion, Honeycomb was not considering open source models as a replacement for OpenAI. The reasoning was pragmatic: fine-tuning and deploying open source models wasn’t yet easy enough to justify the investment, especially when OpenAI regularly releases improved models. They acknowledged motivations for self-hosting (control over model updates, latency improvements) but felt the ecosystem wasn’t mature enough for their needs. They expressed interest in a hypothetical marketplace of task-specific models with easy fine-tuning workflows, but this didn’t exist at the time.

Agents and Chaining

Carter explicitly stated they would never use agents or chaining for the query assistant feature. The reasoning was that accuracy compounds negatively with each step—a 90% accuracy rate becomes much worse across multiple chained calls. However, they did see potential for agents in other parts of their product where the trade-off is different: “compile time” features where latency of several minutes is acceptable in exchange for higher quality results, versus the “runtime” concern of query generation where speed matters and users can easily correct minor issues.
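
The arithmetic behind that concern is straightforward: assuming roughly independent failures, per-step accuracy multiplies across a chain, so even a strong per-call success rate degrades quickly end-to-end.

```python
# Illustration of compounding error across chained LLM calls
# (assumes failures are roughly independent between steps).
per_step_accuracy = 0.90
for steps in (1, 2, 3, 5):
    print(f"{steps} chained call(s): {per_step_accuracy ** steps:.0%} end-to-end")
# 1 -> 90%, 2 -> 81%, 3 -> 73%, 5 -> 59%
```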

Organizational Dynamics

An interesting meta-observation was about the convergence of skills needed. Carter advocated for ML engineers becoming more product-minded—doing user interviews, identifying problems worth solving, and understanding business metrics—while product managers should become more data-literate, understanding embeddings, LLM limitations, and data pipelines. The ease of calling an LLM API has shifted the complexity from model training to data quality, instrumentation, and understanding when to snapshot production data for evaluation systems.
