Company
Elastic
Title
Building a Production-Grade GenAI Customer Support Assistant with Comprehensive Observability
Industry
Tech
Year
2024
Summary (short)
Elastic developed a customer support chatbot using generative AI and RAG, focusing heavily on production-grade observability practices. They implemented a comprehensive observability strategy using Elastic's own stack, including APM traces, custom dashboards, alerting systems, and detailed monitoring of LLM interactions. The system successfully launched with features like streaming responses, rate limiting, and abuse prevention, while maintaining high reliability through careful monitoring of latency, errors, and usage patterns.
This case study explores how Elastic built and deployed a production-grade customer support assistant using generative AI, with a particular focus on the observability aspects of running LLMs in production. The Support Assistant integrates with their existing support portal and uses RAG (Retrieval Augmented Generation) to provide accurate responses based on Elastic's documentation and knowledge base.

The observability implementation is particularly noteworthy as it demonstrates a comprehensive approach to monitoring and maintaining an LLM-based system in production. The team utilized several key components of the Elastic Stack:

* Dedicated Monitoring Infrastructure: They maintain a separate Elastic Cloud cluster specifically for monitoring purposes, distinct from their production and staging data clusters. This separation keeps monitoring data cleanly isolated and helps prevent monitoring activities from impacting production workloads.
* APM Integration: The team implemented extensive Application Performance Monitoring using Elastic's Node.js APM client. They created custom spans to track critical metrics like time-to-first-token and total completion time for LLM responses. The APM integration includes correlation between logs and transaction data, enabling detailed debugging and performance analysis (a sketch of this instrumentation appears after the feature list below).
* Comprehensive Dashboarding: Their status dashboard includes key metrics such as:
  - Total chat completions and unique users
  - Latency measurements for RAG searches and chat completions
  - User engagement metrics, including returning users
  - Error rates and types
  - System capacity and usage patterns
* Intelligent Alerting System: They implemented a tiered alerting approach where different severity levels trigger different notification methods (email vs. Slack). Alerts are carefully designed to be actionable, avoiding alert fatigue, with thresholds set from baseline metrics observed during development and testing.

The LLMOps implementation includes several sophisticated production features:

* Streaming Response Handling: They implemented streaming responses with a 10-second timeout for first-token generation, balancing user experience with system reliability. The timeout implementation required careful handling of client-server communication to ensure proper error states and UI updates.
* Request Validation and Safety: The system includes prompt hardening to decline inappropriate requests, with standardized decline messages that can be monitored. Tracking declined requests helps identify potential abuse patterns or areas where the system might need adjustment.
* Rate Limiting: Based on usage analysis from internal testing, they implemented a rate limit of 20 chat completions per hour per user. This limit was chosen to allow legitimate use while preventing abuse.
* Feature Flagging: They developed a feature flag system that allows both enabling and blocking access at the user and organization levels, providing fine-grained control over system access.
* Payload Size Management: The team encountered and solved issues with large context windows from RAG results exceeding payload limits. Their initial solution involved increasing payload size limits, with plans for a more sophisticated approach using result IDs and server-side context assembly.
* Error Monitoring: Comprehensive error tracking includes custom error types for various failure modes, with proper error grouping in APM for effective debugging and trend analysis.
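To make the span instrumentation and the 10-second first-token timeout concrete, here is a minimal sketch using Elastic's Node.js APM agent. The `StreamingLLMClient` interface, function name, span names, and error type are illustrative assumptions, not Elastic's actual code; the agent is assumed to have been started once at process startup.

```typescript
import apm from 'elastic-apm-node';

// Hypothetical streaming LLM client; the real service calls Azure OpenAI.
interface StreamingLLMClient {
  streamChatCompletion(prompt: string): AsyncIterable<string>;
}

const FIRST_TOKEN_TIMEOUT_MS = 10_000; // the 10-second first-token timeout from the case study

export async function streamCompletionWithSpans(
  client: StreamingLLMClient,
  prompt: string,
  onToken: (token: string) => void, // e.g. writes each chunk to the HTTP response stream
): Promise<void> {
  // Custom spans for the two latencies the team dashboards:
  // time-to-first-token and total completion time.
  const totalSpan = apm.startSpan('chat_completion.total');
  const firstTokenSpan = apm.startSpan('chat_completion.first_token');
  let firstTokenRecorded = false;

  const iterator = client.streamChatCompletion(prompt)[Symbol.asyncIterator]();
  let timer: NodeJS.Timeout | undefined;

  try {
    // Race the first chunk against the timeout so a stalled model call fails fast
    // and the error can propagate through the streaming response chain.
    const first = await Promise.race([
      iterator.next(),
      new Promise<never>((_, reject) => {
        timer = setTimeout(
          () => reject(new Error('LLMFirstTokenTimeout')),
          FIRST_TOKEN_TIMEOUT_MS,
        );
      }),
    ]);
    firstTokenSpan?.end(); // records time-to-first-token
    firstTokenRecorded = true;

    if (!first.done) {
      onToken(first.value);
      let next = await iterator.next();
      while (!next.done) {
        onToken(next.value);
        next = await iterator.next();
      }
    }
  } catch (err) {
    // Custom error types group cleanly in APM for debugging and trend analysis.
    apm.captureError(err as Error);
    throw err; // let the route handler surface a proper error state to the UI
  } finally {
    if (timer) clearTimeout(timer);
    if (!firstTokenRecorded) firstTokenSpan?.end();
    totalSpan?.end(); // records total completion time
  }
}
```

A route handler calling this function would catch the rethrown timeout error and send the client a terminal error event, which matches the case study's emphasis on keeping error states consistent during streaming failures.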
The monitoring stack uses multiple data collection methods:

- Elastic APM for application performance data
- Filebeat for log collection
- Synthetic monitoring for endpoint health checks
- Stack monitoring for cluster health
- Azure OpenAI integration for LLM-specific metrics

One particularly interesting aspect is their handling of observability data during development. They advocate for integrating monitoring tools early, allowing collection of baseline metrics and early detection of potential issues before production deployment.

The case study also demonstrates thoughtful consideration of failure modes and edge cases. For example, their handling of timeout scenarios includes proper error propagation through the streaming response chain, ensuring a consistent user experience even during failures.

The implementation shows careful attention to security and abuse prevention, with multiple layers of protection including authentication requirements, rate limiting, and the ability to quickly disable access for problematic users (a minimal rate-limiter sketch follows at the end of this section).

This case study represents a mature approach to running LLMs in production, with particular strength in its comprehensive observability implementation. The system successfully launched and has been handling production traffic, with the observability tools proving valuable both for monitoring system health and for continuous improvement of the service.
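As a rough illustration of the rate-limiting layer described above (20 chat completions per hour per user), the sketch below uses an in-memory sliding window. This is purely an assumption for demonstration; the case study does not describe Elastic's actual storage or enforcement mechanism.

```typescript
// Illustrative per-user rate limiter: 20 chat completions per rolling hour.
const COMPLETIONS_PER_HOUR = 20;
const WINDOW_MS = 60 * 60 * 1000;

// userId -> timestamps (ms) of recent chat completions (in-memory for illustration only)
const completionLog = new Map<string, number[]>();

export function tryConsumeCompletion(userId: string, now = Date.now()): boolean {
  // Keep only completions inside the rolling one-hour window.
  const recent = (completionLog.get(userId) ?? []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= COMPLETIONS_PER_HOUR) {
    completionLog.set(userId, recent);
    return false; // over the limit: respond with a standardized decline that dashboards can count
  }
  recent.push(now);
  completionLog.set(userId, recent);
  return true;
}
```

A production system would presumably persist these counters in a shared store so the limit holds across service instances and restarts, and would emit the declined requests as monitorable events, consistent with the standardized decline messages described earlier.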
