Elastic's Field Engineering team developed a customer support chatbot using RAG instead of fine-tuning, leveraging Elasticsearch for document storage and retrieval. They created a knowledge library of over 300,000 documents from technical support articles, product documentation, and blogs, enriched with AI-generated summaries and with ELSER embeddings. The system uses hybrid search combining semantic and BM25 approaches to provide relevant context to the LLM, resulting in more accurate and trustworthy responses.
This case study explores how Elastic implemented a production-grade customer support chatbot using Retrieval-Augmented Generation (RAG) and their own search technology. The project offers valuable insights into the practical challenges and solutions of deploying LLMs in a production environment, particularly for technical customer support applications.
The team initially evaluated both fine-tuning and RAG, ultimately choosing RAG for practical reasons: the difficulty of creating paired question-answer training data at scale, the need for real-time updates as documentation changes, and the requirement for role-based access control to sensitive information. This pragmatic choice demonstrates the importance of weighing operational requirements, not just model performance, when designing LLM systems.
The architecture is built around a sophisticated knowledge library containing over 300,000 documents from multiple sources:
* Technical support articles written by Support Engineers
* Product documentation across 114 unique versions
* Blogs and other technical content
* Search/Security/Observability Labs content
A key architectural decision was consolidating multiple storage systems (previously split across Swiftype and Elastic App Search) into a single Elasticsearch index with document-level security. This simplified the infrastructure while enabling role-based access control, showing how security considerations must be built into the core design of production LLM systems.
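As a rough illustration of that design, document-level security in Elasticsearch attaches a filter query to a role, so every search a role holder runs is silently restricted to matching documents. The role, index, and `access_level` field names below are hypothetical stand-ins for whatever schema the team actually uses:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")  # hypothetical cluster

# Users granted this role can read the knowledge index, but only ever
# see documents matching the role's query; restricted articles stay
# invisible at search time.
es.security.put_role(
    name="support-chatbot-reader",
    indices=[
        {
            "names": ["support-knowledge-library"],
            "privileges": ["read"],
            "query": {"term": {"access_level": "public"}},
        }
    ],
)
```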
The team implemented several innovative technical solutions for document processing and enrichment:
* Used Crawlee for large-scale web crawling, running on Google Cloud Run with 24-hour job timeouts (see the crawler sketch after this list)
* Implemented custom request handlers for different document types to ensure consistent document structure
* Created an automated enrichment pipeline using OpenAI GPT-3.5 Turbo to generate document summaries and relevant questions
* Utilized ELSER (Elastic Learned Sparse EncodeR) for creating and storing sparse document embeddings (see the ingest sketch after this list)
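To make the crawling step concrete, here is a rough sketch using Crawlee's Python port, with labeled request handlers giving each document type its own extraction logic. The handler bodies, field names, and crawl limits are illustrative assumptions, not Elastic's actual code, and the Crawlee import path varies between releases:

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=1000)

    # One labeled handler per document type keeps the extracted structure
    # consistent across sources (blogs, docs, support articles, ...).
    @crawler.router.handler("BLOG")
    async def blog_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.push_data({
            "url": context.request.url,
            "title": context.soup.title.string if context.soup.title else None,
            "content": context.soup.get_text(" ", strip=True),
            "source": "blog",
        })

    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Route discovered blog pages to the labeled handler above.
        await context.enqueue_links(label="BLOG")

    await crawler.run(["https://www.elastic.co/blog"])


if __name__ == "__main__":
    asyncio.run(main())
```

For the embedding step, a sketch of ELSER at ingest time: an inference pipeline runs each document through the deployed `.elser_model_2` model and stores the output in `sparse_vector` fields. The index, pipeline, and field names are assumptions, and the `input_output` option requires Elasticsearch 8.11 or later:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")  # hypothetical cluster

# Mapping: ELSER output lives in sparse_vector fields alongside the text.
es.indices.create(
    index="support-knowledge-library",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "summary": {"type": "text"},
            "content": {"type": "text"},
            "title_embedding": {"type": "sparse_vector"},
            "summary_embedding": {"type": "sparse_vector"},
        }
    },
)

# Ingest pipeline: embed title and summary with ELSER at index time.
es.ingest.put_pipeline(
    id="elser-enrichment",
    processors=[
        {
            "inference": {
                "model_id": ".elser_model_2",
                "input_output": [
                    {"input_field": "title", "output_field": "title_embedding"},
                    {"input_field": "summary", "output_field": "summary_embedding"},
                ],
            }
        }
    ],
)
```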
The search implementation uses a hybrid approach combining semantic search on titles and summaries with BM25 on full content. This demonstrates the importance of carefully designing the retrieval system in RAG applications, as the quality of context provided to the LLM directly impacts response accuracy.
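Under the assumptions above (an `es` client, ELSER embeddings on title and summary, BM25 on the body), a hedged sketch of such a hybrid query might look like this; `text_expansion` is the Elasticsearch 8.x query type for ELSER fields:

```python
query_text = "How do I configure index lifecycle management?"

resp = es.search(
    index="support-knowledge-library",
    query={
        "bool": {
            "should": [
                # Semantic recall: ELSER expansion over title and summary.
                {"text_expansion": {"title_embedding": {
                    "model_id": ".elser_model_2", "model_text": query_text}}},
                {"text_expansion": {"summary_embedding": {
                    "model_id": ".elser_model_2", "model_text": query_text}}},
                # Lexical recall: BM25 over the full document body.
                {"match": {"content": query_text}},
            ]
        }
    },
    size=3,  # keep the context passed to the LLM small and precise
)
passages = [hit["_source"]["summary"] for hit in resp["hits"]["hits"]]
```

Because `should` clauses add their scores together, a per-clause `boost` can tune the balance between semantic and lexical relevance.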
A particularly interesting aspect of their implementation is the automated document enrichment process. The team created a service that uses GPT-3.5 Turbo to generate summaries and potential questions for documents that lack human-written summaries. This expanded their summary coverage sixteenfold, from 8,000 human-written summaries to 128,000 in total. The enrichment jobs were deliberately throttled to avoid overloading their LLM instance, showing the importance of considering infrastructure limitations in production systems.
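A minimal sketch of such a throttled enrichment worker, assuming the OpenAI Python SDK; the prompt wording, pacing, and helper names are illustrative rather than Elastic's actual implementation:

```python
import time

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ENRICH_PROMPT = (
    "Summarize the following support document in two or three sentences, "
    "then list three questions a customer might ask that it answers.\n\n{doc}"
)


def enrich(document_text: str) -> str:
    """Generate an AI summary plus candidate questions for one document."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": ENRICH_PROMPT.format(doc=document_text)}],
        temperature=0.2,  # keep summaries consistent across runs
    )
    return response.choices[0].message.content


def enrich_backlog(docs, delay_seconds: float = 1.0):
    """Trickle through the backlog so the LLM endpoint is never saturated."""
    for doc in docs:
        yield doc["id"], enrich(doc["content"])
        time.sleep(delay_seconds)  # crude rate limit; a token bucket also works
```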
The team's focus on observability and continuous improvement is evident in their development approach. They:
* Regularly measure user sentiment and response accuracy
* Push frequent small updates to production to minimize risk
* Analyze user trends to identify knowledge gaps
* Continuously test new hypotheses about search relevance and context size
Some key learnings from the project include:
* Smaller, more precise context leads to more deterministic LLM responses
* Document-level security is crucial for managing access control at scale
* Understanding user intent requires trying multiple search query approaches
* Analysis of user search patterns helps identify knowledge gaps
The project also revealed several important operational considerations:
* The need to balance enrichment processing load against production system performance
* The importance of maintaining consistent document structure across diverse sources
* The value of automated processes for maintaining and updating the knowledge base
* The need for careful prompt engineering in both the enrichment process and final user interactions
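To illustrate that last point, here is a hedged sketch of how retrieved passages might be assembled into the final user-facing prompt; the system prompt wording is invented for illustration, not Elastic's actual prompt:

```python
SYSTEM_PROMPT = (
    "You are a technical support assistant for Elastic products. "
    "Answer only from the provided context; if the context is "
    "insufficient, say you do not know."
)


def build_messages(question: str, passages: list[str]) -> list[dict]:
    """Assemble a grounded chat prompt from retrieved passages."""
    context = "\n\n---\n\n".join(passages)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

Keeping the model grounded strictly in retrieved context is what lets the smaller, more precise context windows noted above translate into more deterministic answers.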
The team continues to evolve the system, planning future enhancements such as technical diagram support and GitHub issue integration. This ongoing development approach shows how production LLM systems require continuous refinement and expansion to meet user needs effectively.
This case study demonstrates the complexity of implementing LLMs in production, particularly the importance of surrounding infrastructure and careful system design. It shows how successful LLMOps requires a holistic approach that considers not just the model interaction, but also data management, security, scalability, and continuous improvement processes.