Elastic developed ElasticGPT, an internal generative AI assistant built on their own technology stack to provide secure, context-aware knowledge discovery for employees. The system combines RAG (Retrieval Augmented Generation) capabilities, delivered through their SmartSource framework, with private access to OpenAI's GPT models, using Elasticsearch as the underlying vector database. The solution demonstrates how to build a production-grade AI assistant that maintains security and compliance while delivering efficient knowledge retrieval and generation.
This case study explores how Elastic built ElasticGPT, their internal generative AI assistant, as a demonstration of implementing LLMs in a production enterprise environment. The implementation serves dual purposes: providing real utility to Elastic employees while also serving as a "customer zero" reference architecture for clients looking to build similar systems.
The core of the system is SmartSource, their RAG-based framework that combines Elasticsearch's vector search capabilities with OpenAI's GPT models. This architecture choice reflects a pragmatic approach to enterprise AI deployment: using retrieval to ground LLM responses in authoritative internal data while leveraging best-in-class foundation models for generation.
Key aspects of the production implementation include:
**Architecture and Infrastructure**
* The system is built entirely on Elastic Cloud using Kubernetes for orchestration
* Elasticsearch serves as both the vector database for RAG and the storage system for chat data
* They maintain a 30-day retention policy for chat data while preserving metrics longer term (a retention-policy sketch follows this list)
* The frontend is built with React and their EUI framework, emphasizing maintainability
* The API layer uses a stateless, streaming design for real-time response delivery
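The 30-day retention policy maps naturally onto Elasticsearch index lifecycle management (ILM). The case study doesn't show the actual configuration, but a minimal sketch using the elasticsearch Python client might look like the following; the cluster endpoint, policy name, and index pattern are placeholders, not Elastic's real settings.

```python
from elasticsearch import Elasticsearch

# Hypothetical cluster endpoint and credentials for illustration only.
es = Elasticsearch("https://internal-elastic-cloud:9243", api_key="...")

# Delete chat documents 30 days after rollover; metrics indices would use a
# separate policy with a longer (or no) delete phase.
es.ilm.put_lifecycle(
    name="chat-30d-retention",
    policy={
        "phases": {
            "hot": {"actions": {"rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}}},
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    },
)

# Attach the policy to the chat data stream's index template.
es.indices.put_index_template(
    name="elasticgpt-chat",
    index_patterns=["elasticgpt-chat-*"],
    data_stream={},
    template={"settings": {"index.lifecycle.name": "chat-30d-retention"}},
)
```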
**Data Integration and Security**
* Enterprise Connectors are used to ingest data from various internal sources including their Wiki and ServiceNow
* All data is processed into searchable chunks with vector embeddings generated for semantic search
* Security is implemented through Okta SSO integration and end-to-end encryption
* OpenAI models (GPT-4o and GPT-4o-mini) are accessed through a private Azure tenant for compliance (a configuration sketch follows this list)
* The system helps prevent shadow AI by giving employees secure access to popular LLMs
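The case study doesn't detail how the private Azure tenant is wired in, but a hedged sketch using LangChain's `AzureChatOpenAI` pointed at a private endpoint could look like this; the endpoint, API version, and deployment names are assumptions, and the API key is expected in the `AZURE_OPENAI_API_KEY` environment variable.

```python
import os
from langchain_openai import AzureChatOpenAI

# Placeholder endpoint for the private Azure OpenAI tenant; the key is read
# from the AZURE_OPENAI_API_KEY environment variable.
llm = AzureChatOpenAI(
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT", "https://private-tenant.openai.azure.com"),
    api_version="2024-06-01",
    azure_deployment="gpt-4o",  # or "gpt-4o-mini" for cheaper, faster calls
    temperature=0,
)

print(llm.invoke("Summarize our travel reimbursement policy in one sentence.").content)
```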
**RAG Implementation**
* LangChain orchestrates the RAG pipeline, handling data chunking, embedding generation, and context retrieval (a pipeline sketch follows this list)
* The system intelligently selects relevant context chunks rather than entire documents
* Responses are streamed in real-time to provide a natural conversation experience
* Source attribution and linking build trust in the system's responses
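SmartSource itself isn't public, so the following is only a rough sketch of such a LangChain pipeline: chunk the source documents, embed them into Elasticsearch, retrieve only the most relevant chunks, and stream a grounded, source-attributed answer. Index names, chunk sizes, embedding deployments, and the prompt wording are all illustrative assumptions.

```python
from langchain_core.documents import Document
from langchain_elasticsearch import ElasticsearchStore
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Documents would normally arrive via Enterprise Connectors (Wiki, ServiceNow, ...).
raw_docs = [Document(page_content="...", metadata={"source": "wiki:travel-policy"})]

# 1. Split documents into retrievable chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = splitter.split_documents(raw_docs)

# 2. Embed chunks and store them in Elasticsearch (hypothetical index and endpoint;
#    Azure credentials come from the standard environment variables).
store = ElasticsearchStore.from_documents(
    chunks,
    embedding=AzureOpenAIEmbeddings(azure_deployment="text-embedding-3-small"),
    es_url="https://internal-elastic-cloud:9243",
    es_api_key="...",
    index_name="smartsource-chunks",
)

# 3. Retrieve only the most relevant chunks rather than whole documents.
retriever = store.as_retriever(search_kwargs={"k": 4})

llm = AzureChatOpenAI(azure_deployment="gpt-4o", api_version="2024-06-01")

# 4. Ground the answer in retrieved context and stream tokens back to the client.
def answer(question: str):
    docs = retriever.invoke(question)
    context = "\n\n".join(f"[{d.metadata['source']}] {d.page_content}" for d in docs)
    prompt = (
        "Answer using only the context below and cite the bracketed sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    for chunk in llm.stream(prompt):
        yield chunk.content  # streamed chunk by chunk for a real-time feel
```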
**Monitoring and Observability**
* Elastic APM provides comprehensive monitoring of all system components (an instrumentation sketch follows this list)
* Every API transaction is tracked for latency and error rates
* Kibana dashboards provide real-time visibility into system health and usage
* User feedback is collected and stored for continuous improvement
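The write-up doesn't show the instrumentation code, but if the API layer is a Python service, a sketch with the Elastic APM agent (assuming FastAPI/Starlette here; service name, APM Server URL, and helper functions are hypothetical) could look like this. Every request becomes an APM transaction, and custom spans make retrieval latency visible separately from generation latency in Kibana.

```python
import elasticapm
from elasticapm.contrib.starlette import ElasticAPM, make_apm_client
from fastapi import FastAPI

# Hypothetical service name and APM Server URL.
apm = make_apm_client({
    "SERVICE_NAME": "elasticgpt-api",
    "SERVER_URL": "https://apm.internal-elastic-cloud:443",
    "ENVIRONMENT": "production",
})

app = FastAPI()
app.add_middleware(ElasticAPM, client=apm)  # every request becomes an APM transaction


def retrieve(question: str) -> list[str]:
    ...  # hypothetical helper: queries Elasticsearch for context chunks


def generate(question: str, docs: list[str]) -> str:
    ...  # hypothetical helper: calls the Azure-hosted model


@app.post("/chat")
async def chat(question: str):
    # Custom spans separate retrieval and generation latency in the APM UI.
    with elasticapm.capture_span("rag-retrieval"):
        docs = retrieve(question)
    with elasticapm.capture_span("llm-generation"):
        answer = generate(question, docs)
    return {"answer": answer}
```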
**Production Considerations**
* The system is designed to scale from hundreds to thousands of users
* Updates can be deployed without downtime thanks to Kubernetes orchestration
* The platform approach allows rapid iteration as AI capabilities evolve
* They're expanding to support additional LLM providers such as Anthropic and Google (a provider-abstraction sketch follows this list)
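Supporting multiple providers typically comes down to hiding vendor SDKs behind a common chat-model interface. The sketch below assumes LangChain's provider packages and placeholder model names; it is not Elastic's actual abstraction.

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.language_models import BaseChatModel
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_openai import AzureChatOpenAI


def get_chat_model(provider: str) -> BaseChatModel:
    """Return a chat model behind a common interface so the RAG pipeline
    stays provider-agnostic as new vendors are added (model names are placeholders)."""
    if provider == "openai":
        return AzureChatOpenAI(azure_deployment="gpt-4o", api_version="2024-06-01")
    if provider == "anthropic":
        return ChatAnthropic(model="claude-3-5-sonnet-20241022")
    if provider == "google":
        return ChatGoogleGenerativeAI(model="gemini-1.5-pro")
    raise ValueError(f"Unknown provider: {provider}")


# The rest of the pipeline relies only on the shared .invoke()/.stream() interface.
llm = get_chat_model("anthropic")
print(llm.invoke("Hello!").content)
```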
The implementation demonstrates several important LLMOps best practices:
* Using RAG to ground responses in authoritative data rather than relying solely on LLM knowledge
* Implementing comprehensive security and compliance measures
* Building robust monitoring and observability from the start
* Designing for scalability and maintainability
* Creating feedback loops for continuous improvement (a feedback-capture sketch follows this list)
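The mechanics of the feedback loop aren't described, but since chat data already lives in Elasticsearch, one plausible sketch is to index each thumbs-up/down event as its own document and aggregate it in Kibana; the index name and fields below are assumptions.

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("https://internal-elastic-cloud:9243", api_key="...")

# Hypothetical feedback document; a Kibana dashboard can then break down
# thumbs-up/down rates per model, data source, or conversation.
es.index(
    index="elasticgpt-feedback",
    document={
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "conversation_id": "abc-123",
        "model": "gpt-4o",
        "rating": "thumbs_down",
        "comment": "Answer cited an outdated wiki page.",
        "retrieved_sources": ["wiki:travel-policy", "servicenow:KB0012345"],
    },
)
```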
Future plans include:
* Incorporating new Elasticsearch features such as the "Semantic Text" field type and inference endpoints (a mapping sketch follows this list)
* Adding specialized AI agents for workflow automation
* Expanding LLM observability capabilities
* Integrating with additional LLM providers
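The `semantic_text` field type and inference endpoints shift chunking and embedding into Elasticsearch itself, which would replace much of the manual pipeline sketched earlier. As a hedged illustration (index name and inference endpoint ID are placeholders), a future SmartSource index might be defined and queried like this:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://internal-elastic-cloud:9243", api_key="...")

# A semantic_text field delegates chunking and embedding to an inference endpoint.
es.indices.create(
    index="smartsource-semantic",
    mappings={
        "properties": {
            "content": {
                "type": "semantic_text",
                "inference_id": "my-embedding-endpoint",  # hypothetical inference endpoint ID
            },
            "title": {"type": "text"},
        }
    },
)

# Querying becomes a simple semantic query against that field.
hits = es.search(
    index="smartsource-semantic",
    query={"semantic": {"field": "content", "query": "How do I submit travel expenses?"}},
)
```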
What makes this case study particularly interesting is how it showcases building a production AI system using primarily off-the-shelf components (albeit their own) rather than custom infrastructure. This approach potentially offers faster time-to-market and reduced maintenance burden compared to building everything from scratch.
The system's architecture also demonstrates a thoughtful balance between capability and control. By hosting third-party LLMs on a private Azure tenant and implementing strict security measures, they maintain control over data and compliance while still leveraging state-of-the-art models. The RAG implementation similarly balances the power of LLMs with the need for accurate, authoritative responses.
One potential limitation to note is the dependence on OpenAI's models, though they are working to diversify their LLM providers. The case study also doesn't provide specific metrics on accuracy or user satisfaction, focusing instead on technical implementation details.
From an LLMOps perspective, the system demonstrates mature practices around deployment, monitoring, and scaling. The use of Kubernetes for orchestration, comprehensive APM monitoring, and designed-in scalability shows an understanding of what it takes to run AI systems in production. The attention to security and compliance aspects is also noteworthy, as these are often overlooked in more experimental implementations.
The case study serves as a valuable reference for organizations looking to implement similar systems, particularly those already using Elastic stack components. It shows how to leverage existing infrastructure investments while adding modern AI capabilities in a controlled, secure manner.