Elastic developed a comprehensive framework for evaluating and improving GenAI features in their security products, including an AI Assistant and Attack Discovery tool. The framework incorporates test scenarios, curated datasets, tracing capabilities using LangGraph and LangSmith, evaluation rubrics, and a scoring mechanism to ensure quantitative measurement of improvements. This systematic approach enabled them to move from manual to automated evaluations while maintaining high quality standards for their production LLM applications.
This case study details how Elastic developed and implemented a robust evaluation framework for their production GenAI features in the security domain, showcasing a mature approach to LLMOps that moves beyond simple proofs of concept to serve enterprise users at scale.
Elastic has established itself as a significant player in the GenAI infrastructure space, ranked #2 in LangChain's Top 5 LangGraph Agents in Production 2024 and named GenAI Infrastructure and Data Partner of the Year by AWS. Their position as both a user and a provider of GenAI development tooling (Elasticsearch being the world's most downloaded vector database) gives them unique insight into production LLM systems.
The case study focuses on three main GenAI products in production:
* Elastic AI Assistant for Security - A chatbot that answers security-related questions and translates natural language into ES|QL
* Attack Discovery - A system that analyzes security alerts to identify and summarize active attacks
* Automatic Import - A tool that creates custom integrations from sample log lines
The heart of the case study is their evaluation framework, which was developed to ensure consistent quality improvements in their GenAI features. The framework consists of several key components:
Test Scenarios and Datasets:
* They created diverse security scenarios including living-off-the-cloud attacks, advanced persistent threats, and known vulnerabilities
* Test datasets were carefully curated through human-in-the-loop validation
* Used real-world data from sources like ohmymalware.com and their Security Labs team
* Examples that met quality criteria were added to their test dataset through LangSmith's UI
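As a hedged illustration of this curation step, the sketch below adds a validated example to a LangSmith dataset via the Python SDK rather than the UI; the dataset name, field names, and ES|QL query are hypothetical stand-ins, not Elastic's actual data.

```python
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

# Hypothetical dataset of human-validated natural-language -> ES|QL pairs
dataset = client.create_dataset(
    dataset_name="esql-generation-examples",
    description="Curated natural language to ES|QL translation examples",
)

client.create_examples(
    inputs=[{"question": "Show failed logins from the last 24 hours"}],
    outputs=[{"esql": 'FROM logs-* | WHERE event.outcome == "failure" '
                      "AND @timestamp >= NOW() - 24 hours"}],
    dataset_id=dataset.id,
)
```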
Tracing Implementation:
* Utilized LangGraph for designing and running AI Agent workflows
* Integrated LangSmith for comprehensive tracing capabilities
* Implemented Elasticsearch as their vector database for RAG functionality
* Created a complete trace of the system from user request to response generation
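A minimal sketch of what such a traced agent workflow could look like with LangGraph, assuming LangSmith tracing is enabled via environment variables; the state fields and node bodies are illustrative placeholders, not Elastic's actual implementation.

```python
import os
from typing import TypedDict

from langgraph.graph import StateGraph, START, END

# With these variables set, LangGraph/LangChain runs are traced to LangSmith
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = "..."

class AssistantState(TypedDict):
    question: str
    retrieved_docs: list[str]
    answer: str

def retrieve(state: AssistantState) -> dict:
    # Placeholder for the RAG retrieval step (e.g. an Elasticsearch vector search)
    return {"retrieved_docs": ["doc snippet 1", "doc snippet 2"]}

def generate(state: AssistantState) -> dict:
    # Placeholder for the LLM call that produces the final response
    return {"answer": f"Answer grounded in {len(state['retrieved_docs'])} documents"}

graph = StateGraph(AssistantState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)
app = graph.compile()

print(app.invoke({"question": "Summarize recent alerts"}))
```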
Evaluation Rubrics and Scoring:
* Developed detailed rubrics to evaluate specific desired behaviors
* Used LLM-as-judge approach for automated evaluations
* Implemented real-time evaluation in the production flow
* Created a scoring mechanism with minimum thresholds (e.g., 85% accuracy requirement)
* Weighted different evaluation criteria based on business requirements
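A hedged sketch of the LLM-as-judge pattern described above, using LangSmith's `evaluate` API; the judge model, rubric wording, and target function are assumptions for illustration, not Elastic's actual rubric or code.

```python
from langchain_openai import ChatOpenAI
from langsmith import evaluate

judge = ChatOpenAI(model="gpt-4o", temperature=0)  # hypothetical judge model

RUBRIC = (
    "Return 1 if the candidate ES|QL query is syntactically valid and answers "
    "the question as well as the reference does, otherwise return 0. "
    "Respond with a single digit."
)

def llm_as_judge(run, example) -> dict:
    # Grade one output against the rubric and the curated reference answer
    verdict = judge.invoke(
        f"{RUBRIC}\n\nQuestion: {example.inputs['question']}\n"
        f"Reference: {example.outputs['esql']}\n"
        f"Candidate: {run.outputs['esql']}"
    )
    return {"key": "correctness", "score": int(verdict.content.strip())}

def generate_esql(inputs: dict) -> dict:
    # Placeholder for the application under test (e.g. the AI Assistant chain)
    return {"esql": "FROM logs-* | LIMIT 10"}

results = evaluate(
    generate_esql,
    data="esql-generation-examples",  # the curated dataset from earlier
    evaluators=[llm_as_judge],
    experiment_prefix="esql-judge-eval",
)
```

Aggregate scores from an experiment like this can then be checked against the minimum threshold (for example, the 85% accuracy bar mentioned above) before a prompt or model change is promoted to production.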
The evolution of their evaluation process is particularly noteworthy. They started with manual spreadsheet-based testing for their AI Assistant's natural language-to-ES|QL generation functionality. As they matured, they transitioned to automated evaluations using LangSmith, significantly improving their workflow efficiency.
For Attack Discovery, they faced additional challenges:
* Need for realistic input alerts representing actual attack scenarios
* Requirement for cybersecurity expertise in evaluating outputs
* Complex evaluation criteria including chronological accuracy and narrative clarity
* Initial reliance on manual expert review before moving to automated evaluation
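As a hedged illustration of how such multi-criterion rubrics can be rolled up into a single number, the snippet below weights per-criterion grades; the criterion names follow the case study's description, while the numeric weights and the reuse of the 85% threshold are illustrative assumptions.

```python
# Hypothetical weights for an Attack Discovery rubric; the criterion names
# come from the case study, the weights are illustrative only.
RUBRIC_WEIGHTS = {
    "chronological_accuracy": 0.6,
    "narrative_clarity": 0.4,
}

def weighted_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores in [0, 1] into one weighted score."""
    return sum(
        RUBRIC_WEIGHTS[name] * score for name, score in criterion_scores.items()
    )

score = weighted_score({"chronological_accuracy": 1.0, "narrative_clarity": 0.8})
assert score >= 0.85, "Discovery falls below the minimum quality threshold"
```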
Their framework also enables systematic comparison of different LLMs and prompt variations, leading to a recommended LLM matrix for different tasks. This allows them to make data-driven decisions about which configurations to use in production.
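One way such a comparison could be assembled, sketched under the assumption of a hypothetical `run_eval` helper that wraps a LangSmith experiment and returns the mean rubric score for a model/task pair; the model and task names are placeholders.

```python
import pandas as pd

def run_eval(model_name: str, task: str) -> float:
    """Hypothetical helper: run a LangSmith experiment for one model/task
    pair and return its mean rubric score in [0, 1]."""
    return 0.9  # placeholder value; replace with a real experiment run

models = ["model-a", "model-b", "model-c"]        # placeholder model names
tasks = ["esql_generation", "attack_discovery"]   # placeholder tasks

# Score matrix: rows are candidate LLMs, columns are product tasks
matrix = pd.DataFrame(
    [[run_eval(m, t) for t in tasks] for m in models],
    index=models,
    columns=tasks,
)
print(matrix)
```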
Technical Implementation Details:
* Integration of multiple LLM providers through their connector system
* Use of Elasticsearch for vector search capabilities
* Implementation of RAG patterns for knowledge augmentation
* Automated evaluation pipelines using LangSmith
* Visualization of results using tools like Seaborn for heatmap generation
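Continuing the hypothetical comparison matrix from the earlier sketch, a model-by-task heatmap could be rendered with Seaborn roughly as follows; the data and labels are illustrative.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# `matrix` is the model-by-task score DataFrame built in the earlier sketch
ax = sns.heatmap(matrix, annot=True, vmin=0.0, vmax=1.0, cmap="viridis")
ax.set_title("Mean rubric score per LLM and task (illustrative)")
plt.tight_layout()
plt.show()
```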
The framework demonstrates several LLMOps best practices:
* Continuous evaluation and improvement cycles
* Automated testing and validation
* Quantitative metrics for decision-making
* Clear evaluation criteria and rubrics
* Version control for prompts and configurations
* Integration with modern LLM development tools
* Comprehensive tracing and monitoring
Importantly, their approach emphasizes the need for both qualitative and quantitative evaluation methods, with a focus on measurable improvements rather than subjective assessments. This has allowed them to maintain high quality standards while serving enterprise users at scale.
The case study represents a mature example of LLMOps in practice, showing how to move beyond initial experimentation to robust, production-grade systems with reliable evaluation frameworks. Their approach provides valuable insights for organizations looking to implement similar systems, particularly in domains requiring high accuracy and reliability like security.