Santalucía Seguros implemented a GenAI-based Virtual Assistant to improve customer service and agent productivity in their insurance operations. The solution uses a RAG framework powered by Databricks and Microsoft Azure, incorporating MLflow for LLMOps and Mosaic AI Model Serving for LLM deployment. They developed a sophisticated LLM-based evaluation system that acts as a judge for quality assessment before new releases, ensuring consistent performance and reliability of the virtual assistant.
Santalucía Seguros is a Spanish insurance company that has been serving families for over 100 years. The company faced a common challenge in the insurance industry: agents needed to access vast amounts of documentation from multiple locations and in different formats to answer customer queries about products, coverages, and procedures. This created friction in customer service interactions and slowed down the sales process.
To address this, Santalucía implemented a GenAI-based Virtual Assistant (VA) using a Retrieval Augmented Generation (RAG) framework. The VA enables insurance agents to get instant, natural language answers to their questions through Microsoft Teams, accessible on mobile devices, tablets, or computers with 24/7 availability. The stated benefits include faster customer response times, improved customer satisfaction, and accelerated sales cycles by providing immediate and accurate answers about coverage and products.
The solution architecture is built on what Santalucía calls their “Advanced Analytics Platform,” which is powered by Databricks and Microsoft Azure. This combination was chosen to provide flexibility, privacy, security, and scalability. Several key architectural decisions are worth noting:
The RAG system enables continuous ingestion of up-to-date documentation into embedding-based vector stores. These vector stores provide the ability to index information for rapid search and retrieval, which is essential for answering agent queries in real-time. The architecture supports ongoing updates as new documentation becomes available, which is a critical requirement for an insurance company that regularly updates its product offerings and coverage details.
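The ingestion loop described above can be sketched as follows. This is a minimal illustration, not Santalucía's implementation: the `embed` function and in-memory `VectorIndex` are stand-ins for the embedding model and the Databricks vector store the text refers to.

```python
# Sketch of continuous document ingestion into an embedding-based index.
# VectorIndex and embed() are illustrative stand-ins; the real system
# would target a managed vector store on the Databricks platform.
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    """Minimal stand-in for an embedding-based vector store."""
    entries: list = field(default_factory=list)

    def upsert(self, doc_id: str, chunk: str, vector: list):
        # Drop any stale chunk for the same document, then add the new one,
        # so agents always retrieve the latest coverage details.
        self.entries = [e for e in self.entries if e[0] != doc_id]
        self.entries.append((doc_id, chunk, vector))

def embed(text: str) -> list:
    # Placeholder embedding: character-frequency vector (illustrative only).
    return [text.count(c) for c in "abcde"]

def ingest(index: VectorIndex, docs: dict):
    """Index each document so it is searchable as soon as it is published."""
    for doc_id, text in docs.items():
        index.upsert(doc_id, text, embed(text))

index = VectorIndex()
ingest(index, {"coverage-2024": "updated home coverage terms ..."})
```

The upsert-by-document-id pattern matters here: re-ingesting an updated policy document replaces its old chunks rather than leaving stale answers retrievable.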
The RAG system itself is set up as a pyfunc model in MLflow, the open-source LLMOps framework that originated from Databricks. This approach allows the team to version, track, and deploy the RAG pipeline as a cohesive model artifact. Using pyfunc provides flexibility in how the model logic is implemented while still benefiting from MLflow’s model registry and deployment capabilities.
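The shape of such a pyfunc-packaged pipeline looks roughly like the following. In the real setup the class would subclass `mlflow.pyfunc.PythonModel` and be logged with `mlflow.pyfunc.log_model()`; here the interface is mirrored without the mlflow dependency so the sketch stays self-contained, and the retriever and LLM client are hypothetical callables.

```python
# Illustrative shape of a RAG pipeline packaged as an MLflow pyfunc model.
# Mirrors the PythonModel interface (load_context / predict) without
# importing mlflow; retriever and llm are assumed injectable callables.
class RAGModel:
    def __init__(self, retriever, llm):
        self.retriever = retriever  # vector-store search callable
        self.llm = llm              # Model Serving endpoint client

    def load_context(self, context):
        # pyfunc hook: load artifacts (prompts, index handles) at serving time.
        pass

    def predict(self, context, model_input):
        # model_input: list of agent questions; returns grounded answers.
        answers = []
        for question in model_input:
            passages = self.retriever(question)
            prompt = f"Answer using only:\n{passages}\n\nQ: {question}"
            answers.append(self.llm(prompt))
        return answers
```

Packaging retrieval and generation as one artifact is what lets the team version and promote the whole pipeline through the MLflow model registry as a unit.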
For LLM inference, the team uses Databricks Mosaic AI Model Serving endpoints to host all of the LLMs used for queries. This centralized approach to model serving provides several operational benefits, discussed in detail below.
One of the key LLMOps practices highlighted in this case study is the use of Mosaic AI Model Serving to integrate external LLMs such as GPT-4 and other models available in the Databricks Marketplace. The Model Serving layer manages configuration, credentials, and permissions for these third-party models, exposing them through a unified REST API.
This abstraction layer offers several advantages from an operational perspective. First, it ensures that any application or service consumes the LLM capabilities in a standardized way, reducing integration complexity. Second, it simplifies the work of development teams when adding new models by eliminating the need to build custom integrations with third-party APIs. Third, and perhaps most importantly for enterprise use cases, it enables centralized management of token consumption, credentials, and security access.
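Consuming a model through that unified layer might look like the sketch below. The workspace URL, endpoint name, and payload shape are assumptions for illustration; the `/serving-endpoints/{name}/invocations` path follows the Databricks Model Serving REST convention, and in practice the token would come from the platform's credential management rather than application code.

```python
# Sketch of calling any hosted model through a unified serving REST API.
# The same call shape works whether the endpoint fronts GPT-4 or a
# Marketplace model; the opener parameter exists so the sketch is testable.
import json
import urllib.request

def query_endpoint(workspace_url: str, endpoint: str, token: str,
                   messages: list, opener=urllib.request.urlopen):
    """POST a chat payload to a serving endpoint and return the parsed reply."""
    req = urllib.request.Request(
        f"{workspace_url}/serving-endpoints/{endpoint}/invocations",
        data=json.dumps({"messages": messages}).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with opener(req) as resp:
        return json.loads(resp.read())
```

Because every consumer goes through the same invocation shape, swapping the model behind an endpoint requires no change in the applications that call it.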
The team has built a streamlined deployment process where new endpoints can be created on request using a git repository with a CI/CD process. This process deploys the endpoint configuration to the appropriate Databricks workspace automatically. The configuration is defined in JSON files that parameterize credentials and endpoints, with sensitive credentials stored securely in Azure Key Vault. MLflow is then used to deploy models in Databricks through the CI/CD pipelines.
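A configuration file of the kind described might look roughly like this; every field name below is a hypothetical illustration of the parameterized credentials and endpoint settings, with the secret resolved from Azure Key Vault at deploy time rather than stored in the repository:

```json
{
  "endpoint_name": "gpt-4-chat",
  "served_model": "openai-gpt-4",
  "workspace": "prod",
  "credentials": {
    "api_key_secret": "{{azure-key-vault:llm-api-key}}"
  },
  "rate_limits": { "tokens_per_minute": 100000 }
}
```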
This approach demonstrates a mature LLMOps practice where model serving infrastructure is treated as code, version controlled, and deployed through automated pipelines rather than manual configuration.
Perhaps the most interesting LLMOps practice described in this case study is the implementation of an LLM-as-a-judge evaluation system integrated directly into the CI/CD pipeline. The business context here is critical: Santalucía cannot afford to release updates to the VA that degrade response quality for previously working scenarios.
The challenge is that each time new documents are ingested into the VA, the team must verify the assistant’s performance before releasing the updated version. Traditional approaches that rely on user feedback are too slow and reactive for this use case. Instead, the system must be able to assess quality automatically before scaling to production.
The solution uses a high-capacity LLM as an automated evaluator within the CI/CD pipeline. The process works as follows:
First, the team creates a ground truth set of questions that have been validated by domain experts. When new product documentation is added to the VA, the team (either manually or with LLM assistance) develops a set of questions about the documentation along with expected answers. Importantly, this ground truth dataset grows with each release, building an increasingly robust regression test suite.
Second, the LLM-as-a-judge is configured with natural-language-based criteria for measuring accuracy, relevance, and coherence between expected answers and those provided by the VA. These criteria are designed to assess whether the VA’s responses match the intent and content of the ground truth answers, even if the exact wording differs.
Third, during the CI/CD pipeline execution, the VA answers each question from the ground truth set, and the judge LLM assigns a score by comparing the expected answer with the VA’s response. This creates a quantitative quality assessment that can be used to gate releases.
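The three steps above can be sketched as a release gate. This is a simplified illustration under stated assumptions: `judge_llm` and `answer_va` are hypothetical callables, and the 0–5 scale and threshold are placeholders for whatever scoring rubric the team actually uses.

```python
# Sketch of an LLM-as-a-judge release gate in a CI/CD pipeline.
# The judge scores each VA answer against expert-validated ground truth;
# the pipeline blocks the release if the mean score falls below a threshold.
JUDGE_PROMPT = """You are grading an insurance virtual assistant.
Criteria: accuracy, relevance, and coherence with the expected answer,
even if the exact wording differs.
Expected: {expected}
Actual: {actual}
Reply with a single integer from 0 (wrong) to 5 (equivalent)."""

def evaluate_release(ground_truth, answer_va, judge_llm, threshold=4.0):
    """Return (passed, mean_score) for a candidate release."""
    scores = []
    for question, expected in ground_truth:
        actual = answer_va(question)
        reply = judge_llm(JUDGE_PROMPT.format(expected=expected, actual=actual))
        scores.append(int(reply.strip()))
    mean = sum(scores) / len(scores)
    return mean >= threshold, mean
```

Because the ground truth set only ever grows, each run of this gate also re-checks every previously validated question, which is what makes it function as a regression suite.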
The benefits of this approach are significant. It eliminates the wait for user reports about malfunctioning retrieval or generation. It also enables the team to make incremental changes to components like prompts while ensuring these changes don’t negatively impact quality for previously delivered releases. This is a form of regression testing specifically designed for the probabilistic nature of LLM outputs.
The case study explicitly acknowledges that supporting continuous delivery of new releases while maintaining good LLMOps practices and response quality is challenging. The seamless integration of newly ingested documents into the RAG system requires careful orchestration of multiple components: the document ingestion pipeline, the vector store updates, the RAG model, and the evaluation system.
The team emphasizes that ensuring response quality is critical for the business, and they cannot modify any part of the solution’s code without guaranteeing it won’t negatively impact previously delivered releases. This requires thorough testing and validation processes, which the LLM-as-a-judge approach addresses.
The reliance on “RAG tools available in the Databricks Data Intelligence Platform” suggests the team is leveraging platform-native capabilities for ensuring releases have the latest data with appropriate governance and guardrails around their output. This includes centralized model management through MLflow and secure credential handling through Azure Key Vault.
While the case study presents a compelling architecture, there are some areas where additional details would be valuable. The text does not provide specific metrics on response quality improvements, agent productivity gains, or customer satisfaction increases. Claims that the Virtual Assistant “exceeded user expectations” are not quantified with benchmarks or survey data.
The LLM-as-a-judge approach, while innovative, has known limitations. The quality of evaluation depends heavily on the comprehensiveness of the ground truth dataset and the ability of the judge LLM to accurately assess semantic similarity and correctness. The case study acknowledges that creating ground truth requires manual validation by professionals, which can be a bottleneck for rapidly evolving documentation.
Additionally, the reliance on external LLM services like GPT-4 through Model Serving introduces dependencies on third-party availability and pricing, though the abstraction layer does provide some flexibility to switch providers.
The complete technology stack, as described across the case study, includes Databricks (with Mosaic AI Model Serving and MLflow), Microsoft Azure (with Azure Key Vault for credential storage), Microsoft Teams as the user-facing channel, and external LLMs such as GPT-4 accessed through Model Serving endpoints.
This architecture demonstrates a pattern increasingly common in enterprise GenAI deployments: using a managed platform like Databricks for the core data and model infrastructure while integrating with enterprise collaboration tools like Microsoft Teams for the user-facing application layer.
The Santalucía Seguros case study represents a well-documented example of enterprise RAG deployment with mature LLMOps practices. The key innovations are the centralized model serving layer for managing LLM access and the LLM-as-a-judge evaluation system integrated into CI/CD. These practices address real operational challenges around credential management, security, and quality assurance in a production GenAI system. While quantitative results are not provided, the architectural patterns and processes described offer valuable guidance for organizations implementing similar solutions in regulated industries like insurance.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.
Coinbase, a cryptocurrency exchange serving millions of users across 100+ countries, faced challenges scaling customer support amid volatile market conditions, managing complex compliance investigations, and improving developer productivity. They built a comprehensive Gen AI platform integrating multiple LLMs through standardized interfaces (OpenAI API, Model Context Protocol) on AWS Bedrock to address these challenges. Their solution includes AI-powered chatbots handling 65% of customer contacts automatically (saving ~5 million employee hours annually), compliance investigation tools that synthesize data from multiple sources to accelerate case resolution, and developer productivity tools where 40% of daily code is now AI-generated or influenced. The implementation uses a multi-layered agentic architecture with RAG, guardrails, memory systems, and human-in-the-loop workflows, resulting in significant cost savings, faster resolution times, and improved quality across all three domains.