Doctolib developed an agentic AI system called Alfred to handle customer support requests for their healthcare platform. The system uses multiple specialized AI agents powered by LLMs, working together in a directed graph structure using LangGraph. The initial implementation focused on managing calendar access rights, combining RAG for knowledge base integration with careful security measures and human-in-the-loop confirmation for sensitive actions. The system was designed to maintain high customer satisfaction while managing support costs efficiently.
Doctolib, a European healthcare technology platform connecting patients with health professionals, embarked on developing an agentic AI system called “Alfred” to transform their customer support operations. The core business problem was straightforward: as the platform scaled, support request volume grew proportionally, and the traditional approach of linearly scaling human support teams was neither sustainable nor cost-effective. The company sought to automate routine support queries while preserving human intervention for complex cases requiring empathy and nuanced expertise.
The case study, published in early 2025 with development occurring through Q4 2024, provides a detailed technical walkthrough of how Doctolib implemented an agentic AI architecture for production use. It’s important to note that this system appears to still be in its early stages, with calendar access management serving as the initial proof of concept rather than a fully deployed, battle-tested solution.
The fundamental design decision was to build an agentic system rather than a traditional chatbot or simple RAG-based assistant. The agentic approach involves multiple specialized AI agents, each powered by an LLM but constrained through specialized prompts defining their role, context, and expertise, as well as a specific set of tools they can access. This follows the principle of least privilege—each agent only has access to the APIs and data sources necessary for its specific function.
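The least-privilege constraint described above can be sketched in plain Python: each agent is constructed with an explicit tool allowlist and cannot reach any API outside it. This is an illustrative sketch only; the class, tool names, and stub return values are hypothetical, not Doctolib's actual code.

```python
# Hypothetical sketch of per-agent tool allowlists (principle of least privilege).
class Agent:
    def __init__(self, name, system_prompt, tools):
        self.name = name
        self.system_prompt = system_prompt  # constrains the agent's role and context
        self.tools = dict(tools)            # the only tools this agent can ever invoke

    def call_tool(self, tool_name, *args, **kwargs):
        # Any tool outside the allowlist is rejected, regardless of what the LLM asks for.
        if tool_name not in self.tools:
            raise PermissionError(f"agent '{self.name}' may not call '{tool_name}'")
        return self.tools[tool_name](*args, **kwargs)

# The calendar agent sees agenda APIs only; the knowledge-base agent sees search only.
calendar_agent = Agent(
    name="calendar",
    system_prompt="You manage agenda access rights.",
    tools={"list_access_rights": lambda user_id: ["read-only: Maria Smith"]},
)
kb_agent = Agent(
    name="knowledge_base",
    system_prompt="You answer from the support knowledge base.",
    tools={"search": lambda query: ["article: sharing calendars"]},
)
```

The point of the pattern is that a prompt-injected or confused agent still cannot call an API it was never wired to, because the check happens in code rather than in the prompt.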
The agents are orchestrated using LangGraph, a framework from the LangChain ecosystem designed for building complex agent workflows. The interaction between agents follows a directed graph structure where each node represents either an LLM-based agent or a deterministic function, and edges define communication paths. The flow of information depends on the output of previous nodes, allowing for dynamic conversation routing.
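The directed-graph pattern can be illustrated without the actual LangGraph API: nodes are functions (LLM-backed or deterministic), and a routing function picks the next edge based on the previous node's output. This is a minimal stdlib sketch of the pattern, not Doctolib's graph; the node names and the stubbed classifier are assumptions.

```python
# Stdlib sketch of directed-graph agent orchestration (LangGraph-style routing).
def classify(state):
    # In production this node would be an LLM call; here it is a stub.
    state["intent"] = "calendar_access" if "calendar" in state["message"] else "other"
    return state

def calendar_agent(state):
    state["reply"] = "Which practitioner should get access?"
    return state

def rag_agent(state):
    state["reply"] = "Here is what the knowledge base says..."
    return state

NODES = {"classify": classify, "calendar": calendar_agent, "rag": rag_agent}

def route(node, state):
    # Conditional edge: the next node depends on the previous node's output.
    if node == "classify":
        return "calendar" if state["intent"] == "calendar_access" else "rag"
    return None  # terminal node

def run(message):
    state, node = {"message": message}, "classify"
    while node:
        state = NODES[node](state)
        node = route(node, state)
    return state
```

Because edges are evaluated against the evolving state, the same graph serves many conversation shapes: a calendar question and a general support question enter the same entry node but exit through different branches.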
One notable architectural decision was the integration of their existing RAG (Retrieval Augmented Generation) engine as a specialized agent within the agentic system. This demonstrates a practical approach to building on existing infrastructure rather than replacing it entirely.
A critical LLMOps consideration in this implementation is the handling of AI hallucinations and sensitive operations. Doctolib made an explicit policy decision, reached through discussions with engineers, legal, and leadership: the LLM will never directly execute sensitive actions. The final step of performing any action that modifies data (such as changing agenda access permissions) always remains in the user’s hands.
This human-in-the-loop approach addresses a fundamental challenge in production LLM systems—the non-deterministic nature of LLMs means they can and do hallucinate. By requiring explicit user confirmation before any sensitive action is executed, the system maintains safety while still providing efficiency gains through automated information gathering and solution preparation.
However, this design introduces its own complexity: how do you ensure that what is displayed to users accurately represents what will happen when they confirm? The article describes a sophisticated verification mechanism where a deterministic node fact-checks the LLM’s crafted request by fetching fresh data for all referenced resources and returning both technical and human-readable forms. For example, if the LLM references user_id 42, the system verifies this corresponds to “John Doe” and displays that name, preventing hallucinated IDs from being executed.
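The verification-then-confirmation flow might look like the following sketch, assuming a hypothetical user store in place of a live backend call: the deterministic node re-fetches every resource the LLM referenced, an unknown ID aborts before anything is shown, and execution is gated on the user confirming the human-readable form.

```python
# Sketch of the deterministic fact-check node plus human-in-the-loop gate.
USERS = {42: "John Doe"}  # stands in for a fresh fetch from the user service

def fact_check(proposed_action):
    """Return (technical, human_readable) forms of the action, or raise on unknown IDs."""
    user_id = proposed_action["user_id"]
    if user_id not in USERS:
        # A hallucinated ID fails the lookup and never reaches the user.
        raise ValueError(f"unknown user_id {user_id}: refusing to display action")
    return proposed_action, {**proposed_action, "user": USERS[user_id]}

def confirm_and_execute(proposed_action, user_confirms, execute):
    technical, readable = fact_check(proposed_action)
    # The LLM never executes directly: the user sees `readable` and must confirm.
    if user_confirms(readable):
        return execute(technical)
    return None
```

The key property is that the displayed form and the executed form are derived from the same freshly fetched data, so what the user confirms is what actually happens.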
The security architecture demonstrates thoughtful production engineering. The system implements service-to-service authentication using JSON Web Tokens (JWTs), with each token containing audience (target service) and issuer (calling service) claims. Beyond valid signatures, each service maintains an explicit allowlist of permitted callers, implementing defense in depth.
For user context propagation—ensuring Alfred operates with the same permissions as the user being assisted—the system carries two tokens with each request: the service-to-service JWT proving Alfred’s identity, and the user’s Keycloak token carrying user identity and permissions. This allows target services to both verify Alfred is authorized to make calls and apply the same permission checks as for direct user requests, maintaining consistent security boundaries.
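The two-token check a target service performs could be sketched as follows. For brevity the tokens appear as already-verified claim dicts; in production both would be cryptographically validated signed tokens (the service JWT and the user's Keycloak token), and the claim names and allowlist contents here are illustrative.

```python
# Sketch of dual-token authorization: service identity plus user permissions.
ALLOWED_CALLERS = {"agenda-service": {"alfred"}}  # explicit allowlist per target service

def authorize(service_token, user_token, target, required_permission):
    # 1. Service-to-service: the audience must be us, and the issuer must be allowlisted.
    if service_token["aud"] != target:
        return False
    if service_token["iss"] not in ALLOWED_CALLERS.get(target, set()):
        return False
    # 2. User context: apply the same permission check as a direct user request.
    return required_permission in user_token["permissions"]
```

Because step 2 reuses the user's own permissions, Alfred can never perform an action the assisted user could not perform directly.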
This approach is notable because it avoids the common anti-pattern of giving AI agents elevated admin access. Instead, the AI can only do what the user themselves could do, which significantly reduces the risk surface of the AI system.
The article provides useful scale metrics for production planning: approximately 1,700 support cases per business day, with an estimated 10 interactions per conversation, resulting in roughly 17,000 messages daily. While the author notes this is manageable from a throughput perspective, several production challenges are identified.
The architecture diagram shows Alfred connecting to multiple backend services including a Knowledge Base (for RAG), Agenda service, and Organization service, each authenticated through the JWT mechanism described above.
For evaluation, Doctolib uses Literal.ai, a specialized platform for AI evaluation, to track their core quality metrics.
This evaluation approach addresses the fundamental LLMOps challenge of measuring AI system quality in a structured, repeatable way. The use of ground truth comparisons suggests they’ve invested in creating evaluation datasets, though the article doesn’t detail the size or composition of these datasets.
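In the spirit of the ground-truth comparisons mentioned above, a minimal evaluation harness looks like the following. The article does not show Doctolib's actual metrics or the Literal.ai API, so the dataset shape and scoring here are assumptions.

```python
# Sketch of ground-truth evaluation: compare system outputs to expected answers.
def evaluate(system, dataset):
    """dataset: list of (input, expected_output) pairs; returns accuracy in [0, 1]."""
    correct = sum(1 for prompt, expected in dataset if system(prompt) == expected)
    return correct / len(dataset)
```

Even this simple form captures the essential LLMOps property: the same dataset can be rerun after every prompt or model change, turning quality into a tracked, repeatable number rather than an anecdote.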
The article emphasizes avoiding the “terrible dummy chatbot experience” of rigid decision trees or free-text fields that go nowhere. Instead, Alfred is designed as a “digital butler” that understands user needs even when imperfectly articulated, knows which clarifying questions to ask, discreetly gathers available information from backend systems, and presents clear, actionable solutions through a dynamic user interface.
The practical example of managing calendar access rights demonstrates this philosophy: rather than requiring a perfectly formulated request like “give Maria Smith read-only access to my home consultations calendar,” the system engages in a multi-turn conversation to progressively gather the needed information through dynamically generated UI elements.
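The progressive information gathering reads like classic slot filling: ask only for what is still missing, then stop. A minimal sketch, with field names and question wording invented for illustration:

```python
# Sketch of slot filling for the calendar-access conversation.
REQUIRED = ["grantee", "calendar", "permission"]

QUESTIONS = {
    "grantee": "Who should get access?",
    "calendar": "Which calendar?",
    "permission": "Read-only or full access?",
}

def next_question(slots):
    """Return the next clarifying question, or None once every slot is filled."""
    for field in REQUIRED:
        if field not in slots:
            return QUESTIONS[field]
    return None
```

In Alfred's case each answer arrives through a dynamically generated UI element rather than free text, but the underlying loop is the same: fill slots until the proposed action is fully specified, then hand it to the confirmation step.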
While the case study provides valuable technical details, several caveats should be noted.
The technical architecture appears well-thought-out, particularly the security model and human-in-the-loop design. However, the real test will be whether this approach scales across multiple support scenarios and whether the efficiency gains materialize in practice. The article is transparent about this being early-stage work, which is commendable.
The implementation uses several notable technologies and frameworks: LangGraph (from the LangChain ecosystem) for agent orchestration, the company's existing RAG engine for knowledge base integration, JWTs and Keycloak for service authentication and user context propagation, and Literal.ai for evaluation.
This case study provides a useful template for organizations considering agentic AI for customer support, particularly in regulated industries like healthcare where security and accuracy requirements are stringent.