Doctolib evolved their customer care system from basic RAG to a sophisticated multi-agent architecture using LangGraph. The system employs a primary assistant for routing and specialized agents for specific tasks, incorporating safety checks and API integrations. While showing promise in automating customer support tasks like managing calendar access rights, they faced challenges with LLM behavior variance, prompt size limitations, and unstructured data handling, highlighting the importance of robust data structuration and API documentation for production deployment.
Doctolib, a European healthcare technology company known for its medical appointment booking platform, embarked on a journey to revolutionize their customer care services using LLM-based solutions. This case study (Part 2 of a series) documents their evolution from a basic Retrieval-Augmented Generation (RAG) system to a more sophisticated agentic architecture. The work represents an experimental proof-of-concept (POC) rather than a fully deployed production system, with the team openly acknowledging the challenges that remain before achieving production readiness.
The core motivation for moving beyond RAG was to handle more complex customer support tasks that require multi-step reasoning, tool execution, and the ability to perform actions on behalf of users—capabilities that go beyond simple document retrieval and response generation.
The team evaluated several emerging agentic frameworks including CrewAI, AutoGen, and LangGraph. They ultimately selected LangGraph, a multi-agent framework built on top of LangChain, for several reasons:
LangGraph models interactions as cyclical graphs composed of nodes and branches. Each node represents a computation or processing step, which can be either an LLM-based agent or a deterministic function. These graphs enable advanced workflows with multiple loops and conditional logic, making them suitable for complex agent orchestration.
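The node-and-branch model described above can be illustrated with a minimal, framework-agnostic sketch in plain Python (this is not LangGraph's actual API; node names and the state shape are purely illustrative):

```python
from typing import Callable, Dict

# State flows through the graph as a plain dict.
State = Dict[str, object]

def assistant(state: State) -> State:
    # An LLM-backed node would call a model here; this stub simply
    # considers the query resolved once one tool pass has happened.
    state["resolved"] = state.get("tool_calls", 0) >= 1
    return state

def tool_executor(state: State) -> State:
    # A deterministic node: execute the requested tool.
    state["tool_calls"] = state.get("tool_calls", 0) + 1
    return state

def route(state: State) -> str:
    # Conditional branch: loop back to the tool node until resolved.
    return "end" if state["resolved"] else "tools"

nodes: Dict[str, Callable[[State], State]] = {
    "assistant": assistant,
    "tools": tool_executor,
}

def run(state: State) -> State:
    current = "assistant"
    while True:
        state = nodes[current](state)
        if current == "assistant":
            nxt = route(state)
            if nxt == "end":
                return state
            current = nxt
        else:
            current = "assistant"  # cycle back to the assistant node

result = run({"query": "grant calendar access"})
```

The cycle (assistant, conditional branch, tool node, back to assistant) is the loop structure that makes such graphs suitable for multi-step agent orchestration.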
The architecture implements a hierarchical multi-agent system with two types of LLM agents:
Root Assistant (Primary Assistant): This agent serves as the entry point for user interactions. Its responsibilities include greeting users, engaging in conversation until a clear query emerges, and routing the user to the appropriate specialized assistant. The routing mechanism is based on an ML classification model.
Specialized Assistants: Each specialized assistant handles a fixed scope of one or two use cases. This design decision was intentional—by limiting the scope, prompt size, and number of associated tools for each agent, the team aimed to reduce pressure on individual agents and improve their reliability. Specialization enables better performance because agents can focus on their domain expertise.
Each specialized assistant has access to several categories of tools:
Data Fetching Tools: These retrieve contextual information about the user or their query, with the tool documentation specifying which Doctolib database resources are relevant to the user’s question.
FAQ Search Tool: This is essentially the RAG system from the original implementation, now integrated as one tool among many that agents can invoke.
Execution Tools (Sensitive): These tools automate customer support back-end actions required to resolve user issues. They are classified as “sensitive” because they require explicit user validation before execution. The system includes a fact-checking step as a safety net to ensure that tool arguments are properly filled by the specialized assistant before execution.
Task Completion Tools: These signal when a task is complete or canceled, allowing the conversation to loop back to the root assistant.
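The tool taxonomy above can be sketched as follows, with the "sensitive" gate enforced in the dispatcher. The tool names and functions are hypothetical stand-ins, not Doctolib's real tools:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Tool:
    name: str
    run: Callable[..., str]
    sensitive: bool = False  # sensitive tools require explicit user validation

def fetch_user_context(user_id: str) -> str:
    # Data-fetching tool (stubbed): retrieve context about the user.
    return f"context for {user_id}"

def grant_calendar_access(user_id: str, grantee: str) -> str:
    # Execution tool (stubbed): a sensitive back-end action.
    return f"{grantee} granted access to {user_id}'s calendar"

TOOLS: Dict[str, Tool] = {
    "fetch_user_context": Tool("fetch_user_context", fetch_user_context),
    "grant_calendar_access": Tool("grant_calendar_access",
                                  grant_calendar_access, sensitive=True),
}

def invoke(tool_name: str, user_confirmed: bool, **kwargs) -> str:
    tool = TOOLS[tool_name]
    if tool.sensitive and not user_confirmed:
        # Safety net: never execute a sensitive action without validation.
        return "PENDING_USER_VALIDATION"
    return tool.run(**kwargs)
```

With this gate, a specialized assistant can freely call data-fetching tools, but a sensitive execution tool returns a pending status until the user explicitly confirms the action.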
The article provides a concrete demonstration: a user wants to grant their secretary access to their calendar/agenda. The flow follows the architecture described above: the root assistant identifies the query and routes it to the calendar-access specialist, which fetches the relevant context, asks the user to validate the sensitive action, executes it, and then signals completion back to the root assistant.
The team is commendably transparent about the challenges they face in bringing this system to production. These challenges offer valuable insights for practitioners working on similar agentic systems.
One of the most significant issues is the non-deterministic nature of LLMs, which leads to inconsistent agent behavior. Specific problems include agents failing to invoke the correct tool at the right moment and agents executing tools with improperly specified parameters. This unpredictability becomes especially problematic as prompts grow large. The team's mitigation strategy is to reduce the tasks expected of each individual LLM and to limit its degrees of freedom, essentially keeping agents focused and constrained.
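One concrete way to constrain an agent, in line with the fact-checking safety net mentioned earlier, is to validate LLM-proposed tool arguments against a declared schema before execution. The schema and tool name below are illustrative assumptions, not the team's actual implementation:

```python
# Declared argument schema per tool (illustrative).
REQUIRED_ARGS = {
    "grant_calendar_access": {"user_id", "grantee"},
}

def check_tool_call(tool_name: str, args: dict) -> list:
    """Return a list of problems; an empty list means the call may proceed."""
    problems = []
    required = REQUIRED_ARGS.get(tool_name)
    if required is None:
        problems.append(f"unknown tool: {tool_name}")
        return problems
    missing = required - args.keys()
    problems.extend(f"missing argument: {m}" for m in sorted(missing))
    empty = [k for k in required & args.keys() if args[k] in ("", None)]
    problems.extend(f"empty argument: {k}" for k in sorted(empty))
    return problems
```

A deterministic check like this catches the "tool executed with improperly specified parameters" failure mode before any side effect occurs, rather than relying on the LLM alone.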
Agentic architectures require feeding extensive information through the LLM prompt, including tool descriptions, execution details, and message history. This leads to very large prompts, which are susceptible to the “Lost in the Middle” problem—LLMs tend to pay less attention to information in the middle of long contexts. The more information provided, the less likely the LLM is to follow guidelines consistently.
Enriching context around user questions requires agents to extract useful information from unstructured data and interpret it correctly. This task is difficult for models to perform consistently, adding another layer of complexity to reliable system operation.
The team emphasizes that the effectiveness of agentic systems hinges on the quality, completeness, and relevance of underlying data. They identify several key requirements:
Functional Data Structuration: Creating a clear and exhaustive data referential for all scopes and categories the system handles. This includes defining the prompt, context information, tool definitions, and available data tables for each specialized assistant. The goal is to break down user queries into manageable use cases with specific context data, instructions, and definitions that guide the LLM through small, manageable tasks.
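Such a data referential could be represented as a simple registry mapping each use case to its prompt, context tables, and tools. The field names and the example entry are hypothetical, chosen to mirror the calendar-access scenario from the demonstration:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UseCaseSpec:
    """One entry in the functional data referential (names illustrative)."""
    name: str
    prompt: str                      # scoped instructions for the assistant
    context_tables: List[str] = field(default_factory=list)
    tools: List[str] = field(default_factory=list)

REFERENTIAL = {
    "calendar_access": UseCaseSpec(
        name="calendar_access",
        prompt="Help the practitioner manage calendar access rights.",
        context_tables=["users", "calendars", "access_rights"],
        tools=["fetch_user_context", "grant_calendar_access", "faq_search"],
    ),
}

def spec_for(use_case: str) -> UseCaseSpec:
    # Look up everything a specialized assistant needs for one use case.
    return REFERENTIAL[use_case]
```

Keeping this referential explicit and exhaustive is what lets each specialized assistant be configured with a small, well-bounded task.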
API Documentation Quality: To execute actions on behalf of users, the system requires well-documented APIs. The quality of the agentic system depends directly on the quality of code documentation. The team envisions using OpenAPI specifications to directly feed their system, creating a new paradigm where code documentation becomes a valuable data source for the AI system itself.
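The OpenAPI-as-data-source idea could work roughly as follows: walk the spec and emit one tool definition per documented operation. The inline spec below is a made-up minimal example, not Doctolib's real API:

```python
# A minimal, made-up OpenAPI fragment standing in for real documentation.
SPEC = {
    "paths": {
        "/calendars/{id}/access": {
            "post": {
                "operationId": "grantCalendarAccess",
                "summary": "Grant a user access to a calendar",
                "parameters": [
                    {"name": "id", "in": "path", "required": True},
                    {"name": "grantee_id", "in": "query", "required": True},
                ],
            }
        }
    }
}

def tools_from_openapi(spec: dict) -> list:
    # Emit one tool definition per documented operation.
    tools = []
    for path, methods in spec["paths"].items():
        for method, op in methods.items():
            tools.append({
                "name": op["operationId"],
                "description": op.get("summary", ""),
                "parameters": [p["name"] for p in op.get("parameters", [])],
                "endpoint": f"{method.upper()} {path}",
            })
    return tools

tools = tools_from_openapi(SPEC)
```

Under this scheme, the `summary` and parameter descriptions become the text the LLM reads when deciding which tool to call, which is why documentation quality directly bounds system quality.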
Data Governance: Strong governance over key data assets is essential, ensuring datasets remain up-to-date and semantics are harmonized across the organization.
The article honestly outlines significant challenges that remain before this can become a reliable production AI product:
Evaluation Complexity: The system comprises many interconnected components that need individual evaluation to identify performance bottlenecks. The team mentions frameworks like Literal and LangSmith as potential tools for understanding error root causes. However, comprehensive evaluation of multi-agent systems remains an unsolved challenge in the field.
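Evaluating components individually might look like the following sketch, which scores only the routing step against labeled conversations; the keyword router and labels are illustrative stand-ins for a real classifier and dataset:

```python
# Component-level evaluation: score the routing step in isolation.
def keyword_router(query: str) -> str:
    # Stand-in for the ML classification model used for routing.
    if "calendar" in query.lower() or "agenda" in query.lower():
        return "calendar_access"
    return "faq"

LABELED = [
    ("My secretary needs access to my agenda", "calendar_access"),
    ("How do I reset my password?", "faq"),
    ("Share my calendar with a colleague", "calendar_access"),
]

def routing_accuracy(router, examples) -> float:
    hits = sum(1 for query, expected in examples if router(query) == expected)
    return hits / len(examples)

accuracy = routing_accuracy(keyword_router, LABELED)
```

Per-component scores like this make it possible to attribute an end-to-end failure to routing, tool selection, or argument filling, which is exactly the bottleneck analysis the team describes needing.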
Production Deployment Dependencies: Deploying the framework requires strong collaboration and synchronization across multiple teams including design, feature teams, product management, and ML platform teams. LangGraph is described as a new and evolving library with few real production use cases to reference. Achieving the required robustness level for production confidence is an ongoing effort.
Organizational Change: Making the system scalable requires smart design, team synchronization, and excellent data structuration and documentation. This necessitates organizational change across the company to develop and maintain properly governed data assets.
While the case study presents promising concepts and an interesting architectural approach, it's important to note several caveats: the system remains an experimental POC rather than a deployed product, the underlying framework is young with few production references, and rigorous end-to-end evaluation is still outstanding.
That said, the transparency about challenges and limitations adds credibility to the case study. The architectural decisions—particularly the specialized agent approach to reduce complexity and the sensitive tool validation pattern—represent thoughtful design choices that could inform similar implementations elsewhere.
The Doctolib case study offers several valuable lessons: narrowly scoped agents with constrained tool sets are more reliable than monolithic prompts; sensitive actions demand explicit user validation backed by a fact-checking safety net; large prompts degrade instruction-following; and the quality of data structuration and API documentation ultimately bounds what an agentic system can achieve in production.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
OpenAI's Forward Deployed Engineering (FDE) team, led by Colin Jarvis, embeds with enterprise customers to solve high-value problems using LLMs and deliver production-grade AI applications. The team focuses on problems worth tens of millions to billions in value, working with companies across industries including finance (Morgan Stanley), manufacturing (semiconductors, automotive), telecommunications (T-Mobile, Klarna), and others. By deeply understanding customer domains, building evaluation frameworks, implementing guardrails, and iterating with users over months, the FDE team achieves 20-50% efficiency improvements and high adoption rates (98% at Morgan Stanley). The approach emphasizes solving hard, novel problems from zero-to-one, extracting learnings into reusable products and frameworks (like Swarm and Agent Kit), then scaling solutions across the market while maintaining strategic focus on product development over services revenue.
Amazon teams faced challenges in deploying high-stakes LLM applications across healthcare, engineering, and e-commerce domains where basic prompt engineering and RAG approaches proved insufficient. Through systematic application of advanced fine-tuning techniques including Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and cutting-edge reasoning optimizations like Group Relative Policy Optimization (GRPO) and Direct Advantage Policy Optimization (DAPO), three Amazon business units achieved production-grade results: Amazon Pharmacy reduced dangerous medication errors by 33%, Amazon Global Engineering Services achieved 80% human effort reduction in inspection reviews, and Amazon A+ Content improved quality assessment accuracy from 77% to 96%. These outcomes demonstrate that approximately one in four high-stakes enterprise applications require advanced fine-tuning beyond standard techniques to achieve necessary performance levels in production environments.