Doctolib developed and deployed an AI-powered consultation assistant for healthcare professionals that combines speech recognition, summarization, and medical content codification. Through a comprehensive approach involving simulated consultations, extensive testing, and careful metrics tracking, they took the assistant from MVP to production while maintaining high quality standards. The system achieved widespread adoption and positive feedback through iterative improvements driven by both explicit and implicit user feedback, combining short-term prompt engineering optimizations with longer-term model and data improvements.
Doctolib's case study presents a comprehensive look at developing and deploying an AI-powered consultation assistant in a healthcare setting, offering valuable insights into the challenges and solutions of putting LLMs into production in a highly regulated and sensitive domain.
The consultation assistant combines three key AI components: speech recognition for converting audio to text, LLM-based summarization to create medical consultation summaries, and medical content codification to map summaries to standard ontologies like ICD-10. What makes this case study particularly interesting is their methodical approach to moving from MVP to production while maintaining quality and safety standards.
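To make the pipeline concrete, here is a minimal sketch of how the three components could be chained together. The interfaces and names below are illustrative assumptions, not Doctolib's actual implementation.

```python
# Hypothetical three-stage consultation pipeline: ASR -> summarization -> codification.
# Component interfaces are illustrative assumptions, not Doctolib's architecture.
from dataclasses import dataclass


@dataclass
class ConsultationResult:
    transcript: str          # output of speech recognition
    summary: str             # LLM-generated consultation summary
    icd10_codes: list[str]   # standard codes mapped from the summary


def process_consultation(audio: bytes, asr, summarizer, codifier) -> ConsultationResult:
    """Run one consultation recording through the full pipeline."""
    transcript = asr.transcribe(audio)            # speech-to-text
    summary = summarizer.summarize(transcript)    # LLM summary of the visit
    icd10_codes = codifier.map_to_icd10(summary)  # map findings to ICD-10
    return ConsultationResult(transcript, summary, icd10_codes)
```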
Their LLMOps journey can be broken down into several key phases:
**Initial Development and Testing Phase:**
The team tackled the classic cold-start problem through a creative "fake it to make it" approach. They conducted simulated consultations using internal staff and created fake patient personas to generate initial training and evaluation data. This approach allowed them to begin testing and improving their system before deploying to real users, while also gathering valuable acoustic and content data that mimicked real-world conditions.
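As a rough illustration of what simulated data generation could look like, the sketch below produces fake patient personas that internal staff might use to script mock consultations; every field and value here is invented for the example.

```python
# Generate fake patient personas for simulated consultations.
# All conditions, fields, and values are made up for illustration.
import json
import random

CONDITIONS = ["seasonal allergies", "lower back pain", "hypertension follow-up"]
AGE_BANDS = [(25, 40), (41, 60), (61, 80)]


def make_personas(n: int, seed: int = 42) -> list[dict]:
    rng = random.Random(seed)
    personas = []
    for i in range(n):
        low, high = rng.choice(AGE_BANDS)
        personas.append({
            "persona_id": f"sim-{i:03d}",
            "age": rng.randint(low, high),
            "presenting_complaint": rng.choice(CONDITIONS),
            # Notes handed to the staff member playing the patient in the mock consultation
            "script_notes": "Describe symptom onset, duration, and current medication.",
        })
    return personas


if __name__ == "__main__":
    print(json.dumps(make_personas(3), indent=2))
```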
**Metrics and Evaluation:**
The team implemented a sophisticated evaluation framework combining multiple approaches:
* They used LLMs as automated judges to evaluate summary quality, focusing on critical aspects like hallucination rates and recall (a minimal judging sketch follows this list)
* They discovered that traditional NLP metrics like ROUGE, BLEU, and BERTScore weren't as effective as expected, showing the importance of questioning standard approaches
* They used the F1 score to measure medical content codification accuracy
* They created a comprehensive set of online metrics including power user rate, activation rate, suggestion acceptance rate, and bad experience rate
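The LLM-as-judge idea can be sketched as follows: the judge receives the transcript and the generated summary and returns structured counts that feed hallucination-rate and recall-style metrics. The prompt wording and the `call_llm` helper are assumptions; any chat-completion client could play that role.

```python
# Hedged LLM-as-judge sketch: compare a summary against its source transcript.
# The prompt and `call_llm` interface are assumptions, not the team's actual setup.
import json

JUDGE_PROMPT = """You are evaluating a medical consultation summary.

Transcript:
{transcript}

Summary:
{summary}

Return JSON with two fields:
- "hallucinated_statements": summary statements not supported by the transcript
- "missed_key_facts": clinically relevant facts in the transcript missing from the summary
"""


def judge_summary(transcript: str, summary: str, call_llm) -> dict:
    """Score one summary; returns counts usable for hallucination and recall metrics."""
    raw = call_llm(JUDGE_PROMPT.format(transcript=transcript, summary=summary))
    verdict = json.loads(raw)
    return {
        "hallucination_count": len(verdict["hallucinated_statements"]),
        "missed_fact_count": len(verdict["missed_key_facts"]),
    }
```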
**Feedback Collection System:**
They implemented a dual-track feedback system:
* Explicit feedback through ratings and comments from practitioners
* Implicit feedback tracking user behaviors like deletions, edits, and validations
This approach provided a continuous stream of improvement data while being mindful of healthcare practitioners' time constraints.
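One plausible way to model the two feedback tracks is a single event stream with explicit and implicit event kinds, as in the sketch below; the schema and field names are assumptions made for illustration.

```python
# Illustrative event schema for dual-track feedback; field names are assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal


@dataclass
class FeedbackEvent:
    practitioner_id: str
    consultation_id: str
    kind: Literal["explicit_rating", "explicit_comment",
                  "implicit_edit", "implicit_delete", "implicit_validate"]
    payload: dict = field(default_factory=dict)  # rating value, edit diff, etc.
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def validation_rate(events: list[FeedbackEvent]) -> float:
    """Implicit signal: share of tracked actions that validated a suggestion as-is."""
    implicit = [e for e in events if e.kind.startswith("implicit")]
    if not implicit:
        return 0.0
    return sum(e.kind == "implicit_validate" for e in implicit) / len(implicit)
```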
**Production Testing and Deployment:**
The team implemented a robust A/B testing framework for production changes, using it for both:
* Testing new features and improvements
* "Do no harm" testing to ensure infrastructure changes didn't negatively impact performance
**Continuous Improvement Strategy:**
They implemented a two-tiered improvement approach:
Short-term improvements (days to weeks):
* Quick iterations focused on immediate user frustrations
* Prompt engineering optimizations
* Expert validation of changes
* A/B testing of improvements
* Rapid deployment of successful changes
Long-term improvements (weeks to months):
* Data annotation process refinements
* Weak labeling implementation to expand training data (see the sketch after this list)
* Model fine-tuning optimizations
* Base model selection improvements
* Training data mixture optimization
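For the weak-labeling step, one common pattern is to combine several cheap heuristic labeling functions and only promote examples they agree on into the training mix; the rules below are invented for illustration and are not Doctolib's actual labeling functions.

```python
# Weak-labeling sketch: heuristic labeling functions vote on whether a
# (transcript, summary) pair is good enough to add to training data.
# Each function returns 1 (keep), -1 (reject), or 0 (abstain); all rules are made up.

def lf_summary_not_empty(example: dict) -> int:
    return 1 if example["summary"].strip() else -1


def lf_summary_shorter_than_transcript(example: dict) -> int:
    return 1 if len(example["summary"]) < len(example["transcript"]) else -1


def lf_mentions_medication(example: dict) -> int:
    return 1 if "medication" in example["summary"].lower() else 0  # abstain otherwise


LABELING_FUNCTIONS = [
    lf_summary_not_empty,
    lf_summary_shorter_than_transcript,
    lf_mentions_medication,
]


def weak_label(example: dict, min_votes: int = 2) -> bool:
    """Keep an example only if enough functions vote for it and none reject it."""
    votes = [lf(example) for lf in LABELING_FUNCTIONS]
    return votes.count(1) >= min_votes and -1 not in votes
```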
**Key LLMOps Learnings:**
The case study reveals several important lessons for LLMOps practitioners:
* The importance of starting with a safe MVP even if initial performance isn't perfect
* The value of simulated data for initial testing and development
* The need to validate standard metrics in your specific context rather than blindly applying them
* The benefits of combining automated and human evaluation
* The importance of designing feedback collection systems that don't burden users
* The value of tracking both explicit and implicit feedback signals
* The need to account for various biases in feedback collection (new user bias, memory effect, timing effects)
* The benefits of maintaining parallel short-term and long-term improvement cycles
The team also demonstrated sophisticated understanding of the limitations and challenges in their approach. They acknowledged the diminishing returns of prompt engineering optimizations and the increasing complexity of identifying improvements as the system matures. Their approach to bias management in feedback collection shows a deep understanding of the complexities of deploying AI systems in production.
What makes this case study particularly valuable is its demonstration of how to balance rapid iteration and improvement while maintaining high standards in a healthcare context. The team's approach to safety and quality, combined with their practical solutions to common LLMOps challenges, provides a valuable template for others deploying LLMs in sensitive domains.