Vira Health developed and evaluated an AI chatbot that provides reliable menopause information grounded in peer-reviewed position statements from The Menopause Society. They implemented a RAG (Retrieval Augmented Generation) architecture using GPT-4, with careful attention to clinical safety and accuracy. The system was evaluated by both an AI judge and human clinicians across four criteria: faithfulness, relevance, harmfulness, and clinical correctness, showing promising results on safety and effectiveness while maintaining strict adherence to trusted medical sources.
Vira Health, a digital health company focused on menopause care, presents an interesting case study in the careful development and deployment of LLMs in a healthcare setting. The company's main product, Stella, is an end-to-end digital care pathway for menopause symptoms, and they sought to enhance their offering with an AI chatbot specifically designed to provide reliable menopause information.
Vira Health's journey into LLM deployment began with an exploration of ChatGPT's capabilities shortly after its launch in late 2022. Rather than immediately integrating the technology into their product, they took a methodical approach, starting with a preliminary study of ChatGPT's performance on 26 frequently asked menopause questions. This initial evaluation highlighted both the potential and the limitations of using a general-purpose LLM directly in their healthcare context.
The technical implementation of their solution showcases several important aspects of responsible LLMOps in healthcare:
**Architecture and Technical Details**
* They implemented a RAG (Retrieval Augmented Generation) architecture, which was chosen specifically to ensure that responses would be grounded in trusted medical sources.
* The system was built using vanilla Python rather than existing RAG frameworks like LlamaIndex, a decision made to maintain maximum flexibility and control in this sensitive domain.
* The document processing pipeline, sketched in the example that follows, included:
  * Chunking of position statements into 512-word segments
  * Embedding generation using OpenAI's Ada-002 model
  * Storage in a vector database for efficient retrieval
  * Context assembly from the top five most similar chunks
  * Response generation using GPT-4 (0125 preview version)
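The case study does not publish code, but the pipeline above can be sketched roughly as follows. Only the model names and parameters (512-word chunks, top-5 retrieval, Ada-002, GPT-4 0125 preview) come from the write-up; the chunking logic, the in-memory cosine-similarity search standing in for the vector database, and the prompt wording are assumptions.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def chunk_document(text: str, chunk_size: int = 512) -> list[str]:
    """Split a position statement into ~512-word segments."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def embed(texts: list[str]) -> np.ndarray:
    """Embed texts with OpenAI's Ada-002 embedding model."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

def answer(question: str, chunks: list[str],
           chunk_vectors: np.ndarray, k: int = 5) -> str:
    """Retrieve the top-k most similar chunks and generate a grounded answer."""
    q_vec = embed([question])[0]
    # Cosine similarity in memory; production would query a vector database.
    sims = chunk_vectors @ q_vec / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q_vec))
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[-k:][::-1])
    resp = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[
            {"role": "system", "content": (
                "Answer menopause questions using ONLY the provided "
                "position-statement excerpts. If they do not cover the "
                "question, say so rather than guessing.")},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

At startup, each position statement would be run through `chunk_document` and `embed` once, with the resulting vectors indexed for retrieval.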
**Data and Content Management**
A key aspect of their LLMOps approach was the careful selection and preparation of source material. They specifically used peer-reviewed position statements from The Menopause Society as their trusted knowledge base. These statements are substantial clinical guidance documents (10-20 pages each) that detail current recommended clinical practices based on available evidence.
**Evaluation Framework**
Their evaluation methodology was particularly comprehensive and could serve as a model for other healthcare AI deployments:
* They created a test set of 40 open-ended questions using a hybrid approach:
  * Initial question generation using GPT-4
  * Refinement and validation by clinical experts
* Evaluation was conducted from multiple perspectives:
  * Two human clinicians
  * An AI judge (Claude 2.1 from Anthropic)
  * Automated faithfulness scoring using the "ragas" package (see the sketch after this list)
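The write-up names the "ragas" package but not its configuration; a minimal faithfulness-scoring sketch using the `evaluate` API from the ragas 0.1.x releases might look like this. The sample question, answer, and context are invented placeholders.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

# Each row pairs a question and a generated answer with the retrieved
# position-statement chunks the answer should be grounded in.
eval_data = Dataset.from_dict({
    "question": ["Is hormone therapy appropriate for women over 60?"],
    "answer": ["The position statement advises that ..."],
    "contexts": [["...excerpt from a Menopause Society position statement..."]],
})

result = evaluate(eval_data, metrics=[faithfulness])
print(result)  # e.g. {'faithfulness': 0.95}
```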
The evaluation covered four key criteria, which also serve as the rubric in the judge sketch below:
1. Faithfulness to source material (95% automated score)
2. Relevance to questions (strong alignment between AI and human judges)
3. Potential for harm (consistently rated as having minimal risk)
4. Clinical correctness (high scores from clinical evaluators)
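The judge prompt itself is not published. A minimal LLM-as-judge sketch against the same four criteria, using Anthropic's Messages API with Claude 2.1, could look like the following; the rubric wording and the 1-5 scale are assumptions.

```python
import anthropic

client = anthropic.Anthropic()

JUDGE_PROMPT = """You are a clinical evaluator. Given a menopause question, \
source excerpts, and a chatbot answer, rate the answer from 1 to 5 on:
1. Faithfulness to the source material
2. Relevance to the question
3. Potential for harm (5 = no risk)
4. Clinical correctness
Return one line per criterion in the form `criterion: score`.

Question: {question}
Sources: {context}
Answer: {answer}"""

def judge(question: str, context: str, answer: str) -> str:
    """Ask Claude 2.1 to score a chatbot answer against the four criteria."""
    resp = client.messages.create(
        model="claude-2.1",
        max_tokens=200,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(
                       question=question, context=context, answer=answer)}],
    )
    return resp.content[0].text
```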
**Safety and Quality Controls**
The implementation included several notable safety measures:
* Exclusive use of peer-reviewed position statements as source material
* Multiple layers of evaluation including both automated and human review
* Deliberate use of different LLM providers (OpenAI for generation, Anthropic for evaluation) to reduce systematic biases
* Careful screening for potentially harmful responses and incorrect medical information
**Integration with Existing Systems**
The chatbot was designed to complement Vira Health's existing NLP capabilities, which included:
* Analytics for topic modeling of coach interactions
* Customer insight processing through semantic search (a sketch follows this list)
* Content recommendations within their app
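These earlier systems are not described in detail; as a rough illustration of the semantic-search piece, customer-insight snippets can be matched to a query with the same embedding model used in the RAG pipeline. The helper name, data, and top-k value here are hypothetical.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def search_insights(query: str, insights: list[str], top_k: int = 3) -> list[str]:
    """Return the customer-insight snippets most similar to the query."""
    resp = client.embeddings.create(
        model="text-embedding-ada-002", input=[query] + insights)
    vecs = np.array([item.embedding for item in resp.data])
    q_vec, doc_vecs = vecs[0], vecs[1:]
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    return [insights[i] for i in np.argsort(sims)[-top_k:][::-1]]
```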
**Challenges and Limitations**
The case study candidly acknowledges several challenges:
* The need for broader source material beyond clinical guidelines
* Some misalignment between AI and human evaluators on clinical correctness
* The challenge of prompt engineering for evaluation tasks
* The need for additional guardrails before clinical deployment
**Future Development**
The team identified several areas for future work:
* Exploration of newer LLM models
* Implementation of additional safety guardrails
* Integration of a wider range of trusted peer-reviewed sources
* Further validation before clinical deployment
This case study represents a thoughtful approach to implementing LLMs in a healthcare setting, with particular attention to safety, accuracy, and clinical relevance. The team's methodical approach to evaluation and their focus on trusted sources demonstrate how LLMOps can be adapted for sensitive domains where accuracy and safety are paramount.