Anterior, a healthcare AI company, developed a scalable evaluation system for their LLM-powered prior authorization decision support tool. They faced the challenge of maintaining accuracy while processing over 100,000 medical decisions daily, where errors could have serious consequences. Their solution combines real-time reference-free evaluation using LLMs as judges with targeted human expert review, achieving an F1 score of 96% while keeping their clinical review team under 10 people, compared to competitors who employ hundreds of nurses.
Anterior's case study presents a compelling example of implementing LLMs in a highly regulated and critical healthcare environment, specifically for prior authorization decisions. This case study is particularly valuable as it demonstrates how to scale LLM operations while maintaining high accuracy in a domain where errors can have serious consequences.
The company's core challenge revolves around automating prior authorization decisions in healthcare, where their system needs to analyze medical records and guidelines to determine whether treatment requests should be approved or reviewed by clinicians. What makes this case especially interesting from an LLMOps perspective is their innovative approach to evaluation and quality assurance at scale.
The journey begins with the typical MVP challenges that many organizations face when deploying LLMs in production. While creating an initial product with LLMs is relatively straightforward, scaling to handle hundreds of thousands of daily requests while maintaining accuracy presents significant challenges. The case study highlights a specific example where the system had to interpret medical nuance correctly, such as distinguishing a finding "suspicious for MS" from a confirmed MS diagnosis - the kind of edge case that becomes increasingly common as request volume grows.
Their LLMOps journey can be broken down into several key components:
Initial Evaluation Approach:
* They built an internal review dashboard called "scalp" for human review of AI outputs
* The dashboard was designed for efficiency, presenting all necessary context without scrolling
* Human reviewers could quickly add critiques and label incorrect responses
* These reviews generated ground truths for offline evaluations
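The case study doesn't describe the dashboard's data model, but a minimal sketch of the kind of record such a tool might emit - a critique plus a corrected label that becomes an offline evaluation example - could look like the following (all field names and values are illustrative assumptions, not Anterior's schema):

```python
from dataclasses import dataclass
import json

@dataclass
class ReviewRecord:
    """One human review of an AI output, as captured by a review dashboard."""
    case_id: str
    ai_decision: str         # e.g. "approve" or "escalate_to_clinician"
    reviewer_decision: str   # the expert's corrected label (ground truth)
    critique: str            # free-text note on what the model got wrong
    is_correct: bool

def to_eval_example(record: ReviewRecord) -> dict:
    """Turn a dashboard review into an offline evaluation example."""
    return {
        "case_id": record.case_id,
        "expected": record.reviewer_decision,
        "notes": record.critique,
    }

review = ReviewRecord(
    case_id="case-123",
    ai_decision="approve",
    reviewer_decision="escalate_to_clinician",
    critique="Record says 'suspicious for MS', not a confirmed diagnosis.",
    is_correct=False,
)
print(json.dumps(to_eval_example(review), indent=2))
```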
Scaling Challenges:
The case study clearly illustrates the scaling problems with pure human review:
* At 1,000 decisions per day, reviewing 50% of cases required 5 clinicians
* At 10,000 decisions, maintaining the same review percentage would require 50 clinicians
* At 100,000 decisions, even reviewing 5% of cases would require an unsustainable number of reviewers
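The arithmetic behind these figures is linear. Assuming roughly 100 reviews per clinician per day - an assumption implied by the 1,000-decision example above, not a number stated in the case study - headcount grows in direct proportion to volume and review rate:

```python
def clinicians_needed(decisions_per_day: int, review_fraction: float,
                      reviews_per_clinician_per_day: int = 100) -> float:
    """Human reviewers required to cover a given fraction of daily decisions."""
    return decisions_per_day * review_fraction / reviews_per_clinician_per_day

# Reproduces the scaling problem described above (100 reviews/clinician/day
# is an assumed rate, not a figure from the case study).
print(clinicians_needed(1_000, 0.50))    # 5.0 clinicians
print(clinicians_needed(10_000, 0.50))   # 50.0 clinicians
print(clinicians_needed(100_000, 0.50))  # 500.0 clinicians
print(clinicians_needed(100_000, 0.05))  # 50.0 even at a 5% sample
```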
Innovation in Evaluation:
Their solution to the scaling problem involved developing a multi-layered evaluation system (a rough code sketch of the combined flow follows this list):
1. Real-time Reference-free Evaluation:
* Implemented LLMs as judges to evaluate outputs before knowing the true outcome
* Created confidence scoring systems for outputs
* Used both LLM-based and logic-based methods for confidence estimation
* Enabled immediate response to potential issues
2. Dynamic Prioritization System:
* Combined confidence grades with contextual factors (procedure cost, bias risk, error rates)
* Prioritized cases with the highest probability of error for human review
* Created a virtuous cycle where human reviews continuously improve the system
3. Integrated Pipeline:
* Built a system that could route cases based on confidence levels
* Low-confidence cases could be:
* Sent to more expensive, powerful models
* Routed to on-call clinicians
* Surfaced in customer review dashboards
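None of Anterior's internal code is published, but the combination of an LLM judge, a priority score, and triage for human review can be sketched roughly as follows. The judge prompt, the scoring weights, and the `call_llm` helper are all illustrative assumptions rather than details from the case study:

```python
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Stand-in for a real model client call; returns a dummy score here."""
    return "0.5"

JUDGE_PROMPT = """You are reviewing a prior authorization decision.
Medical record excerpt: {record}
Guideline criteria: {guideline}
Proposed decision: {decision}
On a scale from 0 to 1, how confident are you that the decision is correct?
Reply with the number only."""

@dataclass
class Case:
    case_id: str
    record: str
    guideline: str
    decision: str
    procedure_cost: float  # contextual factor: costly procedures reviewed sooner
    bias_risk: float       # contextual factor: 0..1 estimate of bias-sensitive context

def judge_confidence(case: Case) -> float:
    """Reference-free evaluation: an LLM judge grades the output before the true outcome is known."""
    reply = call_llm(JUDGE_PROMPT.format(
        record=case.record, guideline=case.guideline, decision=case.decision))
    return float(reply.strip())

def review_priority(case: Case, confidence: float) -> float:
    """Blend the confidence grade with contextual factors; higher means review sooner."""
    return ((1.0 - confidence) * 0.6
            + case.bias_risk * 0.2
            + min(case.procedure_cost / 10_000, 1.0) * 0.2)

def triage(cases: list[Case]) -> list[tuple[Case, float]]:
    """Order cases so human experts see the likeliest errors first."""
    scored = [(c, review_priority(c, judge_confidence(c))) for c in cases]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The expert reviews produced by this triage then feed back as new ground-truth labels for offline evaluation, which is the virtuous cycle the case study describes.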
The results of their LLMOps implementation are impressive:
* Achieved an F1 score of nearly 96% in prior authorization decisions
* Maintained high accuracy with fewer than 10 clinical experts (compared to competitors employing 800+ nurses)
* Strong alignment between AI and human reviews
* Quick identification and response to errors while meeting SLA requirements
Key LLMOps Principles That Emerged:
* Build with systems thinking - don't just audit performance, build systems that improve the auditing process itself
* Focus on live production data evaluation rather than relying solely on offline evaluations
* Prioritize review quality over quantity
* Invest in custom tooling to improve efficiency
The case study also highlights important considerations for LLMOps in regulated industries:
* The importance of maintaining audit trails
* The need for explainable decision-making
* The balance between automation and human oversight
* The critical nature of error detection and correction
From an architectural perspective, their system demonstrates sophisticated use of LLMs in production (a small routing sketch follows this list):
* Multiple model pipelines with different cost/accuracy tradeoffs
* Real-time evaluation and routing systems
* Integration of human review workflows
* Custom tooling for efficiency
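As a rough illustration of the cost/accuracy routing, such a cascade might look like the following; the model names, threshold, and helper functions are placeholders rather than details from the case study:

```python
# Illustrative cascade: try a cheap model first, escalate on low judge confidence.
CHEAP_MODEL = "small-fast-model"
STRONG_MODEL = "large-accurate-model"
CONFIDENCE_THRESHOLD = 0.9

def run_model(model: str, case_text: str) -> str:
    """Stand-in for invoking a specific model on a case."""
    return "approve"  # dummy decision for the sketch

def judge(case_text: str, decision: str) -> float:
    """Stand-in for the reference-free LLM judge from the earlier sketch."""
    return 0.5  # dummy confidence

def decide(case_text: str) -> dict:
    """Route a case through the tiers: cheap model, stronger model, then humans."""
    decision = run_model(CHEAP_MODEL, case_text)
    if judge(case_text, decision) >= CONFIDENCE_THRESHOLD:
        return {"decision": decision, "route": "auto"}
    # Low confidence: retry with a more expensive, more capable model.
    decision = run_model(STRONG_MODEL, case_text)
    if judge(case_text, decision) >= CONFIDENCE_THRESHOLD:
        return {"decision": decision, "route": "escalated_model"}
    # Still uncertain: hand off to an on-call clinician and surface the case
    # in the customer review dashboard instead of deciding automatically.
    return {"decision": decision, "route": "human_review"}

print(decide("example medical record text"))
```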
What makes this case study particularly valuable is how it addresses the common challenge of scaling LLM operations while maintaining quality. Their solution of combining automated evaluation with targeted human review provides a template that could be adapted for other domains where high accuracy is crucial.
The case study also demonstrates the importance of domain expertise in LLMOps - their understanding of medical nuances informed both their evaluation criteria and their system design. This highlights how successful LLMOps often requires deep integration of domain knowledge with technical capabilities.