Propel is developing a comprehensive evaluation framework for testing how well different LLMs handle SNAP (food stamps) benefit-related queries. The project assesses model accuracy, safety, and appropriateness on complex policy questions while balancing strict policy correctness against practical user needs. They've built testing infrastructure including a Slackbot called Hydra for comparing multiple LLM outputs, and they plan to release their evaluation framework publicly to help improve AI models' performance on SNAP-related tasks.
This case study details Propel's systematic approach to developing and implementing an evaluation framework for Large Language Models (LLMs) in the specific domain of SNAP (Supplemental Nutrition Assistance Program) benefits administration and policy interpretation. The company's work represents an important example of how to thoughtfully deploy LLMs in production for high-stakes government services where accuracy and safety are paramount.
The core challenge Propel faces is ensuring that LLMs can reliably and safely handle SNAP-related queries, where incorrect information could have serious consequences for benefit recipients. Rather than deploying LLMs directly, they've taken a careful, methodical approach to evaluation and testing that offers valuable lessons for similar high-stakes domains.
Key aspects of their LLMOps approach include:
### Evaluation Framework Development
The team has created a structured evaluation framework that goes beyond simple accuracy metrics. Their approach involves:
* Creating automated test cases that validate model responses against known correct answers (a minimal sketch follows this list)
* Developing nuanced evaluation criteria that consider both technical accuracy and practical usefulness
* Building infrastructure to test multiple models simultaneously
* Planning to open-source their evaluation framework to benefit the broader community
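To make the automated test-case idea concrete, here is a minimal sketch of what such a check might look like. The `EvalCase` fields, the `query_model` callable, and the specific pass/fail criteria are illustrative assumptions, not Propel's actual framework.

```python
# Minimal sketch of an automated SNAP eval case (illustrative, not Propel's code).
# `query_model` is any callable that takes a question string and returns the
# model's answer as a string.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalCase:
    query: str                                                  # SNAP question sent to the model
    must_include: list[str] = field(default_factory=list)       # phrases a correct answer should contain
    must_not_include: list[str] = field(default_factory=list)   # claims that would make the answer unsafe or wrong

def run_case(case: EvalCase, query_model: Callable[[str], str]) -> bool:
    """Return True if the model's response satisfies the case's criteria."""
    response = query_model(case.query).lower()
    has_required = all(p.lower() in response for p in case.must_include)
    avoids_banned = all(p.lower() not in response for p in case.must_not_include)
    return has_required and avoids_banned
```

String matching like this is deliberately crude: it works for clear-cut facts, but as the later sections note, nuanced policy questions need richer criteria and eventually LLM-based grading.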
### Testing Infrastructure
Propel has developed a custom testing infrastructure including:
* A Slackbot called Hydra that lets team members compare responses from multiple frontier LLMs side by side (the underlying fan-out pattern is sketched after this list)
* Automated testing scripts that can validate model outputs against predefined criteria
* Systems for tracking model performance across different types of SNAP-related queries
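The fan-out pattern behind a Hydra-style comparison can be sketched in a few lines. The provider callables and the Slack delivery step are stand-ins, since the case study does not describe the bot's internals.

```python
# Fan one SNAP question out to several models in parallel and collect the answers
# side by side, the way a Hydra-style comparison bot would before posting them
# to Slack (the posting step is omitted here).
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def compare_models(question: str, providers: dict[str, Callable[[str], str]]) -> dict[str, str]:
    """providers maps a display name to a callable returning that model's answer."""
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        futures = {name: pool.submit(ask, question) for name, ask in providers.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

A reviewer, or a later automated grader, can then read the returned dictionary as a side-by-side transcript for the same query.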
### Domain Expert Integration
A key insight from their approach is the importance of domain expertise in developing effective LLM systems:
* They begin with extensive manual testing by SNAP policy experts
* Experts help identify subtle nuances that might be missed by purely technical evaluation
* Domain knowledge is used to develop more sophisticated evaluation criteria that consider practical implications
### Safety and Risk Management
The team has implemented several approaches to managing risk:
* Identifying high-risk query types where incorrect information could be particularly harmful
* Developing specific guardrails for these risky scenarios
* Creating "safe fallback" responses for cases where model confidence is low
### Model Selection and Routing
Their system includes sophisticated approaches to model selection:
* Testing different models for specific types of queries
* Considering cost and latency tradeoffs in model selection
* Implementing routing logic to direct different query types to appropriate models (a routing sketch follows this list)
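Routing can start as a plain lookup table that trades accuracy against cost and latency. The query types and model tiers here are placeholders rather than Propel's actual configuration.

```python
# Illustrative routing table: send harder, higher-stakes query types to a larger
# (slower, costlier) model and keep simple lookups on a fast, cheap one.
ROUTES = {
    "policy_interpretation": "frontier-large",    # accuracy matters most
    "eligibility_screening": "frontier-medium",   # moderate complexity
    "office_hours_lookup": "small-fast",          # simple factual retrieval
}

def route_query(query_type: str) -> str:
    return ROUTES.get(query_type, "frontier-medium")      # default to the middle tier
```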
### Context and Prompt Engineering
The team has explored various approaches to providing context to models:
* Testing different combinations of federal and state policy documents (a prompt-assembly sketch follows this list)
* Experimenting with various prompt structures
* Evaluating the impact of different context lengths and types
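One way to experiment with these variables is to keep prompt assembly in a single function whose inputs (document sets, length cap, instructions) can be swapped per test run. The instruction wording and the character cap below are assumptions for the sketch.

```python
# Assemble a prompt from federal and state policy excerpts with a configurable
# context cap, so different document combinations and lengths can be A/B tested.
def build_prompt(question: str, federal_docs: list[str], state_docs: list[str],
                 max_chars: int = 20_000) -> str:
    context = "\n\n".join(federal_docs + state_docs)[:max_chars]   # crude length cap for the experiment
    return (
        "You answer questions about SNAP benefits using only the policy excerpts below. "
        "Note explicitly when a rule varies by state.\n\n"
        f"Policy excerpts:\n{context}\n\n"
        f"Question: {question}"
    )
```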
### Continuous Improvement Process
Their approach includes mechanisms for ongoing improvement:
* Regular testing of new model versions against their evaluation framework (a regression-check sketch follows this list)
* Documentation of failure modes and edge cases
* Systematic collection of test cases and examples
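Tying these pieces together, a regression pass over the collected cases (reusing the `EvalCase`/`run_case` shapes from the earlier sketch) can flag queries that a new model version now gets wrong. This is a hypothetical harness, not Propel's pipeline.

```python
# Rerun collected eval cases against a candidate model and report regressions
# relative to a baseline of previously passing cases.
from typing import Callable, Iterable

def regression_report(
    cases: Iterable,                            # EvalCase objects from the earlier sketch
    run_case: Callable,                         # run_case(case, query_model) -> bool
    candidate_model: Callable[[str], str],      # the new model version under test
    baseline: dict[str, bool],                  # query -> did it pass with the current model?
) -> list[str]:
    regressions = []
    for case in cases:
        if baseline.get(case.query, False) and not run_case(case, candidate_model):
            regressions.append(case.query)      # passed before, fails now
    return regressions
```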
### Practical Implementation Examples
The case study provides concrete examples of their evaluation approach, such as their "asset limits" test case (sketched after this list), which demonstrates:
* How they handle complex policy questions with state-specific variations
* The balance between technical accuracy and practical usefulness
* Methods for evaluating model responses on nuanced policy topics
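Using the `EvalCase` shape from the first sketch, an asset-limits style check might look like the following. The criteria are illustrative rather than Propel's published test case: the point is that a good answer should flag state-by-state variation (many states effectively waive the federal asset test through broad-based categorical eligibility) instead of quoting a single nationwide number.

```python
# Hypothetical asset-limits check: reward answers that acknowledge state-by-state
# variation and penalize answers that assert one uniform nationwide limit.
asset_limit_case = EvalCase(
    query="Is there an asset limit for SNAP?",
    must_include=["varies by state", "your state"],
    must_not_include=["the same limit applies in every state"],
)

# result = run_case(asset_limit_case, query_model)
```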
### Future Developments
The team is working on several advanced features:
* Using LLMs to evaluate other LLMs' outputs (an LLM-as-judge sketch follows this list)
* Developing more sophisticated automated evaluation techniques
* Creating public versions of their evaluation framework
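A first cut at LLM-as-judge grading can be a rubric prompt plus a PASS/FAIL parse. The rubric text and the `judge_model` callable are assumptions for illustration, not Propel's technique as published.

```python
# Use a second model to grade an answer against a rubric and reference notes,
# returning a boolean verdict. Illustrative only.
from typing import Callable

JUDGE_RUBRIC = (
    "You are grading an answer to a SNAP policy question.\n"
    "PASS only if the answer is consistent with the reference notes, acknowledges "
    "state-by-state variation where relevant, and does not make a definitive "
    "eligibility determination for the user.\n"
    "Reply with exactly PASS or FAIL."
)

def judge(question: str, answer: str, reference_notes: str,
          judge_model: Callable[[str], str]) -> bool:
    prompt = (
        f"{JUDGE_RUBRIC}\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Reference notes: {reference_notes}"
    )
    return judge_model(prompt).strip().upper().startswith("PASS")
```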
### Lessons and Best Practices
Key takeaways from their experience include:
* The importance of domain expertise in developing effective LLM systems
* The value of systematic evaluation frameworks
* The need to balance multiple competing concerns in high-stakes applications
* The benefits of transparent, shared evaluation frameworks
This case study provides valuable insights for organizations looking to deploy LLMs in complex, high-stakes domains where accuracy and safety are crucial. Their methodical approach to evaluation and testing, combined with their focus on domain expertise and practical usefulness, offers a model for responsible LLM deployment in government services and other critical applications.