Propel developed a sophisticated evaluation framework for testing and benchmarking LLM performance in handling SNAP (food stamp) benefit inquiries. The company created two distinct evaluation approaches: one for benchmarking current base models on SNAP topics, and another for product development. They implemented automated testing using Promptfoo and developed innovative ways to evaluate model responses, including using AI models as judges for assessing response quality and accessibility.
Propel, a company working in the government benefits space, has developed a comprehensive approach to evaluating and implementing LLMs for handling SNAP (Supplemental Nutrition Assistance Program) benefit inquiries. This case study demonstrates a methodical approach to building LLM evaluation frameworks for production use, with particular attention to domain-specific requirements and automated testing processes.
The company's approach is particularly noteworthy because it addresses a critical public service domain where accuracy and accessibility are paramount. They developed a dual-purpose evaluation framework that serves both as a benchmarking tool for assessing general-purpose LLMs and as a product development guide for their specific use case.
Key Technical Components and Methodology:
Their evaluation framework is built on several key pillars:
* Initial Expert Assessment: The process begins with domain experts extensively testing models to understand the characteristics of good and bad outputs. This human-in-the-loop approach helps establish baseline expectations and identify critical areas for automated testing.
* Dual Evaluation Goals:
  * Benchmarking Framework: Focused on objective, factual assessment of how well current base models handle SNAP-related queries. This framework emphasizes consensus-based evaluation and is more tolerant of model refusals when responses could have adverse consequences.
  * Product Development Framework: More opinionated and specific to their product goals, with stricter requirements for model responses and less tolerance for refusal to answer, given their specific use case of assisting SNAP clients.
* Factual Knowledge Testing: They implemented systematic testing of factual knowledge as a foundation, recognizing that accurate basic facts are prerequisites for more complex tasks (a sketch of such a test suite follows this list). This includes testing knowledge of:
  * Benefit amounts
  * Eligibility criteria
  * Program rules and requirements
  * Income limits
  * Current program parameters
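As a rough illustration of what such factual checks can look like in code, the sketch below defines a couple of question/expected-answer pairs and a pass/fail loop. The questions, expected strings, and the `ask_model` stub are hypothetical placeholders, not Propel's actual benchmark data.

```python
# Hypothetical factual test cases for SNAP knowledge checks.
# The questions, expected strings, and ask_model() stub are illustrative
# placeholders, not Propel's actual benchmark data.
FACTUAL_TESTS = [
    {
        "question": (
            "What is the maximum monthly SNAP benefit for a one-person household "
            "in the 48 contiguous states for fiscal year 2025?"
        ),
        "must_contain": "$292",  # assumed expected value, for illustration only
    },
    {
        "question": "Can a household with zero income qualify for SNAP? Answer YES or NO.",
        "must_contain": "YES",
    },
]


def ask_model(question: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    raise NotImplementedError


def run_factual_tests(tests=FACTUAL_TESTS) -> list[dict]:
    """Run each factual test and record a simple pass/fail per question."""
    results = []
    for test in tests:
        answer = ask_model(test["question"])
        results.append(
            {
                "question": test["question"],
                "passed": test["must_contain"].lower() in answer.lower(),
            }
        )
    return results
```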
Technical Implementation Details:
Their evaluation framework is automated using several key technologies and approaches:
* Promptfoo Integration: They chose Promptfoo as their primary evaluation tool, specifically because it allows non-technical domain experts to participate in the evaluation process through familiar interfaces like Google Sheets.
* Automated Testing Implementation (illustrated in the sketches at the end of this section):
  * Simple criteria testing (exact match, keyword presence)
  * YES/NO response validation
  * Multiple choice answer validation
  * Complex response evaluation using AI judges
* AI-Based Response Evaluation: They implemented a novel approach using AI models to evaluate other AI models' responses, particularly for subjective criteria like language accessibility. This meta-evaluation approach allows them to automate the assessment of complex quality criteria that would traditionally require human review.
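The simpler checks in the list above reduce to plain string comparisons. The sketch below is a generic Python illustration of those checks, not Promptfoo's internals; in practice Promptfoo lets evaluators declare comparable assertions (such as exact-match or substring checks) in its configuration so domain experts never touch code.

```python
# Generic illustrations of the simple automated checks described above.
# Function names and normalization rules are illustrative choices, not a
# specific tool's API.

def exact_match(response: str, expected: str) -> bool:
    """Pass only if the response equals the expected answer (case-insensitive)."""
    return response.strip().lower() == expected.strip().lower()


def contains_keyword(response: str, keyword: str) -> bool:
    """Pass if a required keyword (e.g., a dollar amount) appears anywhere."""
    return keyword.lower() in response.lower()


def yes_no_answer(response: str, expected: str) -> bool:
    """Pass if the response leads with the expected YES or NO verdict."""
    words = response.strip().split()
    return bool(words) and words[0].strip(".,!").upper() == expected.upper()


def multiple_choice_answer(response: str, expected_option: str) -> bool:
    """Pass if the response starts with the expected option, e.g. 'B' or 'B) $292'."""
    return response.strip().upper().startswith(expected_option.strip().upper())
```

For the model-graded criteria, one common pattern is to send the candidate answer to a second model along with a rubric and parse a PASS/FAIL verdict. The sketch below assumes the `openai` Python client and an illustrative plain-language rubric; it is not Propel's actual judge prompt.

```python
from openai import OpenAI  # assumes the `openai` package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative rubric; Propel's actual judging criteria are not spelled out here.
JUDGE_RUBRIC = (
    "You are grading an answer to a question from a SNAP client. "
    "Reply PASS if the answer is plausible, uses plain language a typical "
    "benefits recipient could follow, and avoids unexplained jargon. "
    "Otherwise reply FAIL followed by a one-sentence reason."
)


def judge_response(question: str, answer: str, judge_model: str = "gpt-4o") -> bool:
    """Ask a judge model to grade a candidate answer against the rubric."""
    completion = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
    )
    verdict = (completion.choices[0].message.content or "").strip().upper()
    return verdict.startswith("PASS")
```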
Challenges and Solutions:
The case study reveals several interesting challenges in implementing LLMs in production for government services:
* Knowledge Cutoff Issues: They discovered that many models provided outdated information due to training cutoff dates, particularly for time-sensitive information like benefit amounts and income limits.
* Solution Implementation: They addressed this through the following (see the RAG sketch at the end of this section):
  * Retrieval Augmented Generation (RAG) to incorporate current information
  * External knowledge integration systems
  * Large-context-window models that can include up-to-date information in the prompt
* Accessibility Requirements: They implemented specific evaluation criteria for ensuring responses meet plain language requirements, using AI models to assess reading level and accessibility.
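To make the RAG idea concrete, the sketch below retrieves current program parameters from a small in-memory document list and prepends them to the prompt. The documents, the toy keyword retriever, and the `generate` callable are placeholders for illustration, not Propel's production pipeline.

```python
# Minimal RAG sketch: retrieve current SNAP parameters and inject them into the prompt.
# The documents, retriever, and generate() callable are placeholders; a production
# system would use a maintained knowledge base and embedding- or search-based retrieval.
CURRENT_SNAP_DOCS = [
    "FY2025 maximum SNAP allotment, one-person household, 48 contiguous states: $292/month.",
    "SNAP gross monthly income limits are generally set at 130% of the federal poverty level.",
]


def retrieve(question: str, docs=CURRENT_SNAP_DOCS, k: int = 2) -> list[str]:
    """Toy keyword-overlap retriever used purely for illustration."""
    question_terms = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: -len(question_terms & set(d.lower().split())))
    return ranked[:k]


def answer_with_rag(question: str, generate) -> str:
    """Build a prompt that grounds the model in retrieved, current program facts."""
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using only the current program information below.\n\n"
        f"Current information:\n{context}\n\n"
        f"Question: {question}"
    )
    return generate(prompt)
```

On the accessibility side, a model-graded check like the judge sketched earlier can be complemented by a deterministic readability score. The snippet below uses the `textstat` package's Flesch-Kincaid grade as one such signal; the grade-level threshold is an arbitrary illustrative choice, not a documented Propel standard.

```python
import textstat  # assumes the `textstat` package is installed


def meets_plain_language_bar(response: str, max_grade_level: float = 8.0) -> bool:
    """Deterministic readability check to pair with a model-graded accessibility review.

    The 8th-grade threshold is illustrative, not a documented requirement.
    """
    return textstat.flesch_kincaid_grade(response) <= max_grade_level
```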
Production Considerations:
The case study provides valuable insights into production-level LLM implementation:
* Automated Evaluation Pipeline: Their system allows for continuous testing across multiple models (including GPT-4, Claude 3.5 Sonnet, and Gemini 2.0 Flash Experimental)
* Performance Tracking: They maintain comparative performance metrics across different models (e.g., 73% success rate for Claude vs. 45% for other models in their test cases)
* Scalable Testing: Their approach allows for rapid expansion of test cases and evaluation criteria
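Comparative success rates like those noted above fall out of aggregating per-test results by model. A minimal sketch of that aggregation follows; the result records are made up, and in practice they would come from the evaluation tool's exported results.

```python
from collections import defaultdict

# Made-up result records for illustration; real records would come from the
# evaluation tool's exported results rather than being hard-coded.
RESULTS = [
    {"model": "claude-3-5-sonnet", "test_id": "max-benefit-hh1", "passed": True},
    {"model": "gpt-4", "test_id": "max-benefit-hh1", "passed": False},
    {"model": "gemini-2.0-flash-exp", "test_id": "max-benefit-hh1", "passed": True},
]


def pass_rates(records) -> dict[str, float]:
    """Compute per-model pass rates from individual test outcomes."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record["model"]] += 1
        passes[record["model"]] += int(record["passed"])
    return {model: passes[model] / totals[model] for model in totals}


print(pass_rates(RESULTS))  # e.g. {'claude-3-5-sonnet': 1.0, 'gpt-4': 0.0, ...}
```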
Future Developments:
The case study indicates ongoing work in several areas:
* Expanding evaluation criteria for SNAP-specific use cases
* Developing more sophisticated automated evaluation methods
* Improving the integration of domain expertise into the evaluation process
This case study represents a sophisticated approach to implementing LLMs in a critical public service domain, with particular attention to automation, accuracy, and accessibility. The dual-purpose evaluation framework and innovative use of AI for meta-evaluation provide valuable insights for organizations implementing LLMs in regulated or high-stakes environments.