SumUp's LLM Evaluation Framework for Financial Crime Reports
Company and Use Case Overview
SumUp, a financial services company, implemented an LLM-driven solution to automate the generation of financial crime reports (Reasons for Suspicion) used in its Anti-Money Laundering (AML) processes. These reports are critical documents that outline suspicious activity and the supporting evidence when a case is escalated to the authorities. The implementation therefore required robust evaluation methods to ensure the quality and reliability of the generated narratives.
Technical Implementation Details
Challenges in LLM Output Evaluation
The team identified several key challenges in evaluating LLM-generated text:
- Non-deterministic outputs that vary between runs and must still be evaluated consistently
- Subjective nature of language evaluation
- Difficulty in defining objective metrics for "good" responses
- Lack of appropriate benchmarks for LLM applications
- Limitations of traditional NLP metrics
Traditional NLP Metrics Assessment
- Implemented ROUGE score metrics (ROUGE-1, ROUGE-2, ROUGE-L) as a first baseline (a minimal scoring example follows this list)
- Found these traditional metrics inadequate: n-gram overlap with a reference text says little about whether a narrative is factually accurate, complete, or well reasoned
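To make the baseline concrete, the snippet below shows how ROUGE scores are typically computed with the open-source rouge-score package. The reference and generated sentences are invented for illustration and are not SumUp's data.

```python
# Illustrative only: scoring a generated narrative against a reference text
# with the open-source `rouge-score` package (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The customer received multiple high-value transfers from unrelated third parties."
generated = "Multiple large transfers from unrelated third parties were credited to the customer."

# score(target, prediction) returns precision/recall/F1 per ROUGE variant.
scores = scorer.score(reference, generated)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} "
          f"recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```

High lexical overlap can coexist with a narrative that omits key evidence or misstates facts, which is why the team moved beyond these metrics.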
Novel LLM-Driven Evaluation Framework
Developed a comprehensive evaluation system using LLMs as evaluators with:
- Custom benchmark checks tailored to what a good Reasons for Suspicion narrative must contain (sketched below)
- A scoring system in which the evaluator LLM assigns a score and an explanation for each check
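The following sketch shows how such custom benchmark checks can be assembled into an evaluator prompt. The check names, wording, and 1-5 scale are assumptions chosen to mirror the bullets in this summary, not SumUp's internal criteria.

```python
# Hypothetical benchmark checks for a Reasons for Suspicion narrative.
BENCHMARK_CHECKS = {
    "completeness": "Does the narrative cover all flagged transactions and parties?",
    "factual_consistency": "Is every statement supported by the underlying case data?",
    "clarity": "Is the reasoning easy to follow for a compliance reviewer?",
    "tone": "Is the language objective and free of speculation?",
}

SCORING_SCALE = "Score each check from 1 (poor) to 5 (excellent)."


def build_evaluation_prompt(case_facts: str, narrative: str) -> str:
    """Compose the instruction given to the evaluator LLM."""
    checks = "\n".join(f"- {name}: {question}" for name, question in BENCHMARK_CHECKS.items())
    return (
        "You are reviewing a financial crime (Reasons for Suspicion) narrative.\n\n"
        f"Case facts:\n{case_facts}\n\n"
        f"Narrative to evaluate:\n{narrative}\n\n"
        f"Benchmark checks:\n{checks}\n\n"
        f"{SCORING_SCALE} Return a score and a short explanation per check, "
        "plus an overall score and an overall explanation."
    )
```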
Evaluation Results Structure
The framework generates detailed evaluation results including:
- Individual scores for each benchmark check
- Explanations for each score
- Overall general score
- Comprehensive explanation of the overall assessment (a schema sketch follows this list)
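One way to represent these results is a small schema, sketched here with Pydantic. The field names are assumptions that mirror the bullets above, and the example reply is invented for illustration.

```python
# A sketch of the evaluation result structure, modelled with Pydantic v2.
from pydantic import BaseModel


class CheckResult(BaseModel):
    score: int          # score for one benchmark check, e.g. 1-5
    explanation: str    # why the evaluator LLM assigned that score


class EvaluationResult(BaseModel):
    check_results: dict[str, CheckResult]  # keyed by benchmark check name
    general_score: float                   # overall score across checks
    general_explanation: str               # comprehensive assessment summary


# Example: parsing the evaluator LLM's JSON reply into the schema.
raw_reply = """
{
  "check_results": {
    "completeness": {"score": 4, "explanation": "All flagged transfers are mentioned."},
    "clarity": {"score": 5, "explanation": "The reasoning is easy to follow."}
  },
  "general_score": 4.5,
  "general_explanation": "Accurate and readable; minor detail missing on counterparties."
}
"""
result = EvaluationResult.model_validate_json(raw_reply)
print(result.general_score)
```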
Production Implementation
Bias Mitigation Strategies
Implemented several approaches to address known LLM evaluator biases:
- Position swapping to counter position bias (see the sketch after this list)
- Few-shot prompting to calibrate the evaluator's scoring
- Awareness of verbosity and self-enhancement biases
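Position swapping is the most mechanical of these mitigations, so a minimal sketch follows. The `judge` callable stands in for any pairwise LLM comparison and is an assumed interface, not SumUp's implementation.

```python
# Position swapping for pairwise comparison: ask the evaluator twice with the
# candidate order reversed, and only trust a verdict that holds both ways.
from typing import Callable, Optional

# (first_candidate, second_candidate) -> "A" or "B"
Judge = Callable[[str, str], str]


def compare_with_position_swap(judge: Judge, narrative_1: str, narrative_2: str) -> Optional[str]:
    """Return "1", "2", or None (inconsistent verdict, escalate to a human)."""
    first_pass = judge(narrative_1, narrative_2)   # narrative_1 shown in position A
    second_pass = judge(narrative_2, narrative_1)  # narrative_2 shown in position A

    if first_pass == "A" and second_pass == "B":
        return "1"  # narrative_1 preferred in both orderings
    if first_pass == "B" and second_pass == "A":
        return "2"  # narrative_2 preferred in both orderings
    return None     # the verdict flipped with position: likely position bias
```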
Integration with Human Reviewers
- Initial validation through parallel human review
- Strong correlation between automated and manual assessments
- Enabled data scientists to test improvements without increasing agent workload
Testing and Validation
Human Validation Process
- Conducted initial testing with manual agent reviews
- Collected numeric scores and comments from human reviewers
- Compared automated evaluations with human assessments
- Found strong alignment between the automated and human evaluations (illustrated below)
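One simple way to quantify this alignment is a rank correlation between the two sets of scores, as sketched below with SciPy. The score lists are illustrative placeholders, not SumUp's validation data.

```python
# Comparing automated and human scores on the same narratives with
# Spearman rank correlation.
from scipy.stats import spearmanr

human_scores = [4, 5, 3, 2, 5, 4, 3, 4]  # numeric scores from agent reviews
llm_scores   = [4, 5, 3, 3, 5, 4, 2, 4]  # overall scores from the LLM evaluator

correlation, p_value = spearmanr(human_scores, llm_scores)
print(f"Spearman correlation: {correlation:.2f} (p={p_value:.3f})")
```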
Quality Assurance Measures
- Implementation of position swapping for bias reduction
- Calibration through few-shot prompting
- Regular validation against human reviewer feedback
- Continuous monitoring of evaluation consistency
Technical Limitations and Considerations
Framework Limitations
- Application-specific metrics
- Non-standardized scores across different projects
- Potential for LLM-specific biases
Bias Types Identified
- Position bias when comparing results pairwise
- Verbosity bias favoring longer responses regardless of quality
- Self-enhancement bias favoring LLM-generated content
Production Benefits and Outcomes
Efficiency Improvements
- Automated evaluation of large volumes of narratives
- Reduced manual review requirements
- Faster iteration cycles for model improvements
Quality Assurance
- Consistent evaluation criteria
- Detailed feedback for improvements
- Balance between automation and human oversight
Future Development Considerations
Potential Improvements
- Further standardization of metrics
- Enhanced bias mitigation strategies
- Expanded benchmark checks
- Improved cross-project compatibility
Scaling Considerations
- Need for continuous calibration
- Balance between automation and human oversight
- Maintenance of evaluation quality at scale
Best Practices and Recommendations
Implementation Guidelines
- Define clear evaluation criteria
- Implement comprehensive benchmark checks
- Use few-shot prompting for calibration
- Maintain human oversight
- Regularly validate evaluation results against human judgment
Risk Mitigation
- Regular bias assessment and correction
- Continuous monitoring of evaluation quality
- Maintenance of human review processes
- Clear documentation of limitations