SumUp developed an LLM application to automate the generation of financial crime reports, along with a novel evaluation framework using LLMs as evaluators. The solution addresses the challenges of evaluating unstructured text output by implementing custom benchmark checks and scoring systems. The evaluation framework outperformed traditional NLP metrics and showed strong correlation with human reviewer assessments, while acknowledging and addressing potential LLM evaluator biases.
SumUp, a financial technology company offering payment services, faces regulatory compliance requirements including Anti-Money Laundering (AML) transaction monitoring. When compliance agents identify suspicious account activity, they must escalate and report it to authorities by writing detailed financial crime reports called “Reason for Suspicion” narratives. These reports outline observed behavior, provide supporting evidence, and draw from multiple sources including ML model predictions, investigation notes, and verification processes.
The company developed an LLM-powered application to automate the generation of these financial crime reports, significantly reducing the time agents spend on repetitive documentation tasks. However, a critical LLMOps challenge emerged: how do you effectively evaluate the quality of free-text narratives produced by an LLM in a production environment?
The case study focuses primarily on the evaluation methodology for LLM-generated text, which represents a fundamental challenge in deploying LLMs for production use cases. Unlike traditional ML models that produce numerical predictions with well-established mathematical evaluation methods, assessing narrative quality involves qualitative and subjective aspects.
The team identified several key challenges unique to evaluating LLM applications in production.
The team initially tested traditional NLP metrics, specifically ROUGE scores (measuring n-gram overlap between generated and reference text). Their findings demonstrated that these metrics were inadequate for the use case. When comparing an accurately generated text against an inaccurately generated one, the ROUGE scores showed minimal differences:
For inaccurately generated text, they observed ROUGE-1, ROUGE-2, and ROUGE-L metrics with precision, recall, and F-scores. The accurately generated text achieved higher scores, but the differences between the two opposing examples were minimal. The team concluded that traditional metrics could not effectively differentiate between good and bad outputs, especially for texts with subtle differences. These metrics also failed to check text structure or compare the presence or absence of specific information—critical requirements for financial crime reports.
The fundamental issue is that metrics like ROUGE concentrate on semantic similarity or n-gram patterns without considering broader context: Does the report provide supporting evidence? Is the structure adequate? Does it cover relevant topics? These are the questions that matter in production but that traditional metrics cannot answer.
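The limitation described above is easy to reproduce. Below is a minimal, self-contained ROUGE-1 sketch (not SumUp's actual pipeline; the example sentences are invented) showing how a single swapped word that changes the meaning of a narrative barely moves the n-gram overlap scores:

```python
from collections import Counter

def rouge_1(reference: str, generated: str) -> dict:
    """ROUGE-1: unigram overlap between a reference and a generated text."""
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    # Clipped count of overlapping unigrams.
    overlap = sum((ref_counts & gen_counts).values())
    precision = overlap / max(sum(gen_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "merchant processed many failed transactions at unusual hours"
accurate = "merchant processed many failed transactions at unusual hours overnight"
# One swapped word ("normal" vs "unusual") reverses the claim,
# yet the unigram overlap remains almost identical.
inaccurate = "merchant processed many failed transactions at normal hours overnight"

print(rouge_1(reference, accurate)["f1"])    # ~0.94
print(rouge_1(reference, inaccurate)["f1"])  # ~0.82
```

Both texts score highly and the gap is small, even though one narrative asserts the opposite of the reference, which is exactly the failure mode the team observed.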
Rather than relying on human review (which is time-consuming and impractical at scale), the team developed an LLM-driven evaluation approach. This methodology uses an LLM to evaluate another LLM’s output, a technique often called “LLM-as-a-judge.”
The evaluator was designed with custom benchmark checks embedded in a prompt, each focusing on a specific quality criterion for financial crime reports, such as whether the covered topics match the reference, customer data is present, suspicious indicators are consistent, the conclusion is clear, and no facts are invented.
The evaluator was instructed to produce a numerical score from 0 to 5 for each criterion, where 5 represents excellent coverage and 0 very poor coverage. Each score is accompanied by an explanation, providing both quantitative metrics for comparison and qualitative insights for debugging.
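A prompt of this shape can be sketched as follows. The criterion names besides "check_facts_are_not_invented" (which the case study quotes) are hypothetical placeholders inferred from the quality aspects described; SumUp's exact check list and prompt wording are not public:

```python
# Hypothetical check names and questions; only "check_facts_are_not_invented"
# is confirmed by the case study.
CRITERIA = {
    "check_topics_match_reference": "Do the covered topics match the reference narrative?",
    "check_customer_data_present": "Is the relevant customer data present?",
    "check_indicators_consistent": "Are the suspicious indicators consistent with the evidence?",
    "check_conclusion_is_clear": "Does the narrative end with a clear conclusion?",
    "check_facts_are_not_invented": "Does the text avoid facts absent from the reference?",
}

def build_judge_prompt(reference: str, generated: str) -> str:
    """Assemble an evaluator prompt asking for 0-5 scores with explanations."""
    checks = "\n".join(f"- {name}: {question}" for name, question in CRITERIA.items())
    return (
        "You are evaluating a generated financial crime narrative against a reference.\n\n"
        f"Reference:\n{reference}\n\n"
        f"Generated:\n{generated}\n\n"
        "For each check below, return a score from 0 (very poor) to 5 (excellent) "
        "and a one-sentence explanation, as JSON of the form "
        '{"scores": {check_name: {"score": int, "explanation": str}}}.\n\n'
        f"Checks:\n{checks}"
    )

prompt = build_judge_prompt("<reference narrative>", "<generated narrative>")
```

The resulting string would be sent to the evaluator LLM; asking for machine-readable JSON is what enables the automated scoring described below.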
The LLM-driven evaluator demonstrated clear differentiation between high-quality and low-quality generated text. An accurately generated narrative received an average general score of 4.67, with scores of 4-5 across all criteria. The evaluator correctly identified that topics matched the reference, customer data was present, suspicious indicators were consistent, and the conclusion was clear.
In contrast, an inaccurately generated text (which included invented information about police reports and illicit activities not present in the reference) received an average score of 2.5. The evaluator correctly flagged the hallucinated facts, giving a score of 1 for “check_facts_are_not_invented” and explaining: “The generated text mentions police reports and illicit activities, which are not present in the reference text. This is a clear sign of invented facts.”
Crucially, the team validated their automated evaluation against real agent feedback. They ran an initial iteration where agents manually reviewed and scored each LLM-generated narrative. The results showed that the automated text generation evaluation was often closely related to the comments and scores provided by human agents. Furthermore, the improvement areas identified by humans closely matched those highlighted by the automated evaluation.
The case study demonstrates awareness of the limitations and biases inherent in using LLMs as evaluators, referencing the academic study “Judging LLM-as-a-judge with MT-Bench and Chatbot Arena” by Zheng et al. Three key biases are identified: position bias (favoring a response based on its position in the prompt), verbosity bias (favoring longer responses), and self-enhancement bias (favoring text the evaluator model itself might have generated).
To mitigate these biases, the team implemented specific strategies, including swapping the positions of the compared texts between evaluation runs and using few-shot prompting to calibrate the evaluator.
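The position-swapping strategy can be sketched as follows. The `judge` callable here is a hypothetical stand-in for a wrapped evaluator LLM call; the stub below deliberately leaks a small position bias to show how averaging the two orderings cancels it:

```python
from statistics import mean
from typing import Callable

def debiased_score(judge: Callable[[str, str], float],
                   reference: str, generated: str) -> float:
    """Mitigate position bias: score with both orderings and average.

    `judge(first, second)` is a hypothetical wrapper around the evaluator
    LLM that returns a 0-5 score for the generated text.
    """
    forward = judge(reference, generated)
    swapped = judge(generated, reference)
    return mean([forward, swapped])

GENERATED = "generated narrative"
REFERENCE = "reference narrative"

def biased_judge(first: str, second: str) -> float:
    # Stub: systematically favors whichever text appears second.
    return 4.0 + (0.5 if second == GENERATED else -0.5)

print(debiased_score(biased_judge, REFERENCE, GENERATED))  # 4.0: the bias cancels
```

In production one would also compare the two scores directly: a large gap between orderings is itself a signal that the evaluation is position-sensitive and should be reviewed.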
The team also noted that these metrics are application-specific and cannot be used to compare across different projects. A score of 3 in one project does not equate to the same score in another unrelated project, making standardization across the organization challenging.
The solution maintains a human-in-the-loop approach for the actual report generation. The case study emphasizes that “prior human confirmation and investigation are paramount, as the consequences of automatically raising an alert without verification have an extremely high impact.” The LLM assists agents in documentation after they have confirmed suspicious activity, rather than automatically generating alerts.
The automated evaluation method serves a specific role in the MLOps workflow: it empowers data scientists to test model improvements without adding extra workload to compliance agents. Agents can concentrate on reviewing final results rather than participating in every evaluation iteration during development.
The evaluation system outputs structured JSON responses containing individual criterion scores, explanations for each score, and an aggregated general score with overall explanation. This format enables automated processing and monitoring of evaluation results at scale.
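Consuming that output is straightforward. A minimal sketch of parsing the judge's JSON and computing the general score (field names are illustrative except "check_facts_are_not_invented", which the case study quotes; SumUp's exact schema is not public):

```python
import json
from statistics import mean

# Example evaluator response in the structured format described above.
raw_response = json.dumps({
    "scores": {
        "check_topics_match_reference": {
            "score": 5, "explanation": "Topics match the reference."},
        "check_facts_are_not_invented": {
            "score": 1, "explanation": "Mentions police reports absent from the reference."},
    }
})

def aggregate(response: str) -> dict:
    """Parse the judge's JSON and compute the general (average) score."""
    parsed = json.loads(response)
    per_check = {name: item["score"] for name, item in parsed["scores"].items()}
    return {"per_check": per_check, "general_score": mean(per_check.values())}

result = aggregate(raw_response)
print(result["general_score"])  # average of the per-check scores
```

Because the output is structured, these scores can be logged and monitored over time like any other production metric, rather than read one explanation at a time.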
The team used synthetic but realistic examples for demonstration, such as a merchant named “Donald Duck” operating a beauty/barber business with suspicious transaction patterns including failed transactions, unusual processing hours, and excessive chargebacks. This approach protected real customer data while validating the evaluation methodology.
While the case study presents a practical approach to LLM evaluation in production, some considerations should be noted. The reliance on LLM-as-a-judge creates a dependency on the evaluator LLM's capabilities and potential biases. The team acknowledges this, but the mitigation strategies (position swapping, few-shot prompting) may not fully eliminate all biases. Additionally, the correlation between automated and human evaluation was described qualitatively ("closely related," "closely matched") rather than with precise statistical measures, which would strengthen the validation claims.
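One way to make that validation quantitative would be a rank correlation between automated and human scores. A from-scratch Spearman sketch, using entirely made-up paired scores for illustration:

```python
def spearman(xs, ys):
    """Spearman rank correlation, with average ranks for ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            # Extend j over a run of tied values.
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average rank for the tie group (1-based)
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical paired scores: automated judge vs. human agents.
auto_scores = [4.7, 2.5, 3.8, 4.2, 1.9]
human_scores = [5, 2, 4, 4, 2]
print(round(spearman(auto_scores, human_scores), 2))  # 0.95
```

Reporting a figure like this alongside the qualitative observations would let readers judge how strong the agreement actually is.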
The approach of using application-specific benchmark checks is sensible for specialized domains like financial crime reporting, where generic LLM benchmarks would indeed be insufficient. However, organizations adopting this approach should invest in defining robust criteria specific to their use case and validating against sufficient human evaluation samples.
Nubank, one of Brazil's largest banks serving 120 million users, implemented large-scale LLM systems to create an AI private banker for their customers. They deployed two main applications: a customer service chatbot handling 8.5 million monthly contacts with 60% first-contact resolution through LLMs, and an agentic money transfer system that reduced transaction time from 70 seconds across nine screens to under 30 seconds with over 90% accuracy and less than 0.5% error rate. The implementation leveraged LangChain, LangGraph, and LangSmith for development and evaluation, with a comprehensive four-layer ecosystem including core engines, testing tools, and developer experience platforms. Their evaluation strategy combined offline and online testing with LLM-as-a-judge systems that achieved 79% F1 score compared to 80% human accuracy through iterative prompt engineering and fine-tuning.
Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.