Fintool, an AI equity research assistant, faced the challenge of processing massive amounts of financial data (1.5 billion tokens across 70 million document chunks) while maintaining high accuracy and trust for institutional investors. They implemented a comprehensive LLMOps evaluation workflow using Braintrust, combining automated LLM-based evaluation, golden datasets, format validation, and human-in-the-loop oversight to ensure reliable and accurate financial insights at scale.
Fintool represents an interesting case study in implementing LLMOps practices in the highly regulated and trust-sensitive financial sector. The company developed an AI equity research assistant that processes and analyzes vast amounts of unstructured financial data, including SEC filings and earnings call transcripts, to provide insights to institutional investors through their Fintool Feed product.
The core challenge they faced was maintaining high quality and trustworthiness while processing enormous volumes of data: 1.5 billion tokens across 70 million document chunks, with daily updates measured in gigabytes. This scale, combined with the diverse nature of user prompts (ranging from broad compliance monitoring to specific disclosure tracking) and the absolute requirement for accuracy in financial contexts, necessitated a robust LLMOps approach.
Their LLMOps implementation centered around a continuous evaluation workflow with several key components:
Quality Standards and Format Validation:
The foundation of their approach is a strict set of quality standards and format rules. Every generated insight must include verifiable source information, such as SEC document IDs. Their automated validation checks not just that sources are present, but also that they are correctly formatted and directly tied to the insights they support. This is particularly crucial in financial services, where traceability is paramount. Rendering citations as span iframes within trace spans also keeps the validation process efficient for human reviewers.
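To make this concrete, a source-format check of this kind can be expressed as a deterministic scoring function. The sketch below is illustrative only: the field names and insight schema are assumptions, not Fintool's actual data model, though the accession-number pattern matches the standard SEC format.

```python
import re

# SEC accession numbers follow the pattern 0001234567-24-000123
# (10-digit filer ID, 2-digit year, 6-digit sequence).
ACCESSION_RE = re.compile(r"^\d{10}-\d{2}-\d{6}$")

def validate_citations(insight: dict) -> dict:
    """Deterministic format check: every insight must carry at least one
    well-formed SEC document ID that is actually referenced in its text.
    The `sources`/`text` fields are a hypothetical schema for this sketch."""
    sources = insight.get("sources", [])
    if not sources:
        return {"name": "citation_format", "score": 0, "reason": "no sources attached"}

    for src in sources:
        doc_id = src.get("sec_document_id", "")
        if not ACCESSION_RE.match(doc_id):
            return {"name": "citation_format", "score": 0,
                    "reason": f"malformed SEC document ID: {doc_id!r}"}
        if doc_id not in insight.get("text", ""):
            return {"name": "citation_format", "score": 0,
                    "reason": f"source {doc_id} never cited in the insight body"}

    return {"name": "citation_format", "score": 1,
            "reason": "all sources well-formed and cited"}
```

A check like this can run on every generated insight before it reaches users, and its output can be logged alongside the trace for later review.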
Golden Dataset Management:
A notable aspect of their LLMOps practice is the sophisticated approach to golden datasets. Rather than using static test sets, they maintain dynamic golden datasets that combine:
* Production logs reflecting real-world usage patterns
* Carefully selected examples representing specific industry scenarios
* Regular updates to match their massive daily data processing requirements
This approach to maintaining evaluation datasets helps ensure their testing remains relevant as their system scales and evolves.
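One way to picture this is a periodic job that merges a sample of recent production logs with a curated set of industry scenarios into a dated golden dataset. The function and field names below are assumptions for illustration, not Fintool's actual pipeline.

```python
import random
from datetime import date

def build_golden_dataset(production_logs: list[dict],
                         curated_examples: list[dict],
                         sample_size: int = 200) -> dict:
    """Merge a random sample of production traffic with hand-picked
    industry scenarios into a dated golden dataset."""
    sampled = random.sample(production_logs, min(sample_size, len(production_logs)))
    records = [
        {"input": r["prompt"], "expected": r.get("reviewed_output"), "origin": "production"}
        for r in sampled
    ] + [
        {"input": e["prompt"], "expected": e["expected"], "origin": "curated"}
        for e in curated_examples
    ]
    return {"name": f"golden-{date.today().isoformat()}", "records": records}
```

Versioning the dataset by date keeps regression comparisons meaningful as the underlying document corpus grows.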
Automated Evaluation System:
Their implementation of LLM-as-a-judge evaluation is particularly interesting. They created automated scoring functions for key metrics including accuracy, relevance, and completeness. The example provided in the case study shows their approach to format validation:
* Custom prompts designed to validate specific aspects of the output
* Binary classification (PASS/FAIL) for clear decision making
* Integration with their broader evaluation pipeline
The automation of these evaluations helps maintain consistent quality while scaling to millions of insights.
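A minimal sketch of such a judge is shown below, assuming an OpenAI-compatible client and a hypothetical grading prompt; the case study does not disclose Fintool's actual prompts or model choice, so both are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; model choice below is illustrative

JUDGE_PROMPT = """You are grading an AI-generated equity research insight.
Insight: {insight}
Cited source excerpt: {source}

Answer PASS if the insight is accurate, relevant, and fully supported by the
cited source; otherwise answer FAIL. Respond with a single word."""

def llm_judge(insight: str, source: str) -> dict:
    """Binary LLM-as-a-judge scorer: returns 1 for PASS, 0 for FAIL."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(insight=insight, source=source)}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return {"name": "llm_judge",
            "score": 1 if verdict.startswith("PASS") else 0,
            "verdict": verdict}
```

In a Braintrust-style evaluation pipeline, a scorer like this would run alongside deterministic checks such as the citation validator sketched earlier, with both sets of scores attached to the same trace.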
Human-in-the-Loop Integration:
Their human oversight system demonstrates a well-thought-out approach to combining automated and human evaluation:
* Automated scoring triggers human review for low-scoring content
* Direct integration between their database and the evaluation UI
* Ability for experts to make immediate corrections and updates
* User feedback (downvotes) incorporated into the review triggering system
This integration of human oversight helps catch edge cases and maintain quality while keeping the process efficient and scalable.
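The triggering logic itself can be simple: a score threshold plus a downvote check. The threshold value and queue interface in the sketch below are assumptions for illustration rather than Fintool's implementation.

```python
REVIEW_THRESHOLD = 0.7  # assumed cutoff; the real value would be tuned in production

def route_for_review(insight: dict, scores: dict, downvotes: int,
                     review_queue: list) -> bool:
    """Push an insight onto the human review queue if any automated score
    falls below the threshold or a user has downvoted it."""
    low_score = any(s < REVIEW_THRESHOLD for s in scores.values())
    if low_score or downvotes > 0:
        review_queue.append({
            "insight_id": insight["id"],
            "scores": scores,
            "downvotes": downvotes,
            "reason": "low automated score" if low_score else "user downvote",
        })
        return True
    return False
```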
Technical Infrastructure:
While the case study doesn't detail their entire technical stack, several key elements are apparent:
* Large-scale document processing pipeline handling billions of tokens
* Real-time monitoring and evaluation systems
* Integration between production systems and evaluation tools
* Custom validation rules and scoring functions
* Direct database integration for content updates
Production Results and Observations:
The implementation of this LLMOps workflow has yielded significant benefits:
* Successfully scaled to processing millions of datapoints daily while maintaining quality
* Improved efficiency in quality control through automation
* Enhanced accuracy through rigorous validation rules
* Streamlined human review process
Critical Analysis:
While the case study presents impressive results, there are some aspects that deserve careful consideration:
* The reliance on LLM-as-judge evaluation could propagate model biases
* The case study doesn't detail their approach to prompt engineering or model selection
* Limited information about how they handle edge cases or conflicting information in financial documents
* No mention of their approach to model versioning or deployment
Despite these limitations, Fintool's implementation represents a solid example of LLMOps in a high-stakes domain, demonstrating how to combine automated evaluation, human oversight, and structured validation to maintain quality at scale. Their approach to continuous evaluation and quality control provides valuable insights for other organizations implementing LLMs in regulated industries.