Klarity, a document processing automation company, transformed its approach to evaluating LLM systems in production as it moved from traditional ML to generative AI. The company processes over half a million documents for B2B SaaS customers, primarily in complex finance and accounting workflows. This case study looks at how Klarity built robust evaluation frameworks for LLM-powered document processing, focusing on non-deterministic performance, rapid feature development, and the gap between benchmark scores and real-world results.
# Company Background and Evolution
Klarity, founded in 2016, focuses on automating back-office workflows that traditionally required large teams of offshore workers. The company went through multiple pivots before finding strong product-market fit with finance and accounting teams. It subsequently "replatformed" around generative AI, which led to significant growth and a $70 million Series B funding round. Klarity now processes more than half a million documents and maintains 15+ unique LLM use cases in production, often with multiple LLMs working together within a single use case.
# Technical Implementation and Challenges
The company's journey from traditional ML to generative AI highlighted several key challenges in LLM evaluation:
* Non-deterministic Performance: The same PDF processed multiple times could yield different responses, making consistent evaluation difficult (a minimal consistency-check sketch follows this list).
* Complex User Experiences: New features like natural language analytics and business requirements document generation created novel evaluation challenges.
* Rapid Development Cycles: The ability to ship features within days (compared to previous 5-6 month cycles) made traditional evaluation approaches impractical.
* Benchmark Limitations: Standard benchmarks like MMLU didn't consistently correlate with real-world performance.
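One way to make the non-determinism above measurable is to run the same document through the pipeline several times and score how often each extracted field agrees with its modal value. The sketch below is a minimal illustration of that idea under assumptions, not Klarity's implementation; `extract_fields` and the idea of per-field extraction are hypothetical stand-ins for whatever the real pipeline returns.

```python
from collections import Counter


def extract_fields(pdf_path: str) -> dict:
    """Hypothetical wrapper around an LLM extraction call.

    Stands in for whatever pipeline turns a document into
    structured fields (e.g. invoice totals, contract dates).
    """
    raise NotImplementedError("plug in your extraction pipeline here")


def consistency_report(pdf_path: str, n_runs: int = 5) -> dict:
    """Process the same document n_runs times and report, per field,
    the fraction of runs that agree with the most common value."""
    runs = [extract_fields(pdf_path) for _ in range(n_runs)]
    fields = set().union(*(run.keys() for run in runs))
    report = {}
    for field in fields:
        values = [str(run.get(field)) for run in runs]
        _, modal_count = Counter(values).most_common(1)[0]
        report[field] = modal_count / n_runs  # 1.0 means fully consistent
    return report
```

A report like this can gate releases (e.g. require a minimum consistency score per field) without pretending any single run is the "true" output.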
# Evaluation Strategy and Solutions
Klarity developed several innovative approaches to address these challenges:
## Staged Evaluation Approach
* Front-loading user testing and experience validation
* Treating each generative AI feature as requiring its own product-market fit
* Backloading comprehensive evaluation development
* Moving from user experience backwards to define evaluation metrics
## Customer-Specific Implementation
* Creating custom annotations for each customer's specific use case
* Developing use-case specific accuracy metrics (see the sketch after this list)
* Implementing comprehensive data drift monitoring
* Building their own synthetic data generation stack after finding existing solutions inadequate
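A concrete way to express "use-case specific accuracy" is to score predictions against the per-customer annotations mentioned above. The snippet below is a hedged sketch rather than Klarity's code: the annotation format, field names, and `normalize` rules are all assumptions.

```python
def normalize(value: str) -> str:
    """Illustrative normalization; real rules depend on the field type
    (currency rounding, date formats, customer-specific conventions)."""
    return str(value).strip().lower()


def field_accuracy(predictions: list[dict], annotations: list[dict]) -> dict[str, float]:
    """Per-field accuracy of predicted extractions against a customer's
    gold annotations (one dict per document, aligned by position)."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for pred, gold in zip(predictions, annotations):
        for field, gold_value in gold.items():
            total[field] = total.get(field, 0) + 1
            if normalize(pred.get(field, "")) == normalize(gold_value):
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / total[field] for field in total}
```

Because the metric is computed per customer and per field, the same harness can surface drift (a field whose accuracy degrades over time) without requiring a single global benchmark.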
## Architectural Decisions
* Reducing complexity by limiting degrees of freedom in their architecture
* Using dedicated LLMs for specific tasks rather than attempting to optimize across all possible combinations (illustrated in the sketch after this list)
* Maintaining "scrappy" evaluations for potential future use cases
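"Limiting degrees of freedom" can be as simple as pinning each task to one model and prompt version, so evaluation only has to cover the configurations actually in production rather than every model/prompt combination. The registry below is purely illustrative; the task names and model identifiers are assumptions, not Klarity's actual stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConfig:
    model: str           # which LLM serves this task
    prompt_version: str  # pinned prompt, evaluated together with the model
    temperature: float


# Hypothetical registry: each task gets exactly one pinned configuration.
TASK_REGISTRY: dict[str, TaskConfig] = {
    "invoice_extraction": TaskConfig(model="model-a", prompt_version="v3", temperature=0.0),
    "contract_summary":   TaskConfig(model="model-b", prompt_version="v1", temperature=0.2),
    "nl_analytics_query": TaskConfig(model="model-a", prompt_version="v2", temperature=0.0),
}


def config_for(task: str) -> TaskConfig:
    """Fail loudly for an unregistered task rather than falling back
    to an unevaluated model/prompt combination."""
    return TASK_REGISTRY[task]
```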
# Production Implementation Details
The company employs several key strategies in their production environment:
* Automated Prompt Engineers (APE) to handle custom prompts across different features and customers (see the sketch after this list)
* Comprehensive monitoring systems for data drift and model performance
* A mix of customer-specific and general evaluation metrics
* Synthetic data generation and validation pipelines
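Serving custom prompts per feature and customer, as the APE item above implies, requires some resolution scheme for which prompt a given request should use. The sketch below shows one plausible resolution order (customer-specific prompt first, then the feature default); the storage layout and names are assumptions, not a description of Klarity's system.

```python
from pathlib import Path

# Hypothetical layout: prompts/<feature>/<customer>.txt with a default.txt fallback
PROMPT_ROOT = Path("prompts")


def resolve_prompt(feature: str, customer: str) -> str:
    """Prefer a customer-specific prompt for this feature (e.g. one produced
    by an automated prompt-engineering run), fall back to the feature default."""
    candidates = [
        PROMPT_ROOT / feature / f"{customer}.txt",
        PROMPT_ROOT / feature / "default.txt",
    ]
    for path in candidates:
        if path.exists():
            return path.read_text()
    raise FileNotFoundError(f"No prompt found for feature={feature!r}")
```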
# Key Learnings and Best Practices
Several important insights emerged from their experience:
* The importance of accepting imperfection in early evaluation stages
* The need to balance comprehensive evaluation with development speed
* The value of reducing complexity in architectural decisions
* The importance of maintaining forward-looking evaluation capabilities
# Challenges and Limitations
The case study reveals several ongoing challenges:
* The difficulty of fully automating evaluation in a quantitative way
* The balance between perfect optimization and practical implementation
* The challenge of maintaining evaluation quality while scaling rapidly
* The complexity of handling multiple LLMs and use cases simultaneously
# Future Directions
Klarity's approach to LLM evaluation continues to evolve, with focus areas including:
* Further development of synthetic data generation capabilities
* Expansion of evaluation frameworks to new use cases
* Continued refinement of their rapid development and evaluation pipeline
* Exploration of new ways to balance speed and quality in evaluation
This case study demonstrates the practical challenges and solutions in implementing LLM-based systems in production, particularly in highly regulated and complex domains like finance and accounting. It highlights the importance of pragmatic approaches to evaluation while maintaining high standards for production deployments.