This case study examines how Vendr, in partnership with Extend, built a production-scale document processing system that uses LLMs to extract structured data from SaaS order forms and contracts. It draws on a presentation in which speakers from both companies share their experiences and technical approaches.
## Overall System Context and Goals
Vendr's primary challenge was to extract structured data from a large volume of unstructured documents, particularly SaaS order forms and contracts. These documents contained valuable information about pricing, terms, and purchased items that could provide insights to their customers. Rather than building this capability in-house, they partnered with Extend, which provided a platform that allowed Vendr's data science team to focus on prompt engineering, data correction, and downstream analysis.
## Technical Architecture and Implementation
Extend's platform approach to document processing emphasizes several key architectural decisions and best practices:
### Pre-processing Pipeline
The system implements a pre-processing pipeline that handles a range of document formats and quality issues. Rather than relying solely on LLMs, the pipeline uses traditional OCR techniques for initial document preparation, including:
* Document rotation and skew correction
* Noise removal
* Document cropping
* Handling multi-page documents and complex tables
* Managing different image formats and resolutions
### Model Selection and Optimization
The platform takes a pragmatic approach to model selection:
* Default to OpenAI or Anthropic models as a starting point
* Evaluate different models based on specific requirements (accuracy, latency, cost)
* Use evaluation sets to measure performance and detect regressions
* Combine multimodal and text-only models where appropriate
* Consider carefully when to use fine-tuning versus prompt engineering
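
To make the evaluation-set idea concrete, here is a minimal sketch of a field-level accuracy check with a regression guard. It is not Extend's implementation: the `EvalCase` shape, the exact-match scoring, and the `tolerance` value are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical evaluation case: document text paired with the expected fields.
EvalCase = tuple[str, dict[str, str]]
Extractor = Callable[[str], dict[str, str]]


def field_accuracy(extract: Extractor, cases: list[EvalCase]) -> float:
    """Fraction of expected fields that the extractor reproduces exactly."""
    correct = 0
    total = 0
    for text, expected in cases:
        predicted = extract(text)
        for field, value in expected.items():
            total += 1
            if predicted.get(field) == value:
                correct += 1
    return correct / total if total else 0.0


def is_regression(baseline: float, candidate: float, tolerance: float = 0.01) -> bool:
    """True when the candidate scores more than `tolerance` below the baseline."""
    return candidate < baseline - tolerance
```

Running the same evaluation set against each candidate model or prompt version gives a comparable score, and `is_regression` can gate a change before it reaches production. Latency and cost would need their own measurements alongside accuracy.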
### Quality Control and Error Handling
The system implements multiple layers of quality control:
* Model confidence estimation using token log probabilities
* Teacher-student model arrangements for cross-validation
* Data validation rules (e.g., checking if line items sum to totals)
* Human-in-the-loop review system for uncertain cases
* Tracking of corrections and reviews for continuous improvement
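
Two of these layers are easy to illustrate. The sketch below derives a confidence score from token log probabilities (here the geometric mean of token probabilities, which is one common choice and an assumption on our part) and applies the line-items-sum-to-total rule. The function names, the `min_confidence` threshold, and the rounding tolerance are hypothetical.

```python
import math


def sequence_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability: exp of the average log probability."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def line_items_match_total(
    line_items: list[float], total: float, tolerance: float = 0.01
) -> bool:
    """Validation rule: line items should sum to the stated total."""
    return abs(sum(line_items) - total) <= tolerance


def needs_human_review(
    token_logprobs: list[float],
    line_items: list[float],
    total: float,
    min_confidence: float = 0.9,  # assumed threshold, tune on an evaluation set
) -> bool:
    """Route to review when confidence is low or a validation rule fails."""
    low_confidence = sequence_confidence(token_logprobs) < min_confidence
    return low_confidence or not line_items_match_total(line_items, total)
```

An extraction that passes both checks could skip review, while anything else lands in the human-in-the-loop queue.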
## Entity Recognition and Mapping
A significant portion of the case study focuses on entity recognition and mapping, which proved to be one of the most challenging aspects. The system needs to identify and correctly map:
* Customers (purchasing entities)
* Sellers (direct software companies or resellers)
* Suppliers (software brands/products)
The entity mapping process involves:
* Initial LLM extraction of entity names
* Mapping extracted names to a canonical catalog
* Handling variations in entity names and corporate structure changes
* Managing edge cases and ambiguous situations
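
As an illustration of the catalog-mapping step, the following sketch normalizes an extracted name and picks the closest canonical entry by string similarity. The case study does not describe the actual matching method, so the suffix list, the use of `difflib.SequenceMatcher`, and the `threshold` are assumptions. Corporate structure changes such as acquisitions or renames would need an alias table that this sketch does not include.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical list of corporate suffixes to ignore when comparing names.
_SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation", "co", "gmbh"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop common corporate suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in _SUFFIXES)


def map_to_catalog(
    extracted: str, catalog: list[str], threshold: float = 0.85
) -> Optional[str]:
    """Return the best-matching canonical name, or None if nothing is close enough."""
    target = normalize(extracted)
    best_name: Optional[str] = None
    best_score = 0.0
    for canonical in catalog:
        score = SequenceMatcher(None, target, normalize(canonical)).ratio()
        if score > best_score:
            best_name, best_score = canonical, score
    return best_name if best_score >= threshold else None
```

Returning `None` for weak matches keeps ambiguous names out of the catalog and leaves them for a human reviewer to resolve.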
## Innovation in Review Processes
One of the most interesting aspects of the implementation is how the team made document review more efficient:
### Document Similarity Analysis
* Used OpenAI embeddings to calculate document similarity
* Embedded different lengths of document text for different purposes (e.g., 450 words for quick comparisons)
* Used similarity clustering to identify potential misclassifications
* Detected format changes and document type variations
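
The similarity check can be sketched as follows. The code assumes embedding vectors have already been computed (for example with an embeddings API) and are grouped by the label the system assigned, such as one supplier's documents. The median-based outlier rule and the `min_similarity` threshold are illustrative assumptions, not the method described by the speakers.

```python
import math
from statistics import median


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flag_outliers(
    embeddings: dict[str, list[float]], min_similarity: float = 0.8
) -> list[str]:
    """Return IDs of documents that look unlike the rest of their group.

    A document is flagged when its median similarity to the other
    documents in the group falls below `min_similarity`.
    """
    flagged = []
    for doc_id, vec in embeddings.items():
        sims = [
            cosine_similarity(vec, other)
            for other_id, other in embeddings.items()
            if other_id != doc_id
        ]
        if sims and median(sims) < min_similarity:
            flagged.append(doc_id)
    return flagged
```

A flagged document is a candidate misclassification or a sign that a vendor changed its form layout, so it is worth sending to review rather than correcting automatically.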
### Affirmation System
* Implemented a system to track human-verified data
* Used visual indicators (green checkmarks) for verified information
* Protected verified data from being overwritten by subsequent LLM processing
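
The overwrite protection can be illustrated with a small merge routine. This is a sketch under assumed data shapes: `FieldValue`, its `affirmed` flag, and `merge_extraction` are hypothetical names, not part of Vendr's or Extend's systems.

```python
from dataclasses import dataclass


@dataclass
class FieldValue:
    value: str
    affirmed: bool = False  # True once a human reviewer has verified the value


def merge_extraction(
    current: dict[str, FieldValue], new_values: dict[str, str]
) -> dict[str, FieldValue]:
    """Apply a fresh LLM extraction without touching human-affirmed fields."""
    merged = dict(current)  # shallow copy so the stored record is not mutated
    for name, value in new_values.items():
        existing = merged.get(name)
        if existing is not None and existing.affirmed:
            continue  # keep the human-verified value
        merged[name] = FieldValue(value)
    return merged
```

With this arrangement, documents can be reprocessed after a prompt or model change while reviewer work is preserved, and the `affirmed` flag is what a UI would render as the green checkmark.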
## Continuous Improvement and Maintenance
The system includes several mechanisms for ongoing improvement:
* Monitoring of production data patterns
* Regular updates to evaluation sets with new edge cases
* Iterative prompt optimization before considering fine-tuning
* Clear rubrics for handling ambiguous cases and changes
## Results and Impact
The system has processed more than 100,000 documents, with results reported as significantly better than traditional human coding. Combining LLMs with targeted human review has proven both efficient and accurate. Some challenges remain, particularly around corporate actions and name changes over time.
## Lessons Learned
Key takeaways from the case study include:
* The importance of not treating LLMs as a silver bullet
* The value of combining traditional OCR with modern LLM approaches
* The critical role of human review in maintaining data quality
* The effectiveness of using document embeddings for quality control
* The importance of clear rubrics and guidelines for consistent data extraction
This case study provides valuable insights into implementing LLMs in production for document processing, highlighting both the potential and limitations of current technology while demonstrating practical approaches to building robust, production-grade systems.