Checkr tackled the challenge of classifying complex background check records by implementing a fine-tuned small language model (SLM) solution. They moved from GPT-4 to fine-tuned open-source models on Predibase, ultimately shipping a llama-3-8b-instruct model that achieved 90% accuracy on their most challenging cases while reducing costs by 5x and improving response times to 0.15 seconds. This solution helped automate their background check adjudication process, particularly the 2% of complex cases that required classification into 230 distinct categories.
Checkr, a background check technology company serving over 100,000 customers, presents a compelling case study in scaling LLM operations for production use in a critical business function. Their journey from traditional machine learning to advanced LLM implementations offers valuable insights into the practical challenges and solutions in LLMOps.
The company processes millions of background checks monthly, with 98% handled efficiently by traditional logistic regression models. However, the remaining 2% presented complex cases requiring classification into 230 distinct categories, which became the focus of their LLM implementation.
Their LLMOps journey can be broken down into several key phases and learnings:
Initial Implementation and Challenges:
The company started with a Deep Neural Network (DNN) approach that accurately classified only 1% of the complex cases, which led them to explore LLM-based solutions. Their requirements were particularly demanding: high accuracy for critical hiring decisions, low latency for real-time results, and cost-effective processing at a volume of millions of tokens per month.
Experimental Phase:
Their systematic approach to finding the right solution involved multiple experiments:
* First attempt with GPT-4 as an "Expert LLM" achieved 87-88% accuracy on simpler cases but only 80-82% on complex ones
* RAG implementation with GPT-4 improved accuracy to 96% on simple cases but performed worse on complex ones
* Fine-tuning Llama-2-7b yielded 97% accuracy on simple cases and 85% on complex ones
* A hybrid approach combining fine-tuned and expert models showed no additional improvements
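Comparisons like these hinge on a consistent evaluation harness that scores each candidate on the same split of simple and complex cases. The sketch below illustrates the idea; the `classify` callable and the dataset fields are illustrative assumptions, not details from Checkr's pipeline.

```python
# Minimal accuracy harness, bucketed by case difficulty. The `classify`
# callable and the example fields ("text", "label", "is_complex") are
# hypothetical stand-ins for the real pipeline.
from collections import defaultdict

def evaluate(classify, examples):
    """Return accuracy per difficulty bucket for one candidate model."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        bucket = "complex" if ex["is_complex"] else "simple"
        total[bucket] += 1
        if classify(ex["text"]) == ex["label"]:
            correct[bucket] += 1
    return {bucket: correct[bucket] / total[bucket] for bucket in total}

# e.g. evaluate(gpt4_classify, test_set) vs. evaluate(finetuned_classify, test_set)
```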
Production Implementation:
After extensive testing, they settled on Predibase as their production platform, implementing a fine-tuned llama-3-8b-instruct model. This solution achieved:
* 90% accuracy on their most challenging cases
* 0.15-second response times (30x faster than GPT-4)
* 5x cost reduction compared to GPT-4
* Efficient multi-LoRA serving capabilities for scaling to additional use cases
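As a rough illustration of what inference against such a deployment can look like, here is a hedged sketch using Predibase's Python SDK. The deployment name, adapter ID, and prompt format are assumptions, and the SDK surface may differ across versions.

```python
# Hedged sketch: querying a fine-tuned LoRA adapter on a shared Predibase
# deployment. Names and the prompt template are assumptions, not Checkr's
# actual configuration.
from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")
client = pb.deployments.client("llama-3-8b-instruct")  # shared base deployment

resp = client.generate(
    "Classify the following record into one of the 230 categories:\n<record text>",
    adapter_id="record-classifier/1",  # assumed adapter name/version
    max_new_tokens=8,                  # category labels are short
)
print(resp.generated_text)
```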
Technical Insights and Best Practices:
The team documented several valuable technical insights for LLMOps:
Model Training and Optimization:
* They found that monitoring model convergence was crucial, sometimes requiring the removal of auto-stopping parameters to avoid local minima
* Fine-tuned models showed less sensitivity to hyperparameters than expected
* They successfully implemented Parameter Efficient Fine-Tuning (PEFT) using LoRA, achieving comparable results to full fine-tuning at lower costs
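For readers unfamiliar with PEFT, the open-source `peft` library exposes the same idea: freeze the base model and train small low-rank adapter matrices. The sketch below is a generic approximation of what a managed platform configures under the hood, not Checkr's actual Predibase settings; the rank and target modules are assumed typical values.

```python
# Generic PEFT/LoRA setup with Hugging Face `peft` -- an approximation,
# not Checkr's actual training configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

lora = LoraConfig(
    r=16,                                  # adapter rank (assumed; 8-64 is typical)
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically <1% of weights are trainable
```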
Inference Optimization:
* Short prompts were found to be as effective as longer ones, enabling cost savings through reduced token usage
* They developed techniques for identifying less confident predictions by manipulating temperature and top_k parameters
* The team implemented efficient confidence scoring methods to identify predictions requiring human review
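One way to realize such a confidence signal, consistent with the temperature/top_k technique described above, is to sample the classifier several times with sampling enabled and treat label agreement as a confidence proxy. The `classify` wrapper below is hypothetical:

```python
# Confidence heuristic sketch: repeated sampling at nonzero temperature,
# with agreement across samples used as the confidence score. The
# `classify` callable is a hypothetical wrapper around the model.
from collections import Counter

def predict_with_confidence(classify, text, n_samples=5, threshold=0.8):
    """Return (label, confident); low agreement routes to human review."""
    votes = Counter(
        classify(text, temperature=0.7, top_k=40) for _ in range(n_samples)
    )
    label, count = votes.most_common(1)[0]
    return label, (count / n_samples) >= threshold
```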
Production Environment Considerations:
* They leveraged LoRAX for serving multiple LoRA adapters without additional GPU requirements (see the sketch after this list)
* Implemented comprehensive production metrics dashboards for monitoring performance
* Developed strategies for handling their large dataset of 150,000 training examples efficiently
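To illustrate the multi-LoRA pattern referenced above: LoRAX lets many adapters share a single base-model deployment, with the adapter selected per request. The endpoint URL and adapter names below are illustrative assumptions.

```python
# Sketch of multi-LoRA serving with LoRAX: two adapters, one deployment.
# The endpoint and adapter names are assumptions for illustration.
import requests

LORAX_URL = "http://localhost:8080/generate"  # assumed LoRAX endpoint

def generate(prompt: str, adapter_id: str) -> str:
    resp = requests.post(LORAX_URL, json={
        "inputs": prompt,
        "parameters": {"adapter_id": adapter_id, "max_new_tokens": 8},
    })
    resp.raise_for_status()
    return resp.json()["generated_text"]

# Two use cases served from the same GPU deployment, hot-swapped per request.
label_a = generate("Classify: <record text>", adapter_id="record-classifier/1")
label_b = generate("Classify: <dispute text>", adapter_id="dispute-classifier/1")
```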
Infrastructure and Tooling:
The production setup includes:
* Predibase's SDK for programmatic control
* Web UI for project management and version control
* Performance visualization tools
* Production metrics dashboards for monitoring system efficiency
Key Learnings and Best Practices:
Their experience yielded several valuable insights for LLMOps practitioners:
* The importance of systematic experimentation with different model architectures and approaches
* The value of efficient fine-tuning techniques like PEFT/LoRA
* The critical role of monitoring and metrics in production deployment
* The significance of balancing accuracy, latency, and cost in production systems
The case study demonstrates the practical realities of implementing LLMs in production, showing how careful experimentation, systematic optimization, and appropriate tooling choices can lead to successful outcomes. It particularly highlights the potential of fine-tuned smaller models to outperform larger models in specific use cases, while offering better economics and performance characteristics.