Convirza transformed its call center analytics platform from traditional large language models to small language models (specifically Llama 3B) with adapter-based fine-tuning. By partnering with Predibase, the company achieved a 10x cost reduction compared to OpenAI while improving accuracy by 8% and throughput by 80%. The system analyzes millions of calls monthly, extracting hundreds of custom indicators for agent performance and caller behavior, with sub-0.1-second inference times using efficient multi-adapter serving on single GPUs.
Convirza is an interesting example of how AI for call center analytics has evolved, showing how LLMOps practices have matured from simple implementations into sophisticated, cost-effective production systems. The company has been in business since 2001, starting with analog recording devices and human analysis, and has since transformed into an AI-driven enterprise that processes millions of calls monthly.
The company's LLM journey is particularly noteworthy, as it reflects the rapid evolution of language model technology and deployment strategies. Their AI stack transformation can be broken down into several key phases:
**Initial ML Implementation (2014-2019)**
* Traditional AWS-based infrastructure using SageMaker
* Deployment of over 60 different models for data extraction and classification
* Introduction of BERT in 2019 as their first language model implementation
**Evolution to Larger Models (2021)**
* Transition to Longformer for improved context handling
* Challenges with long training times (hours to days per model)
* Complex infrastructure management with individual auto-scaling deployments for each model
**Current Architecture and Innovation (2024)**
The most interesting aspect of Convirza's current implementation is their innovative approach to efficient LLM deployment:
* Adoption of Llama 3B (3 billion parameters) as their base model
* Implementation of LoRA adapters for efficient fine-tuning
* Partnership with Predibase for infrastructure management
* Successful deployment of 60+ adapters on a single GPU (see the serving sketch after this list)
* Achievement of 0.1-second inference times, significantly beating their 2-second target
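To make the multi-adapter pattern concrete, here is a minimal sketch of what a per-indicator request might look like against a LoRAX-style multi-adapter server (LoRAX is Predibase's open-source serving framework). The endpoint URL, adapter name, prompt format, and exact request/response fields are assumptions for illustration, not Convirza's production API.

```python
import requests

# Hypothetical endpoint for a LoRAX-style multi-adapter server hosting the
# shared Llama base model; each indicator lives in its own LoRA adapter.
LORAX_URL = "http://localhost:8080/generate"  # assumed local deployment

def score_indicator(transcript: str, question: str, adapter_id: str) -> str:
    """Ask one indicator's adapter a yes/no question about a call transcript."""
    payload = {
        "inputs": f"Transcript:\n{transcript}\n\nQuestion: {question}\nAnswer yes or no.",
        "parameters": {
            # Per-request adapter selection: the server hot-swaps this LoRA
            # onto the shared base model, so dozens of indicators can be
            # served from a single GPU without separate deployments.
            "adapter_id": adapter_id,
            "max_new_tokens": 4,
        },
    }
    resp = requests.post(LORAX_URL, json=payload, timeout=5)
    resp.raise_for_status()
    return resp.json()["generated_text"].strip()

# Example with a hypothetical adapter name:
# score_indicator(transcript, "Did the agent greet the caller properly?",
#                 "convirza/proper-greeting-v3")
```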
**Technical Implementation Details**
The system architecture demonstrates several sophisticated LLMOps practices:
*Training Pipeline:*
* Streamlined data preparation process with versioning
* Fine-tuning jobs scheduled through Predibase
* Careful hyperparameter optimization for LoRA (rank, learning rate, target modules); a configuration sketch follows this list
* Evaluation pipeline using unseen datasets for quality assurance
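The LoRA hyperparameters called out above (rank, learning rate, target modules) map directly onto a standard PEFT configuration. The sketch below uses Hugging Face PEFT with placeholder values and an assumed small Llama base model; it is not Convirza's or Predibase's actual training code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base_model_name = "meta-llama/Llama-3.2-3B"  # assumed small Llama base model

model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=16,                                 # adapter rank (placeholder value)
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which attention projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trained

# These arguments would feed a Trainer/SFTTrainer run; dataset loading omitted.
training_args = TrainingArguments(
    output_dir="adapters/proper-greeting-v3",  # one adapter per indicator
    learning_rate=2e-4,
    num_train_epochs=3,
    per_device_train_batch_size=8,
)
```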
*Deployment Strategy:*
* Configuration-based deployment system
* Support for A/B testing and canary releases (a routing sketch follows this list)
* Ability to run multiple model versions simultaneously without additional cost, since each version is just another adapter on the shared base model
* Hybrid setup with some GPU instances in their VPC and additional scale provided by Predibase
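One way to picture configuration-based deployment with canary releases is a routing table from indicator to adapter versions; because every version is a LoRA adapter on the shared base model, serving a canary alongside the stable version adds no extra GPU cost. The config keys, adapter names, and traffic split below are hypothetical.

```python
import random

# Hypothetical deployment config: each indicator maps to a stable adapter and,
# optionally, a canary adapter that receives a fraction of traffic.
DEPLOYMENT_CONFIG = {
    "proper_greeting": {
        "stable": "convirza/proper-greeting-v3",
        "canary": "convirza/proper-greeting-v4",
        "canary_fraction": 0.05,
    },
    "lead_quality": {
        "stable": "convirza/lead-quality-v7",
        "canary": None,
        "canary_fraction": 0.0,
    },
}

def pick_adapter(indicator: str) -> str:
    """Route a request to the stable or canary adapter for this indicator."""
    cfg = DEPLOYMENT_CONFIG[indicator]
    if cfg["canary"] and random.random() < cfg["canary_fraction"]:
        return cfg["canary"]
    return cfg["stable"]
```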
*Monitoring and Observability:*
* Comprehensive monitoring of throughput and latency
* Data drift detection systems (a simple drift check is sketched after this list)
* Integration between Predibase dashboards and internal monitoring
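As a rough illustration of what a data drift check on indicator outputs could look like, the sketch below computes a population stability index between a reference window and recent scores. The metric choice, threshold, and hook name are assumptions, not details from the case study.

```python
import numpy as np

def population_stability_index(reference: np.ndarray,
                               current: np.ndarray,
                               bins: int = 10) -> float:
    """Compare recent indicator scores against a reference window.

    A PSI above ~0.2 is a common rule of thumb for meaningful drift; the
    threshold here is a placeholder, not Convirza's actual alerting rule.
    """
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Avoid log(0) on empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Example: flag the "lead quality" indicator for review if it drifts.
# if population_stability_index(last_month_scores, this_week_scores) > 0.2:
#     trigger_retraining_review("lead_quality")  # hypothetical hook
```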
*Performance Metrics:*
* 10x cost reduction compared to OpenAI
* 8% improvement in F1 score accuracy
* 80% higher throughput
* Sub-0.1 second inference times
* Ability to handle hundreds of inferences per second
* Rapid scaling capability (under one minute for new nodes)
**Business Impact and Use Cases**
The system analyzes calls for multiple aspects (a fan-out sketch follows the list):
* Agent Performance Metrics:
  * Proper greeting procedures
  * Business offering techniques
  * Appointment scheduling effectiveness
  * Customer engagement quality
  * Custom metrics (like their "give a crap" indicator)
* Caller Analysis:
  * Buying signals
  * Lead quality scoring
  * Prospect classification
  * Customer service urgency detection
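Putting the pieces together, a single transcript can be fanned out across many per-indicator adapters in parallel, which is how hundreds of custom indicators and hundreds of inferences per second fit on the same serving stack. The sketch below reuses the hypothetical `score_indicator` helper from the serving sketch earlier; the indicator names and questions are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical mapping of indicators to adapters and yes/no questions;
# Convirza tracks hundreds of these, only a handful are shown here.
INDICATORS = {
    "proper_greeting": ("convirza/proper-greeting-v3",
                        "Did the agent greet the caller properly?"),
    "appointment_set": ("convirza/appointment-set-v2",
                        "Was an appointment scheduled on this call?"),
    "buying_signal": ("convirza/buying-signals-v5",
                      "Did the caller express intent to purchase?"),
}

def analyze_call(transcript: str) -> dict:
    """Fan one transcript out across all indicator adapters in parallel."""
    def run(item):
        name, (adapter_id, question) = item
        return name, score_indicator(transcript, question, adapter_id)

    with ThreadPoolExecutor(max_workers=8) as pool:
        return dict(pool.map(run, INDICATORS.items()))
```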
A notable case study with Wheeler Caterpillar demonstrated a 78% conversion increase within 90 days of implementation.
**Challenges and Solutions**
The team faced several significant challenges in their LLMOps implementation:
*Scale and Cost Management:*
* Challenge: Handling unpredictable call volumes and a growing number of indicators
* Solution: Implementation of efficient adapter-based architecture with dynamic scaling
*Accuracy and Consistency:*
* Challenge: Maintaining high accuracy across hundreds of different indicators
* Solution: Use of smaller, more focused models with high-quality, curated training data
*Infrastructure Complexity:*
* Challenge: Managing multiple independent model deployments
* Solution: Consolidation onto single-GPU multi-adapter serving architecture
**Future Directions and Lessons Learned**
The case study demonstrates several important lessons for LLMOps implementations:
* Smaller, well-tuned models can outperform larger models in specific domains
* Adapter-based architectures can significantly reduce operational costs
* Model complexity must be balanced against practical deployment considerations
* Infrastructure partnerships can offload much of the operational management burden
This implementation showcases how careful consideration of model architecture, deployment strategy, and infrastructure management can create a highly efficient, scalable LLM system in production. The success of using smaller models with adapter-based fine-tuning challenges the common assumption that bigger models are always better, particularly in specialized domains with specific requirements.