Company
Convirza
Title
Multi-LoRA Serving for Agent Performance Analysis at Scale
Industry
Tech
Year
2024
Summary (short)
Convirza, facing challenges with their customer service agent evaluation system, transitioned from Longformer models to fine-tuned Llama-3-8b using Predibase's multi-LoRA serving infrastructure. This shift enabled them to process millions of call hours while reducing operational costs by 10x compared to OpenAI, achieving an 8% improvement in F1 scores, and increasing throughput by 80%. The solution allowed them to efficiently serve over 60 performance indicators across thousands of customer interactions daily while maintaining sub-second inference times.
This case study provides an in-depth look at how Convirza, a company founded in 2001 that provides an AI-powered software platform, transformed its agent performance evaluation system using modern LLMOps practices and infrastructure. The company processes millions of customer service calls monthly to generate performance scorecards and insights, making it a significant real-world application of LLMs in production.

Initial Technical Implementation and Challenges: Initially, Convirza relied on Longformer models trained for 60 specific performance indicators. This setup presented several significant operational challenges:

* Training iterations were extremely slow, taking 9-24+ hours per iteration
* Infrastructure costs were unsustainable, ranging from $500-$1,500 per indicator monthly
* The system required dedicated compute resources for each model
* Total inference costs reached hundreds of thousands of dollars annually
* Limited ability to deploy custom indicators and respond to client needs quickly

These challenges led the team to explore more efficient alternatives, specifically fine-tuning Small Language Models (SLMs) to improve iteration speed while maintaining cost-effectiveness.

Technical Solution Implementation: Convirza adopted Predibase's platform, implementing a multi-LoRA serving infrastructure with several key technical components:

* Base Model: Meta's Llama-3-8b serves as the foundation
* LoRA Adapters: Individual adapters for each of the 60 performance indicators
* Infrastructure: Predibase's LoRA eXchange (LoRAX) for multi-LoRA serving
* Autoscaling: GPU infrastructure capable of handling variable workloads with minimal cold-start times

The technical architecture demonstrates several important LLMOps best practices:

* Resource Sharing: Multiple LoRA adapters share a single base model, significantly reducing infrastructure requirements and costs
* Efficient Scaling: Autoscaling GPU infrastructure manages load spikes effectively
* Performance Optimization: Maintained sub-second mean inference time despite high-volume processing
* Model Efficiency: Achieved higher accuracy with smaller models through effective fine-tuning

Technical Results and Performance Metrics: The implementation delivered significant measurable improvements:

* Cost Efficiency: 10x reduction in operational costs compared to using OpenAI's models
* Accuracy: 8% increase in F1 scores compared to OpenAI solutions
* Processing Speed: 80% higher throughput than OpenAI and 4x higher than the previous Longformer implementation
* Response Time: Maintained sub-second mean inference time even during peak loads
* Scalability: Successfully handled workload spikes requiring up to double-digit numbers of A100 GPUs

Infrastructure Management and Scaling: A particularly noteworthy aspect of this case study is the infrastructure management approach. The autoscaling system handles variable workloads while maintaining consistent performance, which is especially crucial given the requirement to scale up to multiple A100 GPUs during peak loads while keeping response times under two seconds. The LoRA eXchange (LoRAX) infrastructure demonstrates an elegant solution to the common challenge of serving multiple fine-tuned models efficiently: instead of deploying separate instances for each model variant, the system maintains a single base model with multiple LoRA adapters, significantly reducing resource requirements and operational complexity.
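To make this serving pattern concrete, the sketch below shows how a single LoRAX endpoint hosting a shared Llama-3-8b base model could be queried with a different LoRA adapter per performance indicator, using the open-source LoRAX Python client. The endpoint URL, adapter names, prompt template, and indicator list are illustrative assumptions, not details disclosed by Convirza or Predibase.

```python
# Minimal sketch: scoring one call transcript against several indicator-specific
# LoRA adapters served from a single shared base model via LoRAX.
# Endpoint, adapter names, and prompt wording are hypothetical placeholders.
from lorax import Client

client = Client("http://localhost:8080")  # LoRAX server hosting the base model

transcript = "Agent: Thanks for calling, how can I help you today? ..."

# One fine-tuned adapter per performance indicator (hypothetical identifiers).
indicator_adapters = {
    "empathy": "convirza/empathy-indicator-lora",
    "active_listening": "convirza/active-listening-lora",
    "call_resolution": "convirza/call-resolution-lora",
}

scores = {}
for indicator, adapter_id in indicator_adapters.items():
    prompt = (
        f"Evaluate the agent's {indicator.replace('_', ' ')} on this call.\n"
        f"Transcript:\n{transcript}\n"
        "Answer with a score from 1 to 5:"
    )
    # adapter_id tells LoRAX which LoRA weights to apply on top of the base
    # model for this request; adapters are swapped per request rather than
    # deployed as separate model instances.
    response = client.generate(prompt, adapter_id=adapter_id, max_new_tokens=8)
    scores[indicator] = response.generated_text.strip()

print(scores)
```

Because every request reuses the same base model weights, adding a new indicator amounts to training and registering another adapter rather than provisioning a separate deployment, which is the resource-sharing property the case study credits for much of the cost reduction.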
Critical Analysis and Limitations: While the case study presents impressive results, it's important to note several considerations:

* The comparison benchmarks are primarily against OpenAI's models, and less information is provided about comparisons with other potential solutions
* The specific details of the LoRA fine-tuning process and hyperparameters are not disclosed
* The case study doesn't detail the specific challenges or limitations encountered with the new system

Future Implications: This implementation demonstrates several important trends in LLMOps:

* The shift towards smaller, fine-tuned models instead of larger API-based solutions
* The growing importance of efficient multi-model serving infrastructure
* The critical role of autoscaling in managing variable workloads
* The potential for significant cost savings through optimized infrastructure

Technical Sustainability: The solution appears to be technically sustainable, with several key factors contributing to its long-term viability:

* Cost-effective scaling through the shared base model architecture
* Efficient resource utilization through multi-LoRA serving
* Flexible adaptation to varying workloads through autoscaling
* Maintainable infrastructure through managed services

This case study represents a significant example of successful LLMOps implementation in a production environment, demonstrating how careful attention to infrastructure design and model serving strategies can lead to substantial improvements in both performance and cost-effectiveness.
