Faire, an e-commerce marketplace, tackled the challenge of evaluating search relevance at scale by moving from manual human labeling to automated LLM-based assessment. They first implemented a GPT-based solution and later improved on it with fine-tuned open-source Llama models. Their best-performing model, Llama3-8b, achieved a 28% improvement in relevance prediction (measured by Krippendorff's Alpha) over the previous GPT-based model, while significantly reducing costs through self-hosted inference that handles 70 million predictions per day on 16 GPUs.
# Fine-tuning and Scaling LLMs for Search Relevance at Faire
Faire, a global wholesale marketplace connecting brands and retailers, implemented a sophisticated LLM-based solution to automate and scale their search relevance evaluation system. This case study demonstrates a complete LLMOps journey from problem definition to production deployment, highlighting key technical decisions and operational considerations.
# Initial Approach and Evolution
## Manual Process to LLM Integration
- Started with human labeling through a data annotation vendor
- Developed decision trees to achieve >90% agreement among labelers
- Moved to a GPT-based solution to increase speed and reduce costs
- Finally evolved to using fine-tuned open-source Llama models for better performance and cost efficiency
## Problem Definition Framework
- Adopted the ESCI (Exact, Substitute, Complement, Irrelevant) framework (see the schema sketch after this list)
- Developed clear guidelines for edge cases and ambiguous queries
- Created comprehensive labeling guidelines to ensure consistency
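To make the framework concrete, here is a minimal sketch of how the ESCI labels and a labeled query-product pair might be represented; the class names, letter codes, and example data are illustrative assumptions, not Faire's actual schema.

```python
from enum import Enum

class ESCILabel(Enum):
    """The four ESCI relevance classes for a (query, product) pair."""
    EXACT = "E"        # product fully matches the query intent
    SUBSTITUTE = "S"   # product is a reasonable replacement
    COMPLEMENT = "C"   # product accompanies what was searched for
    IRRELEVANT = "I"   # product does not satisfy the query

# Illustrative labeled example: a retailer searching a wholesale catalog
example = {
    "query": "ceramic coffee mug",
    "product_title": "Stoneware Espresso Cup, 4 oz",
    "label": ESCILabel.SUBSTITUTE,  # related drinkware, not an exact match
}
```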
# Technical Implementation
## Model Selection and Fine-tuning
- Tested multiple Llama variants, including Llama2-13b and Llama3-8b
- Used Parameter-Efficient Fine-Tuning (PEFT) with a LoRA adapter (a training sketch follows this list)
- Applied further optimization techniques during training and inference, described in the sections below
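The write-up does not include Faire's training code, so the following is a hedged sketch of LoRA-based PEFT using the Hugging Face `peft` library; the base checkpoint, adapter rank, and target modules are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Meta-Llama-3-8B"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA: train small low-rank adapter matrices instead of all 8B weights
lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically <1% of total weights
```

The appeal of LoRA here is that only the adapter matrices are updated, so an 8B-parameter model can be fine-tuned on a handful of GPUs without touching the full weight set.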
## Training Infrastructure
- Utilized 8 A100 GPUs for training
- Implemented DeepSpeed to optimize multi-GPU training (see the configuration sketch after this list)
- Tested different dataset sizes to measure the impact of training-data volume
- Training the largest model (Llama2-13b) on the large dataset took approximately 5 hours
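As a rough illustration of how DeepSpeed plugs into this kind of run, the snippet below passes a ZeRO Stage 2 configuration to the Hugging Face `Trainer` via `TrainingArguments`; the stage choice, batch sizes, and learning rate are assumptions rather than Faire's published settings.

```python
from transformers import TrainingArguments

# DeepSpeed config as a dict; ZeRO Stage 2 shards optimizer state across GPUs
ds_config = {
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="llama-relevance-lora",
    per_device_train_batch_size=8,   # spread across 8 A100s (assumed)
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
    deepspeed=ds_config,             # Trainer launches the DeepSpeed engine
)
```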
## Production Deployment
### Inference Optimization
- Implemented 8-bit model quantization (see the inference sketch after this list)
- Utilized batch processing on A100 GPUs
- Deployed DeepSpeed for improved inference speed
- Implemented horizontal scaling across GPU instances
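The post describes 8-bit quantization plus batched generation; below is a minimal sketch of that pattern using `bitsandbytes` quantization through Hugging Face `transformers` (Faire's stack reportedly uses DeepSpeed for the inference speedups). The model path, batch handling, and generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "path/to/fine-tuned-llama-relevance"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # Llama lacks a pad token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    device_map="auto",
)

def classify_batch(prompts: list[str]) -> list[str]:
    """Run one left-padded batch through the model and decode the label tokens."""
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=4, do_sample=False)
    generated = out[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```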
### Performance Metrics
- Achieved a 28% improvement in Krippendorff's Alpha over the previous GPT-based model (see the agreement sketch after this list)
- Llama3-8b performed best, matching Llama2-13b's quality with better efficiency
- Successfully scaled to 70 million predictions per day on 16 GPUs
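For scale, 70 million predictions per day across 16 GPUs works out to roughly 4.4 million predictions per GPU per day, or about 50 per GPU per second, which is only feasible with the batching and quantization described above. Krippendorff's Alpha, the agreement metric behind the 28% figure, can be computed with the open-source `krippendorff` package; the ordinal encoding of ESCI labels below is an assumption.

```python
import krippendorff
import numpy as np

# Rows = raters (human annotator, LLM); columns = query-product pairs.
# ESCI labels mapped to ordinal codes: E=3, S=2, C=1, I=0 (assumed encoding).
ratings = np.array([
    [3, 2, 0, 1, 3, 2],   # human labels
    [3, 2, 0, 0, 3, 2],   # model predictions
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```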
# Key Learnings and Best Practices
## Model Selection Insights
- Fine-tuned open-source LLMs demonstrated strong performance
- Larger datasets proved more important than model size
- Llama3-8b achieved optimal balance of performance and efficiency
- Basic prompt engineering alone proved insufficient for domain-specific tasks
## Operational Considerations
- Self-hosting reduced operational costs significantly
- Batch processing crucial for high-throughput requirements
- GPU optimization techniques essential for production deployment
- Clear problem definition and high-quality labeled data critical for success
## Data Management
- Dataset size and composition significantly impact model performance
- Incremental improvements in data quality led to better results
- Balanced dataset creation proved crucial for model reliability (see the sampling sketch after this list)
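A hedged sketch of what balanced dataset creation might look like in pandas; the column name, sample counts, and helper function are hypothetical.

```python
import pandas as pd

def balance_by_label(df: pd.DataFrame, per_class: int, seed: int = 42) -> pd.DataFrame:
    """Downsample each ESCI class to at most the same number of examples."""
    return (
        df.groupby("label", group_keys=False)
          .apply(lambda g: g.sample(n=min(per_class, len(g)), random_state=seed))
          .reset_index(drop=True)
    )

# e.g. balanced = balance_by_label(labeled_pairs, per_class=25_000)
```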
# Production Results and Impact
## System Capabilities
- Processes tens of millions of query-product pairs daily
- Enables daily relevance measurements vs. previous monthly cadence
- Provides near real-time feedback on search algorithm performance
## Applications
- Offline retrieval analysis
- Personalization measurement
- Experimental contribution assessment
- Ranker optimization to balance engagement and relevance
# Future Developments
## Planned Improvements
- Exploring real-time inference implementation
- Investigating model distillation for lower latency
- Considering RAG techniques for improved domain context
- Evaluating multimodal LLMs like LLaVA for image processing
## Technical Roadmap
- Working on reducing inference costs
- Developing explainability features for relevance decisions
- Investigating chain-of-thought reasoning for performance improvement
This case study exemplifies a comprehensive LLMOps implementation, showing how careful consideration of technical choices, infrastructure setup, and operational requirements can lead to a successful production deployment of LLMs for business-critical applications.