Faire, an e-commerce marketplace, tackled the challenge of evaluating search relevance at scale by moving from manual human labeling to automated LLM-based assessment. They first implemented a GPT-based solution and later improved on it with fine-tuned open-source Llama models. Their best-performing model, Llama3-8b, achieved a 28% improvement in relevance prediction (measured by Krippendorff's Alpha) over the previous GPT-based model, while significantly reducing costs through self-hosted inference that handles 70 million predictions per day on 16 GPUs.
# Fine-tuning and Scaling LLMs for Search Relevance at Faire
Faire, a global wholesale marketplace connecting brands and retailers, implemented a sophisticated LLM-based solution to automate and scale their search relevance evaluation system. This case study demonstrates a complete LLMOps journey from problem definition to production deployment, highlighting key technical decisions and operational considerations.
# Initial Approach and Evolution
## Manual Process to LLM Integration
- Started with human labeling through a data annotation vendor
- Developed decision trees to achieve >90% agreement among labelers
- Moved to a GPT-based solution to increase speed and reduce costs
- Finally evolved to using fine-tuned open-source Llama models for better performance and cost efficiency
## Problem Definition Framework
- Adopted the ESCI (Exact, Substitute, Complement, Irrelevant) framework (see the schema sketch after this list)
- Developed clear guidelines for edge cases and ambiguous queries
- Created comprehensive labeling guidelines to ensure consistency
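To make the framework concrete, here is a minimal sketch of how the ESCI labels and a labeled query-product pair might be represented; the class names, letter codes, and example data are illustrative assumptions, not Faire's actual schema.

```python
from enum import Enum

class ESCILabel(Enum):
    """The four ESCI relevance classes for a (query, product) pair."""
    EXACT = "E"        # product fully matches the query intent
    SUBSTITUTE = "S"   # product is a reasonable replacement
    COMPLEMENT = "C"   # product accompanies what was searched for
    IRRELEVANT = "I"   # product does not satisfy the query

# Illustrative labeled example: a retailer searching a wholesale catalog
example = {
    "query": "ceramic coffee mug",
    "product_title": "Stoneware Espresso Cup, 4 oz",
    "label": ESCILabel.SUBSTITUTE,  # related drinkware, not an exact match
}
```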
# Technical Implementation
## Model Selection and Fine-tuning
- Tested multiple Llama variants, including Llama2-13b and Llama3-8b
- Used Parameter-Efficient Fine-Tuning (PEFT) with a LoRA adapter (a training sketch follows this list)
- Applied further optimization techniques during training and inference, described in the sections below
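The write-up does not include Faire's training code, so the following is a hedged sketch of LoRA-based PEFT using the Hugging Face `peft` library; the base checkpoint, adapter rank, and target modules are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Meta-Llama-3-8B"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA: train small low-rank adapter matrices instead of all 8B weights
lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically <1% of total weights
```

The appeal of LoRA here is that only the adapter matrices are updated, so an 8B-parameter model can be fine-tuned on a handful of GPUs without touching the full weight set.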
## Training Infrastructure
- Utilized 8 A100 GPUs for training
- Implemented DeepSpeed to optimize multi-GPU training (see the configuration sketch after this list)
- Tested different dataset sizes to measure the impact of training-data volume
- Training the largest model (Llama2-13b) on the large dataset took approximately 5 hours
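As a rough illustration of how DeepSpeed plugs into this kind of run, the snippet below passes a ZeRO Stage 2 configuration to the Hugging Face `Trainer` via `TrainingArguments`; the stage choice, batch sizes, and learning rate are assumptions rather than Faire's published settings.

```python
from transformers import TrainingArguments

# DeepSpeed config as a dict; ZeRO Stage 2 shards optimizer state across GPUs
ds_config = {
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="llama-relevance-lora",
    per_device_train_batch_size=8,   # spread across 8 A100s (assumed)
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
    deepspeed=ds_config,             # Trainer launches the DeepSpeed engine
)
```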
## Production Deployment
### Inference Optimization
- Implemented 8-bit model quantization (see the inference sketch after this list)
- Utilized batch processing on A100 GPUs
- Deployed DeepSpeed for improved inference speed
- Implemented horizontal scaling across GPU instances
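The post describes 8-bit quantization plus batched generation; below is a minimal sketch of that pattern using `bitsandbytes` quantization through Hugging Face `transformers` (Faire's stack reportedly uses DeepSpeed for the inference speedups). The model path, batch handling, and generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "path/to/fine-tuned-llama-relevance"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # Llama lacks a pad token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    device_map="auto",
)

def classify_batch(prompts: list[str]) -> list[str]:
    """Run one left-padded batch through the model and decode the label tokens."""
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=4, do_sample=False)
    generated = out[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```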
### Performance Metrics
- Achieved a 28% improvement in Krippendorff's Alpha over the previous GPT-based model (see the agreement sketch after this list)
- Llama3-8b performed best, matching Llama2-13b's quality with better efficiency
- Successfully scaled to 70 million predictions per day on 16 GPUs
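For scale, 70 million predictions per day across 16 GPUs works out to roughly 4.4 million predictions per GPU per day, or about 50 per GPU per second, which is only feasible with the batching and quantization described above. Krippendorff's Alpha, the agreement metric behind the 28% figure, can be computed with the open-source `krippendorff` package; the ordinal encoding of ESCI labels below is an assumption.

```python
import krippendorff
import numpy as np

# Rows = raters (human annotator, LLM); columns = query-product pairs.
# ESCI labels mapped to ordinal codes: E=3, S=2, C=1, I=0 (assumed encoding).
ratings = np.array([
    [3, 2, 0, 1, 3, 2],   # human labels
    [3, 2, 0, 0, 3, 2],   # model predictions
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```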
# Key Learnings and Best Practices
## Model Selection Insights
- Fine-tuned open-source LLMs demonstrated strong performance
- Larger datasets proved more important than model size
- Llama3-8b achieved optimal balance of performance and efficiency
- Basic prompt engineering alone proved insufficient for domain-specific tasks
## Operational Considerations
- Self-hosting reduced operational costs significantly
- Batch processing crucial for high-throughput requirements
- GPU optimization techniques essential for production deployment
- Clear problem definition and high-quality labeled data critical for success
## Data Management
- Dataset size and composition significantly impact model performance
- Incremental improvements in data quality led to better results
- Balanced dataset creation proved crucial for model reliability (see the sampling sketch after this list)
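A hedged sketch of what balanced dataset creation might look like in pandas; the column name, sample counts, and helper function are hypothetical.

```python
import pandas as pd

def balance_by_label(df: pd.DataFrame, per_class: int, seed: int = 42) -> pd.DataFrame:
    """Downsample each ESCI class to at most the same number of examples."""
    return (
        df.groupby("label", group_keys=False)
          .apply(lambda g: g.sample(n=min(per_class, len(g)), random_state=seed))
          .reset_index(drop=True)
    )

# e.g. balanced = balance_by_label(labeled_pairs, per_class=25_000)
```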
# Production Results and Impact
## System Capabilities
- Processes tens of millions of query-product pairs daily
- Enables daily relevance measurements vs. previous monthly cadence
- Provides near real-time feedback on search algorithm performance
## Applications
- Offline retrieval analysis
- Personalization measurement
- Experimental contribution assessment
- Ranker optimization to balance engagement and relevance
# Future Developments
## Planned Improvements
- Exploring real-time inference implementation
- Investigating model distillation for lower latency
- Considering RAG techniques for improved domain context
- Evaluating multimodal LLMs like LLaVA for image processing
## Technical Roadmap
- Working on reducing inference costs
- Developing explainability features for relevance decisions
- Investigating chain-of-thought reasoning for performance improvement
This case study exemplifies a comprehensive LLMOps implementation, showing how careful consideration of technical choices, infrastructure setup, and operational requirements can lead to a successful production deployment of LLMs for business-critical applications.