Pinterest: Large Language Models for Search Relevance via Knowledge Distillation

LLMOps Database

Tech

Company

Title

Large Language Models for Search Relevance via Knowledge Distillation

Industry

Tech

Link

https://medium.com/pinterest-engineering/improving-pinterest-search-relevance-using-large-language-models-4cd938d4e892

Year

2024

Summary (short)

Pinterest tackled the challenge of improving search relevance by implementing a large language model-based system. They developed a cross-encoder LLM teacher model trained on human-annotated data, which was then distilled into a lightweight student model for production deployment. The system processes rich Pin metadata including titles, descriptions, and synthetic image captions to predict relevance scores. The implementation resulted in a 2.18% improvement in search feed relevance (nDCG@20) and over 1.5% increase in search fulfillment rates globally, while successfully generalizing across multiple languages despite being trained primarily on US data.

meta

hugging_face

Pinterest's implementation of LLMs for search relevance represents a sophisticated approach to deploying large language models in a production environment while addressing key challenges of scale, latency, and cost. This case study offers valuable insights into how large tech companies can effectively leverage LLMs while maintaining practical operational constraints.

# Overview of the Problem and Solution

Pinterest needed to improve their search relevance system to better match user queries with appropriate content (Pins). The challenge was to move beyond simple engagement metrics to ensure content was genuinely relevant to user information needs. They developed a system using large language models, but faced the typical challenges of deploying such models in production: latency requirements, cost considerations, and the need to scale to billions of queries.

Their solution involved a two-stage approach:

* A powerful LLM-based teacher model for high-quality relevance predictions
* A distilled, lightweight student model for production serving

This architecture allowed them to leverage the power of LLMs while maintaining practical serving constraints.

# Technical Implementation Details

## Teacher Model Architecture

The teacher model uses a cross-encoder architecture, implemented with various language models including BERT, T5, DeBERTa, XLM-RoBERTa, and Llama-3-8B. The model processes rich text features from Pins, including:

* Pin titles and descriptions
* Synthetic image captions (generated using BLIP)
* High-engagement query tokens
* User-curated board titles
* Link titles and descriptions from external webpages

For larger models like Llama, they implemented several optimization techniques:

* Quantized model weights
* qLoRA for fine-tuning
* Gradient checkpointing
* Mixed precision training

## Production Deployment Strategy

The production deployment strategy shows careful consideration of real-world constraints.
Instead of serving the large teacher model directly, Pinterest uses knowledge distillation to create a lightweight student model. This student model processes various pre-computed features:

* Query-level features including interest features and SearchSAGE embeddings
* Pin-level features including PinSAGE embeddings and visual embeddings
* Query-Pin interaction features like BM25 scores and historical engagement rates

The distillation process involves using the teacher model to generate labels on billions of search impressions, which are then used to train the student model. This approach allows them to scale beyond their initial human-annotated dataset and generalize to multiple languages and markets.

# Quality Control and Evaluation

Pinterest implemented a comprehensive evaluation strategy:

## Offline Evaluation

* Accuracy metrics for 5-scale relevance predictions
* AUROC metrics for binarized labels with different thresholds
* Comparative analysis of different language models and feature combinations
* Assessment of training data scale effects

## Online Evaluation

* A/B testing in production
* Human relevance evaluations using nDCG@20
* Cross-language performance assessment
* Search fulfillment rate monitoring

# Key Results and Insights

The implementation demonstrated several important outcomes:

* Model Performance:
  * The Llama-3-8B teacher model outperformed the BERT-base model by 12.5%
  * Each additional text feature improved model performance
  * Scaling up training data through distillation showed consistent improvements
* Production Impact:
  * 2.18% improvement in search feed relevance
  * Over 1.5% increase in search fulfillment rates
  * Successful generalization to non-US markets despite limited training data

# Technical Challenges and Solutions

Several key challenges were addressed in the implementation:

* Scale: The system needed to handle billions of queries and Pins.
This was solved through the teacher-student architecture and efficient feature pre-computation.
* Latency: Real-time serving requirements were met by distilling the large teacher model into a lightweight student model using pre-computed embeddings.
* Data Quality: The limited human-annotated dataset was expanded through semi-supervised learning and knowledge distillation.
* Multilingual Support: Despite training primarily on US data, the system successfully generalized to other languages through the use of multilingual models and rich feature engineering.

# Architecture and Infrastructure Considerations

The system architecture shows careful consideration of production requirements:

* Feature Pipeline: Rich text features are processed and stored in advance
* Embedding Systems: Specialized embedding models (SearchSAGE, PinSAGE) provide pre-computed representations
* Serving Infrastructure: Lightweight student model serves traffic with low latency
* Training Pipeline: Daily processing of search engagement data for continuous model improvement

# Future Directions

Pinterest's roadmap for the system includes several promising directions:

* Integration of servable LLMs for real-time inference
* Expansion into vision-and-language multimodal models
* Implementation of active learning for dynamic training data improvement

This case study demonstrates a practical approach to leveraging LLMs in a production search system while maintaining performance and efficiency requirements. The success of the implementation shows how careful architecture choices and engineering practices can make advanced AI technologies viable in large-scale production environments.
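
The teacher-student distillation described under Production Deployment Strategy can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `teacher_soft_labels` is a hypothetical stand-in for the fine-tuned LLM teacher, the two numeric features stand in for pre-computed query and Pin signals, the use of soft (probabilistic) labels is one common distillation choice that the case study does not confirm, and the student is a plain softmax regression rather than Pinterest's actual student model. The sketch only shows the flow: label unlabeled impressions with the teacher, then fit the student to those labels.

```python
import math
import random

NUM_CLASSES = 5  # 5-scale relevance labels, as in the offline evaluation


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_soft_labels(features):
    """Hypothetical teacher: maps toy features to a distribution over
    the 5 relevance levels. A real teacher would be the LLM cross-encoder."""
    score = 0.5 * (features[0] + features[1])  # toy relevance in [0, 1]
    logits = [-8.0 * (score - c / (NUM_CLASSES - 1)) ** 2
              for c in range(NUM_CLASSES)]
    return softmax(logits)


class StudentModel:
    """Lightweight softmax-regression student over pre-computed features."""

    def __init__(self, num_features):
        self.w = [[0.0] * num_features for _ in range(NUM_CLASSES)]
        self.b = [0.0] * NUM_CLASSES

    def predict_proba(self, x):
        logits = [sum(wj * xj for wj, xj in zip(row, x)) + bias
                  for row, bias in zip(self.w, self.b)]
        return softmax(logits)

    def train_step(self, x, target, lr):
        """One SGD step on cross-entropy against the teacher's soft labels.
        Returns the loss measured before the update."""
        probs = self.predict_proba(x)
        for c in range(NUM_CLASSES):
            grad = probs[c] - target[c]  # gradient w.r.t. the class-c logit
            for j, xj in enumerate(x):
                self.w[c][j] -= lr * grad * xj
            self.b[c] -= lr * grad
        return -sum(t * math.log(p) for t, p in zip(target, probs))


def distill(num_examples=2000, epochs=5, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Unlabeled "search impressions": two toy features per query-Pin pair.
    impressions = [[rng.random(), rng.random()] for _ in range(num_examples)]
    # Step 1: the teacher labels every impression.
    labeled = [(x, teacher_soft_labels(x)) for x in impressions]
    # Step 2: the student is trained only on teacher-generated labels.
    student = StudentModel(num_features=2)
    epoch_losses = []
    for _ in range(epochs):
        rng.shuffle(labeled)
        total = 0.0
        for x, target in labeled:
            total += student.train_step(x, target, lr)
        epoch_losses.append(total / len(labeled))
    return student, epoch_losses


student, losses = distill()
print("mean loss per epoch:", [round(v, 4) for v in losses])
```

Because the student never sees human annotations directly, the amount of training data is bounded only by how many impressions the teacher can label offline, which is the property the case study credits for scaling beyond the human-annotated set.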
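
The nDCG@20 metric used for the human relevance evaluations can be computed as in the minimal sketch below. The case study does not say which gain formulation Pinterest uses, so the exponential gain `2**rel - 1` with a `log2` position discount is an assumption (it is one standard definition), and the two example rankings are made-up relevance labels rather than Pinterest data.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items of a ranked list.
    `relevances` holds graded labels in ranked order (position 0 first)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=20):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Made-up 5-scale labels for the same five Pins under two rankers.
baseline_ranking = [3, 5, 2, 4, 1]
treatment_ranking = [5, 4, 3, 2, 1]

baseline = ndcg_at_k(baseline_ranking)
treatment = ndcg_at_k(treatment_ranking)
relative_lift = (treatment - baseline) / baseline
print(f"baseline={baseline:.4f} treatment={treatment:.4f} "
      f"lift={relative_lift:.2%}")
```

A reported figure such as a 2.18% improvement would correspond to a relative lift of this kind, averaged over many evaluated queries rather than a single ranking.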
