Company: Yelp
Title: Scaling Search Query Understanding with LLMs: From POC to Production
Industry: Tech
Year: 2025

Summary (short):
Yelp implemented LLMs to enhance their search query understanding capabilities, focusing on query segmentation and review highlights. They followed a systematic approach from ideation to production, using GPT-4 for initial development, fine-tuning smaller models for scale, and caching responses for head queries. The solution improved search relevance and user engagement while managing cost and latency through careful architectural decisions and a gradual rollout.
Yelp's implementation of LLMs for search query understanding represents a comprehensive case study in bringing generative AI to production at scale. The case study is particularly valuable as it details the complete journey from initial concept to full production deployment, with careful attention to practical considerations around cost, latency, and quality.

The company implemented LLMs to solve two main query understanding challenges:

* Query Segmentation: Breaking down and labeling the semantic parts of search queries (e.g., topic, location, time)
* Review Highlights: Generating expanded lists of relevant phrases to match in reviews for better snippet generation

Their approach to productionizing LLMs followed a systematic methodology that other organizations can learn from.

## Initial Formulation and Prototyping

The team began with careful task formulation, using GPT-4 for initial prototyping. They made several key architectural decisions:

* Combined related tasks (like spell correction and segmentation) into single prompts when the model could handle both effectively
* Implemented RAG by augmenting queries with business names and categories to improve context
* Carefully designed output formats and classification schemes that would integrate well with downstream systems
* Iteratively refined prompts through extensive testing and observation

## Proof of Concept Phase

To validate the approach before full deployment, they:

* Leveraged the power law distribution of queries to cache high-end LLM responses for frequently used queries
* Conducted offline evaluations using human-labeled datasets
* Performed A/B testing to measure real impact on user behavior
* Used metrics like Session/Search CTR to validate improvements

## Scaling Strategy

Their scaling approach was particularly well thought out, addressing common challenges in LLMOps:

1. Cost Management:
   * Used expensive models (GPT-4) only for development and creating training data
   * Fine-tuned smaller models for production use
   * Implemented extensive caching for head queries
   * Used even smaller models (BERT/T5) for real-time processing of tail queries (this tiered lookup is sketched after this list)

2. Quality Control:
   * Created comprehensive golden datasets for fine-tuning
   * Conducted human re-labeling of potentially problematic cases
   * Monitored query-level metrics to identify and fix issues
   * Implemented quality checks before uploading to production datastores

3. Infrastructure:
   * Built new signal datastores to support larger pre-computed signals
   * Used key/value DBs to optimize retrieval latency
   * Implemented batch processing for offline generation
   * Created fallback mechanisms for tail queries
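As an illustration of how this tiering can come together at serving time, here is a minimal sketch of a head/tail lookup, assuming a hypothetical pre-computed key/value cache (`segment_cache`) and a lightweight real-time segmenter (`light_segmenter`); the names and signal schema are illustrative stand-ins, not Yelp's actual interfaces.

```python
# Illustrative only: a tiered lookup for query-understanding signals.
# `segment_cache`, `light_segmenter`, and the signal schema are hypothetical
# stand-ins for a pre-computed key/value store and a BERT/T5-class model.

from dataclasses import dataclass
from typing import Optional


@dataclass
class QuerySegments:
    topic: Optional[str] = None
    location: Optional[str] = None
    time: Optional[str] = None


def get_query_segments(query: str, segment_cache: dict, light_segmenter) -> QuerySegments:
    """Return query-understanding signals for a search query.

    Head queries (the bulk of traffic) are served from a cache populated
    offline by a fine-tuned LLM; tail queries fall back to a smaller
    real-time model; anything that fails degrades to an empty signal so
    downstream ranking still works.
    """
    normalized = query.strip().lower()

    # 1. Head queries: pre-computed signals from the key/value datastore.
    cached = segment_cache.get(normalized)
    if cached is not None:
        return cached

    # 2. Tail queries: lightweight model invoked at request time.
    try:
        return light_segmenter(normalized)
    except Exception:
        # 3. Fallback: no segmentation rather than a failed search.
        return QuerySegments()
```

The key design choice this sketch captures is that the expensive LLM never sits on the request path: it only populates the cache offline, while request-time work is limited to a key/value lookup or a small-model call.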
The production architecture demonstrates careful consideration of real-world constraints:

* Head queries (95% of traffic) are served from pre-computed caches
* Tail queries use lighter-weight models optimized for speed
* Multiple fallback mechanisms ensure system reliability
* Infrastructure is designed to handle millions of distinct queries

## Results and Impact

The implementation showed significant improvements:

* Better search relevance through improved query understanding
* Increased Session/Search CTR across platforms
* Particularly strong improvements for less common queries
* Enhanced ability to highlight relevant review snippets

## Testing and Monitoring

The team implemented comprehensive testing:

* Offline evaluation using human-labeled datasets
* A/B testing for feature validation
* Query-level metric tracking
* Continuous monitoring of model outputs

## Lessons and Best Practices

Key takeaways from their experience include:

* Start with powerful models for development, then optimize for production
* Use caching strategically based on query distribution
* Implement gradual rollout strategies
* Maintain fallback options for all critical paths
* Focus on data quality for fine-tuning
* Consider downstream applications when designing model outputs

The case study also highlights important considerations around model selection, with different models used for different purposes:

* GPT-4/Claude for development and training data generation
* Fine-tuned GPT-4o mini for bulk query processing
* BERT/T5 for real-time processing of tail queries

This implementation showcases a mature approach to LLMOps, demonstrating how to balance competing concerns of cost, latency, and quality while successfully bringing LLM capabilities to production at scale.
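As a companion illustration of the training-data step, here is a minimal sketch of how a human-reviewed "golden" label could be serialized into the chat-format JSONL used to fine-tune a smaller model such as GPT-4o mini; the prompt wording and label schema are assumptions for illustration, not Yelp's actual prompts or schema.

```python
# Illustrative only: serializing a human-reviewed "golden" segmentation label
# into chat-format JSONL for supervised fine-tuning of a smaller model
# (e.g. GPT-4o mini). Prompt text and label fields are assumptions.

import json

SYSTEM_PROMPT = (
    "Segment the search query into topic, location, and time parts. "
    "Respond with a JSON object."
)


def to_finetune_record(query: str, golden_label: dict) -> str:
    """Format one labeled query as a JSONL line for fine-tuning."""
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
            {"role": "assistant", "content": json.dumps(golden_label)},
        ]
    }
    return json.dumps(record)


if __name__ == "__main__":
    # One labeled example: the label would come from GPT-4 output that a
    # human reviewer has corrected and approved.
    print(to_finetune_record(
        "open late sushi near downtown",
        {"topic": "sushi", "time": "open late", "location": "downtown"},
    ))
```

Each line pairs a query with the reviewed output the fine-tuned model should reproduce, which is what allows the cheaper model to approximate the larger model's quality on this narrow task.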
