ZenML

Large-Scale Learned Retrieval System with Two-Tower Architecture

Pinterest 2024

Pinterest developed and deployed a large-scale learned retrieval system using a two-tower architecture to improve content recommendations for over 500 million monthly active users. The system replaced traditional heuristic approaches with an embedding-based retrieval system learned from user engagement data. The implementation includes automatic retraining capabilities and careful version synchronization between model artifacts. The system achieved significant success, becoming one of the top-performing candidate generators with the highest user coverage and ranking among the top three in save rates.

Industry: Tech
Summary

Pinterest, a visual discovery platform serving over 500 million monthly active users (MAUs), undertook a significant effort to modernize their recommendation system’s retrieval stage. Previously, their retrieval approaches relied heavily on heuristic methods such as Pin-Board graph relationships or user-followed interests. This case study documents their transition to a learned, embedding-based retrieval system that leverages machine learning models trained purely on logged user engagement events to power personalized content recommendations at scale.

The system was deployed for both the homefeed (the primary content discovery surface) and notifications, representing a substantial production ML operation. This is a strong example of ML systems engineering at scale rather than traditional LLM-based generative AI, but it shares many operational patterns with LLMOps, particularly around embedding management, model serving, versioning, and continuous retraining pipelines.

Architecture Overview

Pinterest’s recommendation system follows a multi-stage funnel design that is common in large-scale recommendation systems. The funnel starts with candidate generation (retrieval) from billions of pins, narrows down to thousands of candidates through a pre-ranking or “light-weight scoring” (LWS) model, and finally applies a full ranking model to generate personalized feeds.
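The funnel can be sketched as a simple pipeline. The stage stand-ins and cut-off numbers below are illustrative placeholders, not Pinterest's actual models or thresholds:

```python
# Illustrative stand-ins for each funnel stage; in production these are a
# learned retriever, an LWS pre-ranker, and a transformer ranking model.
def retrieve(user, corpus, k):
    return corpus[:k]                      # candidate generation

def lws_score(user, candidates):
    return sorted(candidates)              # light-weight scoring (LWS)

def full_rank(user, shortlist):
    return list(reversed(shortlist))       # full ranking model

def recommend(user, corpus, feed_size=25):
    candidates = retrieve(user, corpus, k=3000)    # billions -> thousands
    shortlist = lws_score(user, candidates)[:500]  # pre-ranking cut
    return full_rank(user, shortlist)[:feed_size]  # final personalized feed

feed = recommend("user-1", list(range(10_000)))
```

Each stage trades recall for precision: the retriever must be cheap enough to scan a huge corpus, while the full ranker can afford a heavyweight model because it only sees a short list.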

The ranking model at Pinterest is described as a “powerful transformer-based model” learned from raw user engagement sequences with mixed device serving. It excels at capturing both long-term and short-term user engagement patterns. However, the retrieval stage historically lagged behind, relying on heuristic approaches rather than learned representations.

Two-Tower Model Architecture

The core of the learned retrieval system is a two-tower neural network architecture, which is widely adopted in industry for retrieval tasks. This architecture separates the model into two components:

- A user tower that encodes user features and engagement context into a user embedding.
- An item tower that encodes item (Pin) features into an item embedding.
The two-tower design enables efficient online serving because user and item embeddings can be computed independently. At serving time, personalized retrieval is achieved through nearest neighbor search between the user embedding and pre-computed item embeddings.
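A minimal numpy sketch of this pattern follows. The single linear layer per tower and the feature dimensions are hypothetical; in production each tower is a deep network trained on engagement logs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned weights for each tower (illustrative only).
D_USER, D_ITEM, D_EMB = 32, 16, 8
W_user = rng.normal(size=(D_USER, D_EMB))
W_item = rng.normal(size=(D_ITEM, D_EMB))

def user_tower(user_features: np.ndarray) -> np.ndarray:
    """Map raw user features to a unit-length embedding."""
    z = user_features @ W_user
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def item_tower(item_features: np.ndarray) -> np.ndarray:
    """Map raw item (Pin) features to a unit-length embedding."""
    z = item_features @ W_item
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Because the towers are independent, item embeddings can be pre-computed
# offline; only the user embedding is computed at request time.
items = item_tower(rng.normal(size=(1000, D_ITEM)))   # offline index
user = user_tower(rng.normal(size=(1, D_USER)))       # online, per request

scores = (user @ items.T).ravel()   # dot-product relevance
top_k = np.argsort(-scores)[:5]     # retrieve the top-5 candidates
```

The key property is that the user never needs to be scored jointly with each item by a deep network; a dot product (or nearest neighbor lookup) suffices at serving time.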

Training Methodology

The model is trained as an extreme multi-class classification problem. Since computing a softmax over the entire corpus of billions of items is computationally infeasible, Pinterest employs in-batch negative sampling as a memory-efficient alternative. The training objective optimizes the probability of retrieving the correct item given the user context.
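The in-batch negative sampling objective can be sketched as follows: each user's logged positive item serves as a negative for every other user in the same batch, so the softmax runs over the batch rather than the full corpus. The temperature value is a hypothetical hyperparameter:

```python
import numpy as np

def in_batch_softmax_loss(user_emb, item_emb, temperature=0.05):
    """Softmax loss where each positive pair's negatives are the other
    items in the same batch (in-batch negative sampling)."""
    logits = user_emb @ item_emb.T / temperature       # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds the logged positives (user_i engaged item_i).
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(1)
B, D = 64, 8
u = rng.normal(size=(B, D)); u /= np.linalg.norm(u, axis=1, keepdims=True)
v = rng.normal(size=(B, D)); v /= np.linalg.norm(v, axis=1, keepdims=True)

loss_random = in_batch_softmax_loss(u, v)   # unrelated pairs: high loss
loss_aligned = in_batch_softmax_loss(u, u)  # perfectly matched pairs: low loss
```

The memory saving comes from reusing the batch's own item embeddings as negatives, so no extra corpus sampling or embedding lookups are needed per step.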

An important operational consideration is addressing popularity bias in the training data. Since items are sampled from training sets that reflect natural popularity distributions, Pinterest applies logit correction based on estimated item probabilities. This sampling bias correction helps ensure the model doesn’t simply learn to recommend popular items but instead captures true user-item relevance.
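This correction is commonly implemented as "logQ correction": subtracting the log of each item's estimated sampling probability from its logit. A minimal sketch, with hypothetical numbers:

```python
import numpy as np

def corrected_logits(raw_logits, sampling_probs):
    """Apply logQ correction: subtract the log of each item's estimated
    sampling probability so frequently sampled (popular) items are not
    systematically over-scored during in-batch training."""
    return raw_logits - np.log(sampling_probs)

# Two items with identical raw relevance, but item 0 is 100x more popular,
# so it appears as an in-batch negative far more often.
raw = np.array([2.0, 2.0])
probs = np.array([0.10, 0.001])
adjusted = corrected_logits(raw, probs)
```

After correction the rarer item receives the higher adjusted logit, offsetting the penalty it would otherwise suffer from being under-represented in the sampled softmax.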

System Design for Production Serving

Given the scale of serving 500+ million MAUs, the system design required careful engineering. The architecture is split into two main components:

Online Serving: User embeddings are computed at request time, allowing the system to leverage the most up-to-date user features for personalized retrieval. This ensures that recent user actions immediately influence recommendations.

Offline Indexing: Millions of item embeddings are pre-computed and pushed to Pinterest’s in-house ANN (Approximate Nearest Neighbor) serving system called Manas. The Manas system is based on HNSW (Hierarchical Navigable Small World graphs), a state-of-the-art algorithm for efficient approximate nearest neighbor search.
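The online/offline split can be sketched as below. Manas and its HNSW index are proprietary, so this sketch substitutes exact nearest neighbor search to stay dependency-free; an open-source HNSW library would provide the approximate version:

```python
import numpy as np

class ExactNNIndex:
    """Stand-in for an ANN service (Manas uses HNSW; exact search here
    keeps the sketch dependency-free)."""
    def __init__(self, item_ids, item_embs):
        self.item_ids = np.asarray(item_ids)
        self.item_embs = np.asarray(item_embs)   # built offline, in batch

    def query(self, user_emb, k=3):
        scores = self.item_embs @ user_emb
        top = np.argsort(-scores)[:k]
        return self.item_ids[top], scores[top]

rng = np.random.default_rng(2)
embs = rng.normal(size=(100, 8))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)

index = ExactNNIndex(np.arange(100), embs)       # offline: index build
user_emb = embs[42] + 0.01 * rng.normal(size=8)  # online: per-request embedding
ids, scores = index.query(user_emb, k=3)
```

The design keeps the expensive, corpus-wide work (embedding and indexing millions of items) offline, while the per-request work is a single user-tower forward pass plus one index query.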

Auto-Retraining Infrastructure

A critical aspect of production ML systems is model freshness. Pinterest established an automated retraining workflow to periodically refresh models with recent data, ensuring the system captures evolving user preferences and content trends.

However, the two-tower architecture introduces a unique operational challenge: the two model artifacts (user tower and item tower) are deployed to separate services. This creates a version synchronization problem. If the user embedding model is updated before the item index is rebuilt (or vice versa), the embedding spaces will be mismatched, causing a drastic drop in candidate quality.

Pinterest’s solution involves attaching model version metadata to each ANN search service host. This metadata contains a mapping from model name to the latest model version and is generated alongside the index. At serving time, the homefeed backend first retrieves the version metadata from its assigned ANN service host and uses the corresponding model version to compute user embeddings.

This approach ensures “anytime” model version synchronization—even during index rollouts when some ANN hosts may have version N while others have version N+1, the system correctly matches user embeddings to the appropriate item embedding space. Additionally, Pinterest maintains the latest N versions of the user tower model to support rollback capability, ensuring they can compute appropriate user embeddings even if the ANN service is rolled back to a previous build.
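The version-matching flow can be sketched as follows. All names here (the metadata key, registry, and host objects) are hypothetical illustrations of the pattern, not Pinterest's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class AnnHost:
    """An ANN search host plus the version metadata generated alongside
    its index build (a mapping from model name to model version)."""
    name: str
    model_versions: dict

# Hypothetical registry keeping the latest N user-tower versions so the
# backend can still embed users after an index rollback.
USER_TOWER_REGISTRY = {"v7": "user_tower_v7", "v8": "user_tower_v8"}

def compute_user_embedding(user_id: str, host: AnnHost) -> str:
    # Step 1: read the version metadata from the assigned ANN host.
    version = host.model_versions["learned_retrieval"]
    # Step 2: embed the user with the matching tower version, so user and
    # item embeddings are guaranteed to share one embedding space.
    tower = USER_TOWER_REGISTRY[version]
    return f"{tower}({user_id})"

# Mid-rollout, hosts can temporarily disagree on versions; each request
# still embeds the user with whichever version its assigned host carries.
host_a = AnnHost("ann-01", {"learned_retrieval": "v7"})
host_b = AnnHost("ann-02", {"learned_retrieval": "v8"})
```

Because the backend always derives the user-tower version from the host it is about to query, consistency holds per-request rather than depending on a global, atomic deployment.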

Results and Impact

The homefeed at Pinterest is described as “probably the most complicated system” with over 20 candidate generators in production using different retrieval strategies. The learned retrieval candidate generator specifically focuses on driving user engagement.

Key outcomes reported include:

- Becoming one of the top-performing candidate generators among the 20+ in production.
- Achieving the highest user coverage of any candidate generator.
- Ranking among the top three candidate generators in save rates.
- Enabling the deprecation of existing candidate-generation infrastructure.
While these results are self-reported and lack specific quantitative metrics, the fact that the system allowed deprecation of existing infrastructure suggests meaningful improvements in both effectiveness and system consolidation.

Operational Patterns and Lessons

Several operational patterns emerge from this case study that are relevant to anyone building production ML systems:

Separation of concerns in serving: By splitting online (user embedding) and offline (item indexing) components, the system achieves efficiency while maintaining personalization. The user tower can leverage real-time features while item embeddings can be pre-computed and indexed for fast retrieval.

Version synchronization as a first-class concern: In systems with multiple interdependent model artifacts, version management becomes critical. Pinterest’s metadata-based approach provides a robust solution that handles partial rollouts and rollbacks gracefully.

Continuous retraining with validation: The auto-retraining workflow includes model performance validation before deployment, ensuring model quality is maintained over time.

Bias correction in training: Addressing sampling bias is essential for retrieval systems to avoid the “popular item” trap and provide genuinely personalized recommendations.

Limitations and Considerations

While this case study provides valuable insights, readers should note some limitations:

- The reported results are self-reported and lack specific quantitative metrics, such as percentage lifts in engagement or save rates.
- As a first-party account, the write-up naturally emphasizes successes over failure modes and trade-offs encountered along the way.
Despite these limitations, this case study offers a solid example of how to operationalize embedding-based retrieval at massive scale, with particular attention to the often-overlooked challenges of model versioning and synchronization in distributed ML systems.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.


Building Economic Infrastructure for AI with Foundation Models and Agentic Commerce

Stripe 2025

Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.


Building an Enterprise-Grade AI Agent for Recruiting at Scale

LinkedIn 2025

LinkedIn developed Hiring Assistant, an AI agent designed to transform the recruiting workflow by automating repetitive tasks like candidate sourcing, evaluation, and engagement across 1.2+ billion profiles. The system addresses the challenge of recruiters spending excessive time on pattern-recognition tasks rather than high-value decision-making and relationship building. Using a plan-and-execute agent architecture with specialized sub-agents for intake, sourcing, evaluation, outreach, screening, and learning, Hiring Assistant combines real-time conversational interfaces with large-scale asynchronous execution. The solution leverages LinkedIn's Economic Graph for talent insights, custom fine-tuned LLMs for candidate evaluation, and cognitive memory systems that learn from recruiter behavior over time. The result is a globally available agentic product that enables recruiters to work with greater speed, scale, and intelligence while maintaining human-in-the-loop control for critical decisions.
