ZenML

Building Price Prediction and Similar Item Search Models for E-commerce

eBay 2024

eBay developed a hybrid system for pricing recommendations and similar item search in their marketplace, specifically focusing on sports trading cards. They combined semantic similarity models with direct price prediction approaches, using transformer-based architectures to create embeddings that balance both price accuracy and item similarity. The system helps sellers price their items accurately by finding similar items that have sold recently, while maintaining semantic relevance.

Industry

E-commerce

Overview

This case study comes from eBay’s Israel research team, presented by a researcher who had previously worked at NICE (an Israeli tech company) about six years earlier. The presentation focuses on eBay’s approach to product pricing assistance: helping sellers determine appropriate prices for their items using machine learning and embedding-based retrieval. eBay operates at massive scale, with approximately 2 billion listings, over 130 million active buyers, and 190 marketplace sites worldwide. The Israeli research teams focus heavily on production-ready solutions built alongside product teams, while also maintaining capacity for more experimental research.

The Problem

When sellers want to list items on eBay, they encounter a listing creation form where they must specify various details, including a title, item specifics, and, crucially, a price. Pricing is particularly challenging for several reasons.

The presentation specifically uses sports trading cards as the primary example, which represents a massive market in the United States. Sports card pricing is extremely nuanced - factors like autographs, rarity (e.g., only one copy printed), condition, grading, and specific player/year combinations can dramatically affect value. Traditional keyword search fails because collectors use abbreviated terms, slang, and domain-specific terminology that don’t match well with standard text retrieval.

Technical Approach: Embedding-Based Retrieval

Rather than using generative LLMs, eBay’s solution relies on dense vector embeddings and retrieval systems. The core pipeline works as follows:

The system takes a product title, passes it through an embedding model to generate a dense vector representation, stores all historically-sold items as vectors in a vector database, and when a seller creates a new listing, performs a k-nearest neighbors (KNN) search to find similar previously-sold products. These similar items are then shown to sellers as pricing guidance, either as a specific recommended price or as examples of comparable sales.
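
The pipeline above can be sketched as follows. This is an illustrative toy, not eBay's system: a hashed character-trigram "embedding" stands in for the trained encoder, a NumPy array stands in for the vector database, and the example listings and prices are invented.

```python
import numpy as np

def embed(title: str, dim: int = 64) -> np.ndarray:
    """Placeholder embedding: hash character trigrams into a dense unit vector.
    In production this would be a trained BERT-style encoder."""
    vec = np.zeros(dim)
    for i in range(len(title) - 2):
        vec[hash(title[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Stand-in "vector database": embeddings of historically sold items plus prices.
sold_items = [
    ("2003 LeBron James Topps RC rookie card", 1200.0),
    ("2018 Luka Doncic Prizm rookie card", 300.0),
    ("2003 LeBron James Topps rookie card PSA 9", 2500.0),
]
index = np.stack([embed(title) for title, _ in sold_items])

def knn(query_title: str, k: int = 2):
    """Brute-force cosine k-nearest-neighbor search over the sold-item index."""
    q = embed(query_title)
    scores = index @ q                  # vectors are unit-normalized
    top = np.argsort(-scores)[:k]
    return [(sold_items[i], float(scores[i])) for i in top]

for (title, price), score in knn("2003 LeBron James rookie card RC"):
    print(f"{score:.2f}  ${price:>7.2f}  {title}")
```

A production system would replace the brute-force scan with an approximate nearest-neighbor index, since scanning billions of vectors per query is infeasible.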

The key advantage of dense embeddings over traditional text search is handling domain-specific variations. For example, in sports cards, “RC” means “Rookie Card” - a semantic embedding model can learn these equivalences while keyword search fails. Similarly, “signed” and “auto” (for autograph) represent the same concept.

Training the Embedding Model

The team uses BERT-based transformer models (specifically encoder-only architectures) for generating embeddings. A critical insight from their experience is that off-the-shelf embedding models trained on general text perform significantly worse than domain-specific models trained on eBay’s data. When you have the GPU resources and sufficient training data, custom training yields substantially better results.

Generating Training Data

The training approach uses contrastive learning with pairs of similar items. To generate positive pairs (items that should have similar embeddings), the team leverages user behavior data.

Negative Sampling Strategy

For negative examples (items that should have dissimilar embeddings), they use in-batch negatives - a common technique where other items in the training batch serve as negatives. With batch sizes of 64, each positive pair gets 63 implicit negative examples “for free.” This approach is computationally efficient and generally effective since random items are unlikely to be similar.
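The in-batch-negatives idea can be sketched as an InfoNCE-style loss: row i of the positives matrix is the positive pair for anchor i, and the other batch-1 rows serve as negatives. This is a generic sketch, not eBay's exact loss; the temperature value is an assumption.

```python
import numpy as np

def info_nce_loss(anchors: np.ndarray, positives: np.ndarray,
                  temp: float = 0.05) -> float:
    """Contrastive loss with in-batch negatives.

    anchors, positives: (batch, dim) L2-normalized embeddings; row i of
    `positives` is the positive for row i of `anchors`, and the remaining
    batch-1 rows act as 'free' negatives."""
    logits = anchors @ positives.T / temp           # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    # Softmax cross-entropy with the diagonal as the correct class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
batch, dim = 64, 32                  # batch of 64 -> 63 negatives per anchor
a = rng.normal(size=(batch, dim))
a /= np.linalg.norm(a, axis=1, keepdims=True)
p = a + 0.1 * rng.normal(size=(batch, dim))        # positives: noisy copies
p /= np.linalg.norm(p, axis=1, keepdims=True)
print(f"loss: {info_nce_loss(a, p):.4f}")
```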

However, the team found that “soft negatives” (random negatives) aren’t sufficient for learning fine-grained distinctions. They implemented hard negative mining: finding items that appear similar but have different prices. For example, cards featuring the same player from the same team but with different grades or conditions. This forces the model to learn the nuanced attributes that actually affect pricing.
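Hard-negative mining of the kind described above can be sketched as a scan for pairs that look alike in embedding space but sold at very different prices. The similarity and price-ratio thresholds below are illustrative assumptions, not values from the talk.

```python
import numpy as np

def mine_hard_negatives(emb: np.ndarray, prices: np.ndarray,
                        sim_thresh: float = 0.9, price_ratio: float = 2.0):
    """Hard-negative candidates: item pairs with high cosine similarity but
    a large price gap, e.g. the same player and team at different grades.
    Thresholds are illustrative."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    pairs = []
    n = len(prices)
    for i in range(n):
        for j in range(i + 1, n):
            hi, lo = max(prices[i], prices[j]), min(prices[i], prices[j])
            if sims[i, j] >= sim_thresh and hi / lo >= price_ratio:
                pairs.append((i, j))
    return pairs

# Items 0 and 1 are near-identical in embedding space but 4x apart in price.
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
prices = np.array([100.0, 400.0, 120.0])
print(mine_hard_negatives(emb, prices))
```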

The Semantic Similarity vs. Pricing Accuracy Trade-off

A central finding of this work is a fundamental tension between semantic similarity and pricing accuracy. The team experimented with two distinct approaches:

Approach 1: Semantic Similarity (Siamese Networks)

Training the model purely on semantic similarity using contrastive learning produces embeddings where semantically similar items cluster together. When evaluated, however, the team found cases where very similar items had vastly different prices because the model missed pricing-relevant nuances. For example, “lot of 12” boxes versus “lot of 3” boxes are semantically similar, but the 4x difference in quantity significantly affects price.

Approach 2: Title-to-Price Prediction

Training a transformer model to directly predict the sale price from the title text, then extracting embeddings from the CLS token of the final layer, produces embeddings that cluster items by price regardless of semantic content. While this improved pricing accuracy (a Mean Absolute Error of $29 versus $38 for semantic similarity), it created a trust problem: the model might recommend pricing a LeBron James card based on a Stephen Curry card simply because they happened to sell at similar prices - semantically and visually completely different items that would confuse sellers.
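The talk does not give implementation details for the price head. One common pattern, sketched here with synthetic stand-ins for the CLS-token embeddings, is to fit a regression head on the pooled representation and evaluate with MAE, the metric quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim = 500, 16
# Stand-in for CLS-token embeddings of listing titles; in the eBay system
# these come from the final layer of a fine-tuned transformer.
cls_embeddings = rng.normal(size=(n, dim))
true_weights = rng.normal(size=dim)
# Synthetic "sale prices" that are a noisy linear function of the embeddings.
prices = cls_embeddings @ true_weights * 10 + 100 + rng.normal(scale=5, size=n)

# Linear regression head on top of the (frozen) embeddings, via least squares.
X = np.hstack([cls_embeddings, np.ones((n, 1))])    # append a bias column
w, *_ = np.linalg.lstsq(X, prices, rcond=None)

mae = float(np.mean(np.abs(X @ w - prices)))        # Mean Absolute Error
print(f"MAE: ${mae:.2f}")
```

In the real system the transformer weights are trained end to end on the price objective rather than frozen; the frozen linear head here just keeps the sketch short.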

Quantitative Evaluation

The team conducted rigorous evaluation on thousands of test samples.

Multi-Task Learning Solution

To address this trade-off, the team developed a multi-task learning architecture that trains on both objectives simultaneously, weighting them with a parameter alpha.

The experiments showed a clear continuum: as alpha shifts toward price prediction, MAE decreases but semantic matching errors (wrong player identification) increase. Conversely, emphasizing semantic similarity improves matching accuracy but hurts price predictions.
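The alpha continuum can be sketched as a single weighted objective. The exact loss forms eBay used are not given, so the MAE price term and InfoNCE-style similarity term below are assumptions; only the alpha trade-off itself comes from the talk.

```python
import numpy as np

def multitask_loss(price_pred: np.ndarray, price_true: np.ndarray,
                   anchor_emb: np.ndarray, positive_emb: np.ndarray,
                   alpha: float, temp: float = 0.05) -> float:
    """Weighted multi-task objective: alpha -> 1 emphasizes price prediction
    (lower MAE, weaker semantic matching); alpha -> 0 emphasizes semantic
    similarity. Component losses are illustrative choices."""
    price_loss = float(np.mean(np.abs(price_pred - price_true)))   # MAE term
    logits = anchor_emb @ positive_emb.T / temp    # in-batch contrastive term
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    sim_loss = float(-np.mean(np.diag(log_probs)))
    return alpha * price_loss + (1.0 - alpha) * sim_loss

rng = np.random.default_rng(2)
emb = rng.normal(size=(8, 4))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
pred = rng.normal(size=8)
true = pred + 1.0                                  # MAE of exactly 1.0
for alpha in (0.0, 0.5, 1.0):
    print(alpha, multitask_loss(pred, true, emb, emb, alpha))
```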

Production Considerations and Business Impact

The system is designed for production deployment with several practical considerations in mind.

eBay emphasizes that the goal is explicitly not manipulation or profit optimization for the platform: the system purely provides data-driven guidance to help sellers make informed pricing decisions based on comparable historical sales.

Key Takeaways

The presentation offers several actionable insights for practitioners building similar retrieval systems.

This case study demonstrates sophisticated application of embedding-based retrieval in a production e-commerce context, highlighting the practical engineering and modeling decisions required to balance multiple competing objectives.

More Like This

Large-Scale Personalization and Product Knowledge Graph Enhancement Through LLM Integration

DoorDash 2025

DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.

customer_support question_answering classification +64

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

healthcare fraud_detection customer_support +90

Enterprise-Scale GenAI and Agentic AI Deployment in B2B Supply Chain Operations

Wesco 2025

Wesco, a B2B supply chain and industrial distribution company, presents a comprehensive case study on deploying enterprise-grade AI applications at scale, moving from POC to production. The company faced challenges in transitioning from traditional predictive analytics to cognitive intelligence using generative AI and agentic systems. Their solution involved building a composable AI platform with proper governance, MLOps/LLMOps pipelines, and multi-agent architectures for use cases ranging from document processing and knowledge retrieval to fraud detection and inventory management. Results include deployment of 50+ use cases, significant improvements in employee productivity through "everyday AI" applications, and quantifiable ROI through transformational AI initiatives in supply chain optimization, with emphasis on proper observability, compliance, and change management to drive adoption.

fraud_detection document_processing content_moderation +52