eBay developed a hybrid system for pricing recommendations and similar item search in their marketplace, specifically focusing on sports trading cards. They combined semantic similarity models with direct price prediction approaches, using transformer-based architectures to create embeddings that balance both price accuracy and item similarity. The system helps sellers price their items accurately by finding similar items that have sold recently, while maintaining semantic relevance.
This case study comes from eBay’s Israel research team. The presenter is a researcher who had worked at Nice, a tech company, about six years earlier. The presentation covers eBay’s approach to pricing assistance: helping sellers determine appropriate prices for their items using machine learning and embedding-based retrieval. eBay operates at massive scale, with approximately 2 billion listings, over 130 million active buyers, and 190 marketplace sites worldwide. The Israeli research teams focus heavily on production-ready solutions built alongside product teams, while keeping some capacity for more experimental research.
When sellers want to list items on eBay, they encounter a listing creation form where they must specify various details including title, item specifics, and crucially, a price. Pricing is particularly challenging for several reasons:
The presentation specifically uses sports trading cards as the primary example, which represents a massive market in the United States. Sports card pricing is extremely nuanced - factors like autographs, rarity (e.g., only one copy printed), condition, grading, and specific player/year combinations can dramatically affect value. Traditional keyword search fails because collectors use abbreviated terms, slang, and domain-specific terminology that don’t match well with standard text retrieval.
Rather than using generative LLMs, eBay’s solution relies on dense vector embeddings and retrieval systems. The core pipeline works as follows:
The system takes a product title, passes it through an embedding model to generate a dense vector representation, stores all historically-sold items as vectors in a vector database, and when a seller creates a new listing, performs a k-nearest neighbors (KNN) search to find similar previously-sold products. These similar items are then shown to sellers as pricing guidance, either as a specific recommended price or as examples of comparable sales.
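The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not eBay's implementation: the vectors are hand-written stand-ins for the output of an embedding model, the in-memory list stands in for a vector database, the prices are made-up toy values, and using the median of the neighbors' prices as the suggested price is an assumption (the source only says a recommended price or comparable sales are shown).

```python
import math
from statistics import median

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_price_guidance(query_vec, sold_items, k=3):
    """Return the k most similar sold items and the median of their prices.

    sold_items is a list of (title, vector, sold_price) tuples; a brute-force
    scan stands in for the approximate KNN search a vector database would run.
    """
    ranked = sorted(sold_items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    neighbors = ranked[:k]
    return neighbors, median(price for _, _, price in neighbors)

# Toy index of previously sold items (vectors and prices are illustrative only).
SOLD = [
    ("2003 LeBron James RC PSA 10", [0.90, 0.10, 0.00], 1200.0),
    ("2003 LeBron James rookie card graded", [0.85, 0.15, 0.05], 1100.0),
    ("2009 Stephen Curry RC auto", [0.10, 0.90, 0.10], 900.0),
    ("Lot of 12 hobby boxes", [0.00, 0.10, 0.95], 300.0),
]

# Hypothetical embedding of a new listing title.
neighbors, suggested_price = knn_price_guidance([0.88, 0.12, 0.02], SOLD, k=2)
```

Here the two LeBron James listings are retrieved as comparables and the suggested price is their median. At production scale the brute-force sort would be replaced by an approximate nearest-neighbor index.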
The key advantage of dense embeddings over traditional text search is handling domain-specific variations. For example, in sports cards, “RC” means “Rookie Card” - a semantic embedding model can learn these equivalences while keyword search fails. Similarly, “signed” and “auto” (for autograph) represent the same concept.
The team uses BERT-based transformer models (specifically encoder-only architectures) for generating embeddings. A critical insight from their experience is that off-the-shelf embedding models trained on general text perform significantly worse than domain-specific models trained on eBay’s data. When you have the GPU resources and sufficient training data, custom training yields substantially better results.
The training approach uses contrastive learning with pairs of similar items. To generate positive pairs (items that should have similar embeddings), the team leverages user behavior data:
For negative examples (items that should have dissimilar embeddings), they use in-batch negatives - a common technique where other items in the training batch serve as negatives. With batch sizes of 64, each positive pair gets 63 implicit negative examples “for free.” This approach is computationally efficient and generally effective since random items are unlikely to be similar.
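As a rough sketch of how in-batch negatives work, the function below computes a softmax cross-entropy (InfoNCE-style) loss over a batch of pairs. The source does not name the exact loss, so the loss form, the `temperature` value, and the function name are assumptions. Row `i` of `anchors` and row `i` of `positives` form a positive pair, and every other row of `positives` serves as a negative for anchor `i`.

```python
import math

def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE-style loss where positives[j != i] act as negatives for anchors[i]."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        # Similarity of anchor i to every positive in the batch (1 true pair, B-1 negatives).
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the "correct class" at index i.
        total += log_denominator - logits[i]
    return total / len(a)

# Toy batch of 3 pairs: well-aligned pairs give a low loss, shuffled pairs a high one.
batch_anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_loss = in_batch_contrastive_loss(batch_anchors, batch_anchors)
shuffled_loss = in_batch_contrastive_loss(batch_anchors, batch_anchors[1:] + batch_anchors[:1])
```

With a batch size of 64, the inner list of logits would have 64 entries per anchor: one positive and 63 negatives, matching the count given above.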
However, the team found that “soft negatives” (random negatives) aren’t sufficient for learning fine-grained distinctions. They implemented hard negative mining: finding items that appear similar but have different prices. For example, cards featuring the same player from the same team but with different grades or conditions. This forces the model to learn the nuanced attributes that actually affect pricing.
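One way to mine such hard negatives is to filter candidates that are close in embedding space but far apart in price. The sketch below assumes this filtering rule; the thresholds `min_sim` and `min_price_ratio`, the item layout, and the toy vectors and prices are all hypothetical, since the source does not describe the mining procedure in detail.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mine_hard_negatives(anchor, candidates, min_sim=0.9, min_price_ratio=2.0):
    """Return titles of candidates that look similar to the anchor but sold at a very different price."""
    hard = []
    for cand in candidates:
        sim = cosine(anchor["vec"], cand["vec"])
        high = max(anchor["price"], cand["price"])
        low = min(anchor["price"], cand["price"])
        if sim >= min_sim and high / low >= min_price_ratio:
            hard.append((sim, cand["title"]))
    # Most similar (hardest) negatives first.
    return [title for _, title in sorted(hard, reverse=True)]

anchor = {"title": "LeBron James PSA 10", "vec": [1.00, 0.20], "price": 1000.0}
candidates = [
    {"title": "LeBron James PSA 6", "vec": [0.98, 0.25], "price": 200.0},   # similar, much cheaper
    {"title": "LeBron James PSA 10 #2", "vec": [1.00, 0.21], "price": 950.0},  # similar, similar price
    {"title": "Lot of 12 boxes", "vec": [0.10, 1.00], "price": 100.0},      # dissimilar: an easy negative
]
hard_negatives = mine_hard_negatives(anchor, candidates)
```

Only the lower-graded card qualifies: the second candidate is effectively a positive, and the third is already far away in embedding space, so it teaches the model little.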
A central finding of this work is a fundamental tension between semantic similarity and pricing accuracy. The team experimented with two distinct approaches:
Approach 1: Semantic Similarity (Siamese Networks). The model is trained purely on semantic similarity using contrastive learning, which produces embeddings where semantically similar items cluster together. In evaluation, however, the team found cases where very similar items had vastly different prices because the model missed pricing-relevant nuances. For example, a “lot of 12” boxes and a “lot of 3” boxes are semantically similar, but the fourfold difference in quantity significantly affects price.
Approach 2: Title-to-Price Prediction. A transformer model is trained to predict the sale price directly from the title text, and embeddings are then extracted from the CLS token of the final layer. This produces embeddings that cluster items by price regardless of semantic content. The approach improved pricing accuracy (a Mean Absolute Error of $29, versus $38 for the semantic similarity model), but it created a trust problem: the model might recommend pricing a LeBron James card based on a Stephen Curry card simply because the two happened to sell at similar prices. To a seller, these are visually and semantically different items, and the comparison is confusing.
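The two failure modes above can be measured side by side. The following sketch computes Mean Absolute Error on price together with a player-mismatch rate for a set of recommendations. The record fields and the toy values are hypothetical, and the mismatch metric is an assumed stand-in for however the team actually scored semantic errors.

```python
def evaluate(recommendations):
    """Return (mean absolute price error, fraction of recommendations with the wrong player)."""
    n = len(recommendations)
    mae = sum(abs(r["predicted_price"] - r["actual_price"]) for r in recommendations) / n
    mismatch_rate = sum(r["query_player"] != r["neighbor_player"] for r in recommendations) / n
    return mae, mismatch_rate

# Toy evaluation set (values are illustrative, not results from the presentation).
toy_recommendations = [
    {"predicted_price": 100.0, "actual_price": 120.0,
     "query_player": "LeBron James", "neighbor_player": "LeBron James"},
    {"predicted_price": 50.0, "actual_price": 40.0,
     "query_player": "LeBron James", "neighbor_player": "Stephen Curry"},
]
mae, mismatch_rate = evaluate(toy_recommendations)
```

Tracking both numbers makes the trade-off visible: a model can lower the first while raising the second.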
The team conducted rigorous evaluation on thousands of test samples:
To address this trade-off, the team developed a multi-task learning architecture that trains on both objectives simultaneously:
The experiments showed a clear continuum: as the weighting parameter alpha shifts toward price prediction, MAE decreases but semantic matching errors (such as identifying the wrong player) increase. Conversely, emphasizing semantic similarity improves matching accuracy but hurts price predictions.
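A common way to expose this kind of trade-off is a convex combination of the two losses. The sketch below assumes that form and assumes alpha is the weight on the price term; the source only states that alpha shifts emphasis between the objectives, so both the formula and the function name are illustrative.

```python
def multitask_loss(price_loss, similarity_loss, alpha):
    """Blend the two objectives; alpha=1.0 is pure price prediction, alpha=0.0 pure similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * price_loss + (1.0 - alpha) * similarity_loss

# Sweep alpha to see how the blended objective moves between the two extremes.
# The loss values 2.0 and 4.0 are arbitrary placeholders.
sweep = {alpha: multitask_loss(2.0, 4.0, alpha) for alpha in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

In practice each alpha in the sweep would correspond to a separately trained model, evaluated on both MAE and semantic matching errors to pick an operating point.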
The system is designed for production deployment with several practical considerations:
eBay’s goal is explicitly not manipulation or profit optimization for the platform - they emphasize the system purely provides data-driven guidance to help sellers make informed pricing decisions based on comparable historical sales.
The presentation offers several actionable insights for practitioners building similar retrieval systems:
This case study demonstrates sophisticated application of embedding-based retrieval in a production e-commerce context, highlighting the practical engineering and modeling decisions required to balance multiple competing objectives.