idealo, a major European price comparison platform, implemented LLM-powered features to enhance product comparison and discovery. They developed two key applications: an intelligent product comparison tool that extracts and compares relevant attributes from extensive product specifications, and a guided product finder that helps users navigate complex product categories. The company focused on using LLMs as language interfaces rather than knowledge bases, relying on proprietary data to prevent hallucinations. They implemented thorough evaluation frameworks and A/B testing to measure business impact.
idealo is a major price comparison platform operating in six European countries, with Germany being their largest market. The company works with over 50,000 shops, including major marketplaces like Amazon and eBay, as well as local retailers, managing over 4 million products and 500 million offers. This case study details their journey in implementing LLM-powered features in production, focusing on practical lessons learned and real-world implementation challenges.
The company approached their LLM implementation with three key learnings that shaped their strategy:
### Focus on User Needs and LLM Strengths
Initially, like many companies, idealo considered building a general shopping assistant chatbot. However, early user testing revealed this wasn't aligned with their users' needs: users preferred not to type extensively or read long text responses. This led them to pivot towards more focused applications that matched both user needs and LLM capabilities.
Their approach to identifying viable use cases involved:
* Mapping LLM capabilities (language understanding, summarization, structured output) to specific user needs
* Prioritizing opportunities based on business impact and LLM capability leverage
* Emphasizing rapid prototyping and early user testing
* Using tools like the AWS Bedrock playground for quick experimentation
### LLMs as Language Interfaces Rather Than Knowledge Bases
A crucial architectural decision was to use LLMs primarily as language interfaces rather than relying on their built-in knowledge. This approach was inspired by ChatGPT's evolution with web search integration, where responses are grounded in retrieved content rather than model knowledge.
For idealo, this meant:
* Using their proprietary product data and expert content as the source of truth
* Having LLMs process and transform this data rather than generate knowledge
* Implementing structured prompting techniques to ensure reliable outputs
* Focusing on data quality and prompt engineering rather than model fine-tuning
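To make the language-interface pattern concrete, the sketch below grounds a question-answering prompt in a single catalog record, using Claude through Amazon Bedrock (the playground mentioned earlier). The model ID, prompt wording, and data shape are illustrative assumptions rather than idealo's actual implementation:

```python
import json
import boto3

# Assumed setup: Claude 2.1 served through Amazon Bedrock, matching the
# Bedrock playground mentioned above. All names here are illustrative.
bedrock = boto3.client("bedrock-runtime")

def answer_from_product_data(question: str, product: dict) -> str:
    """Use the LLM as a language interface: the catalog record is the
    only source of truth, and the model is told not to add knowledge."""
    prompt = (
        "You are a language interface over a product database. Answer "
        "using ONLY the product data below. If the data does not contain "
        "the answer, say that it is not available.\n\n"
        f"Product data:\n{json.dumps(product, ensure_ascii=False)}\n\n"
        f"Question: {question}"
    )
    # Legacy text-completions format used by Claude 2.x models on Bedrock.
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 300,
        "temperature": 0,
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2:1", body=body
    )
    return json.loads(response["body"].read())["completion"]
```

The key property is that the model never answers from memory: everything in its reply is traceable to the injected record, or it must say the information is not available.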
### Practical Implementation Examples
The company implemented two major features:
1. Product Comparison Tool (a prompt sketch follows this list):
* Analyzes complex product specifications (60-70 attributes)
* Intelligently selects the most relevant attributes for comparison
* Understands and highlights meaningful differences between products
* Uses LLMs to determine which differences are improvements (e.g., lower weight being better for tablets)
2. Guided Product Finder (its pipeline is also sketched after the list):
* Transforms expert content into an interactive question flow
* Uses a multi-step prompting pipeline:
  * First prompt determines the next relevant consideration
  * Second prompt maps considerations to available filters
  * Third prompt generates user-friendly questions and answers
* Maintains structured output formats (JSON/XML) for system integration
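The comparison tool can be pictured as a single structured-output prompt over both products' full specification lists. Below is a minimal sketch under assumptions: `call_llm` is a hypothetical helper wrapping whichever provider is in use, and the JSON shape is invented for illustration; neither is idealo's actual code.

```python
import json

def compare_products(product_a: dict, product_b: dict, call_llm) -> dict:
    """Prompt sketch for the comparison tool. `call_llm` is a hypothetical
    helper that sends a prompt to the model and returns its raw text."""
    prompt = (
        "You are comparing two products for a price comparison site.\n"
        "From the attributes below (often 60-70 per product), select the\n"
        "ones most relevant to a purchase decision. For each, state which\n"
        "product is better and why; remember that for some attributes a\n"
        "lower value is better (e.g., a tablet's weight).\n"
        "Use ONLY the supplied values. Respond with JSON shaped like:\n"
        '{"comparisons": [{"attribute": "...", "value_a": "...", '
        '"value_b": "...", "better": "a|b|equal", "reason": "..."}]}\n\n'
        f"Product A: {json.dumps(product_a, ensure_ascii=False)}\n"
        f"Product B: {json.dumps(product_b, ensure_ascii=False)}"
    )
    return json.loads(call_llm(prompt))
```

Returning structured JSON rather than free text is what lets the output plug into the existing comparison UI.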
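The guided finder's three prompts chain naturally into a function where each step's output feeds the next. The steps below mirror the list above; prompt wording and the function signature are again assumptions, reusing the same hypothetical `call_llm` helper.

```python
import json

def next_finder_question(expert_content: str, available_filters: list,
                         answers_so_far: dict, call_llm) -> dict:
    """Sketch of the three-prompt finder pipeline."""
    # Step 1: from the expert content, pick the consideration that
    # matters most given what the user has already answered.
    consideration = call_llm(
        "Given this expert buying advice and the user's answers so far,\n"
        "name the single most important consideration to ask about next.\n"
        f"Advice: {expert_content}\n"
        f"Answers so far: {json.dumps(answers_so_far)}"
    )
    # Step 2: map the consideration onto a filter the catalog actually
    # supports, so every question remains answerable with real data.
    filter_name = call_llm(
        f"Consideration: {consideration}\n"
        f"Available filters: {json.dumps(available_filters)}\n"
        "Return the one filter that best captures this consideration."
    )
    # Step 3: phrase it as a user-friendly question with answer options,
    # in structured JSON the frontend can render directly.
    return json.loads(call_llm(
        f"Filter: {filter_name}\nConsideration: {consideration}\n"
        'Respond with JSON: {"question": "...", "options": ["..."]}'
    ))
```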
### Evaluation Framework
idealo emphasized the importance of robust evaluation frameworks from the start of any LLM project. Their evaluation strategy included:
* Rule-based validation (e.g., checking that product values match the source data; see the sketch after this list)
* Traditional metrics where applicable (precision/recall for entity recognition)
* LLM-as-judge approach for qualitative assessments
* Comprehensive prompt evaluation systems
* A/B testing for business impact measurement
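The rule-based layer can be as simple as checking that every value the model cites appears verbatim in the source record, which directly targets the hallucination risk discussed throughout. A minimal sketch, reusing the hypothetical comparison output shape from earlier:

```python
def validate_comparison(result: dict, product_a: dict, product_b: dict) -> list:
    """Rule-based hallucination check: every value the model cites must
    match the source record. Field names follow the hypothetical
    comparison output sketched earlier."""
    errors = []
    for row in result.get("comparisons", []):
        attr = row["attribute"]
        if str(row["value_a"]) != str(product_a.get(attr)):
            errors.append(f"{attr}: value_a {row['value_a']!r} not in source data")
        if str(row["value_b"]) != str(product_b.get(attr)):
            errors.append(f"{attr}: value_b {row['value_b']!r} not in source data")
    return errors  # empty list means the output passed the rule check
```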
### Technical Implementation Details
The technical implementation included several sophisticated approaches:
* Structured output enforcement using JSON schema validation (see the sketch after this list)
* Integration with multiple LLM providers (Claude 2.1, GPT-4)
* Careful prompt engineering to prevent hallucinations
* Custom evaluation prompts for quality assurance
* Integration with existing product database and filtering systems
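For the structured output enforcement in the first bullet, a common pattern is to validate each model response against a JSON schema before it reaches downstream systems, retrying the prompt on failure. A minimal sketch using the `jsonschema` library; the schema itself is an assumed example, not idealo's actual schema:

```python
import json
from jsonschema import ValidationError, validate  # pip install jsonschema

# Assumed schema for the finder question output; illustrative only.
QUESTION_SCHEMA = {
    "type": "object",
    "required": ["question", "options"],
    "properties": {
        "question": {"type": "string"},
        "options": {"type": "array", "items": {"type": "string"}, "minItems": 2},
    },
}

def parse_llm_json(raw: str, schema: dict = QUESTION_SCHEMA) -> dict | None:
    """Return the parsed response only if it is valid JSON matching the
    schema; otherwise return None so the caller can retry the prompt."""
    try:
        data = json.loads(raw)
        validate(instance=data, schema=schema)
        return data
    except (json.JSONDecodeError, ValidationError):
        return None
```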
### Results and Validation
The company implemented rigorous validation processes:
* Manual review of hundreds of examples for hallucination detection
* Comparative testing between different LLM versions (e.g., Claude 2 vs 2.1)
* Live A/B testing to measure business impact
* Monitoring of business KPIs including bounce rates and conversion rates
### Challenges and Lessons
Key challenges included:
* Initial difficulty in evaluating fuzzy outputs like question quality
* Need for extensive prompt engineering to ensure reliable structured outputs
* Balancing between model capabilities and practical business needs
* Managing the transition from prototype to production-ready features
The case study demonstrates a mature approach to LLM implementation in production, emphasizing the importance of user needs, data quality, and robust evaluation frameworks. Their success came from treating LLMs as tools within a larger system rather than complete solutions, and maintaining strong focus on measurable business impact.