Company
eBay
Title
Developing and Deploying Domain-Adapted LLMs for E-commerce Through Continued Pre-training
Industry
E-commerce
Year
2025
Summary (short)
eBay tackled the challenge of incorporating LLMs into their e-commerce platform by developing e-Llama, a domain-adapted version of Llama 3.1. Through continued pre-training on a mix of e-commerce and general-domain data, they created 8B and 70B parameter models that achieved a 25% improvement on English e-commerce tasks (30% on non-English) while largely preserving general performance. Training was completed in about a month on 480 NVIDIA H100 GPUs and resulted in production-ready models aligned with human feedback and safety requirements.
eBay's journey into deploying Large Language Models (LLMs) in production is a significant case study in practical LLMOps for e-commerce. The company faced the challenge many enterprises encounter: how to leverage the power of LLMs while maintaining cost-effectiveness, data security, and domain specificity. Their solution demonstrates a pragmatic hybrid approach that combines custom-built models with adapted open-source ones.

The case study centers on their development of e-Llama, a domain-adapted version of Meta's Llama 3.1, which exemplifies a practical approach to LLMOps in a large-scale e-commerce environment. The project is particularly noteworthy because it balances the speed-to-market advantage of starting from an existing model against the need for domain-specific optimization.

**Infrastructure and Technical Implementation**

The technical implementation showcases several key aspects of production-grade LLM training:

* Hardware Infrastructure:
  * 60 nodes with 8 NVIDIA H100 80GB GPUs each
  * 480 GPUs in total, interconnected via NVIDIA NVLink and InfiniBand
  * Integration with eBay's existing compute platform
* Training Optimization:
  * 3D parallelism (data, tensor, and pipeline parallel)
  * Distributed optimizer states
  * flash-attention-2 for improved efficiency
  * Efficient throughput: roughly 1 trillion tokens processed in about one month (340k GPU-hours)

**Data Engineering and Processing**

The data strategy reflects mature LLMOps practice:

* Multiple Data Source Integration:
  * Public eBay listings and product reviews
  * Curated open-source datasets
  * High-quality proprietary datasets
  * 10% non-English general-domain data
* Data Processing Pipeline:
  * E-commerce classifiers for data filtering
  * Careful serialization for autoregressive language modeling
  * Balanced mixing of domain-specific and general-domain data at a 1:1 ratio to guard against catastrophic forgetting, as sketched below

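The write-up does not include eBay's data-loading code, but the 1:1 mixing policy is straightforward to sketch. The minimal Python below is illustrative only: the stream names (`ecommerce_stream`, `general_stream`) and the document-level round-robin interleaving are assumptions, not details confirmed by the case study.

```python
from typing import Iterable, Iterator

def mix_streams(
    ecommerce_stream: Iterable[str],
    general_stream: Iterable[str],
) -> Iterator[str]:
    """Interleave two document streams at a 1:1 ratio.

    Alternating one document from each source approximates the balanced
    domain/general mix described above, which is what guards against
    catastrophic forgetting during continued pre-training.
    """
    for ecommerce_doc, general_doc in zip(ecommerce_stream, general_stream):
        yield ecommerce_doc  # domain-specific sample
        yield general_doc    # general-domain sample

# Hypothetical usage; real inputs would be the filtered, serialized corpora.
if __name__ == "__main__":
    listings = iter(["<eBay listing 1>", "<eBay listing 2>"])
    web_docs = iter(["<general doc 1>", "<general doc 2>"])
    for doc in mix_streams(listings, web_docs):
        print(doc)
```

In a real pipeline the ratio would more likely be enforced at the token level than the document level, but the principle is the same.
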
**Model Training and Optimization**

The training methodology shows careful consideration of production requirements:

* Hyperparameter Choices:
  * Peak learning rate set to 10% of the original Llama 3.1 maximum
  * Cosine learning-rate schedule with warmup
  * Batch size of 11.8 million tokens
  * 85k update steps in total (see the budget check below)
* Performance Monitoring:
  * Domain-specific improvements tracked: 25% on English and 30% on non-English e-commerce benchmarks
  * General-domain performance monitored throughout, with only about 1% degradation
  * Regular benchmark evaluations

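The published numbers are internally consistent, and the schedule is easy to reconstruct in outline. In the sketch below, the token and GPU-hour figures come straight from the case study, while `PEAK_LR` and `WARMUP_STEPS` are illustrative assumptions (the post says only that the peak was 10% of the original maximum and that warmup was used):

```python
import math

# Figures reported in the case study.
TOKENS_PER_STEP = 11_800_000   # global batch size, in tokens
TOTAL_STEPS = 85_000           # update steps
NUM_GPUS = 480                 # 60 nodes x 8 H100 GPUs
TRAINING_DAYS = 30             # "about one month"

# Illustrative assumptions; the post does not give these values.
PEAK_LR = 3e-5                 # e.g. 10% of a 3e-4 base maximum
WARMUP_STEPS = 2_000

def lr_at_step(step: int) -> float:
    """Cosine decay with linear warmup, the schedule shape named in the post."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

# Sanity-check the published budget: ~1 trillion tokens, ~340k GPU-hours.
total_tokens = TOKENS_PER_STEP * TOTAL_STEPS    # 1.003e12
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24       # 345,600
print(f"{total_tokens:.2e} tokens, {gpu_hours:,} GPU-hours")
```

Multiplying the batch size by the step count recovers the stated 1-trillion-token budget, and 480 GPUs running for roughly 30 days matches the reported 340k GPU-hours.
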

**Production Deployment Considerations**

The case study highlights several critical aspects of moving the models into production:

* Safety and Reliability:
  * Post-training instruction tuning
  * Alignment with human feedback
  * Content safety guardrails
  * Explicit instruction-following capabilities
* Scalability and Efficiency:
  * Efficient resource utilization compared to full base-model training
  * A choice of model sizes (8B and 70B parameters)
  * Optimization for multi-market deployment

**Critical Analysis and Lessons Learned**

The case study offers several valuable insights for LLMOps practitioners:

* Hybrid Approach Benefits:
  * Demonstrates the value of combining custom and adapted models
  * Shows how to leverage existing models while adding domain-specific value
  * Illustrates a practical balance between development speed and customization
* Technical Tradeoffs:
  * Careful handling of catastrophic forgetting through data mixing
  * Balance between domain adaptation and general capabilities
  * Resource optimization for large-scale training
* Production Considerations:
  * The importance of safety and alignment in production systems
  * The need for efficient scaling strategies
  * The value of maintaining multilingual capabilities

**Future Implications**

The case study suggests several important trends for LLMOps:

* The viability of adapting open-source models for specific domains
* The importance of efficient training infrastructure
* The need for balanced approaches to model development and deployment
* The value of maintaining general capabilities while adding domain-specific knowledge

This implementation demonstrates a sophisticated understanding of LLMOps requirements in production, showing how a large enterprise can practically implement and scale LLM solutions while maintaining control over costs, performance, and safety. The careful attention to infrastructure, data processing, and model optimization offers valuable guidance for organizations looking to deploy LLMs in production environments.
