Company
Mercari
Title
Fine-Tuning and Quantizing LLMs for Dynamic Attribute Extraction
Industry
E-commerce
Year
2024
Summary (short)
Mercari tackled the challenge of extracting dynamic attributes from user-generated marketplace listings by fine-tuning a 2B-parameter LLM using QLoRA. The team created a model that outperformed GPT-3.5-turbo while being 95% smaller and 14 times more cost-effective. The implementation included careful dataset preparation, parameter-efficient fine-tuning, and post-training quantization using llama.cpp, resulting in a production-ready model with better control over hallucinations.
# Fine-Tuning and Deploying LLMs at Mercari for Attribute Extraction

## Overview and Business Context

Mercari, a Japanese C2C marketplace platform, faced the challenge of accurately extracting specific attributes from user-generated listing descriptions. The project aimed to improve listing quality and user experience by better understanding seller-provided content across multiple languages and categories.

## Technical Challenges and Requirements

- Handle dynamically specified attributes that change frequently
- Process multilingual content (primarily Japanese, but also English and Chinese)
- Maintain cost-effectiveness compared to commercial LLM APIs
- Control and minimize hallucinations
- Scale to production-level request volumes

## Solution Architecture and Implementation

### Data Preparation and Model Selection

- Created a dataset focusing on the top 20 marketplace categories
- Structured training data with specific prompts in multiple languages
- Selected gemma-2b-it as the base model after evaluating Japanese LM capabilities via the Nejumi Leaderboard

### Fine-Tuning Process

- Utilized QLoRA for parameter-efficient fine-tuning on a single A100 GPU (80 GB)

### Technical Implementation Details

- Applied LoRA adapters to selected target modules of the base model
- Defined training configurations for the fine-tuning run

### Post-Training Optimization

- Implemented post-training quantization using llama.cpp
- Achieved a 95% reduction in model size

## Production Deployment and Performance

### Evaluation Metrics

- BLEU score comparison with GPT-3.5-turbo
- Model size and inference latency measurements
- Cost analysis per request
- Quality of extracted attributes

### Key Results

- Achieved a BLEU score more than 5 percentage points higher than GPT-3.5-turbo-0125
- Reduced model size by 95% compared to the base model
- Estimated a 14x cost reduction compared to GPT-3.5-turbo
- Better control over hallucinations through fine-tuning

## MLOps and Production Considerations

### Infrastructure

- Utilized GCP A100 GPU instances for training
- Implemented efficient model serving using quantized formats
- Integrated with existing marketplace systems

### Monitoring and Quality Control

- Continuous evaluation of extraction accuracy
- Performance monitoring across different languages
- Cost tracking and optimization

### Data Pipeline

- Structured data collection from marketplace listings
- Prompt template management
- Training data versioning through W&B

## Best Practices and Learnings

### Model Development

- Careful selection of the base model based on language capabilities
- Efficient use of QLoRA for fine-tuning
- Importance of proper prompt engineering
- Balance between model size and performance

### Production Deployment

- Effective use of quantization for deployment
- Integration of monitoring and evaluation systems
- Cost-effective alternatives to commercial APIs

### Future Improvements

- Potential for multilingual model expansion
- Continuous model updates based on new marketplace categories
- Further optimization of inference performance

## Technical Dependencies and Tools

- HuggingFace Transformers
- QLoRA fine-tuning framework
- llama.cpp for quantization
- Weights & Biases for experiment tracking
- Google Cloud Platform for infrastructure
- Custom prompt engineering framework

## Documentation and Templates

- Standardized prompt templates for different languages
- Model evaluation protocols
- Deployment and serving configurations
- Monitoring and alerting setup

This case study demonstrates a practical approach to implementing LLMs in production for specific business needs, balancing performance, cost, and maintainability. The success in achieving better performance than larger commercial models while significantly reducing costs showcases the value of careful fine-tuning and optimization in real-world LLM applications.
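The QLoRA setup described in the case study can be sketched with HuggingFace Transformers and peft. This is a configuration sketch only: the LoRA rank, alpha, dropout, and target-module list are illustrative assumptions, not Mercari's published values.

```python
# Sketch of a QLoRA fine-tuning setup for gemma-2b-it.
# Hyperparameters below are assumptions, not Mercari's actual configuration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization so the base model trains comfortably on one A100
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapters on the attention and MLP projection layers
lora_config = LoraConfig(
    r=16,              # rank (assumed)
    lora_alpha=32,     # scaling factor (assumed)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trained
```

With this setup, only the small LoRA adapter matrices receive gradients while the quantized base weights stay frozen, which is what makes single-GPU fine-tuning of the 2B model practical.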
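Because the attributes to extract are specified dynamically, they are passed in at prompt time rather than baked into the model. A minimal prompt builder might look like the following; the template wording and attribute names are hypothetical, not Mercari's actual prompts.

```python
# Hypothetical prompt builder for dynamic attribute extraction.
TEMPLATE = (
    "Extract the following attributes from the listing below.\n"
    "Attributes: {attributes}\n"
    "Return a JSON object with one key per attribute; use null when an "
    "attribute is not mentioned.\n\n"
    "Listing:\n{listing}\n"
)

def build_prompt(listing: str, attributes: list[str]) -> str:
    """Render one training/inference prompt for a dynamic attribute set."""
    return TEMPLATE.format(attributes=", ".join(attributes), listing=listing)

prompt = build_prompt(
    "Nintendo Switch console, gray, barely used, original box included.",
    ["brand", "color", "condition"],
)
print(prompt)
```

Keeping the attribute list in the prompt means new attributes can be requested for new categories without retraining, which matches the "dynamically specified attributes" requirement above.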
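The case study credits fine-tuning with better control over hallucinations; a complementary safeguard at serving time is to reject extracted values that are not grounded in the listing text. The sketch below assumes the model emits JSON and uses a simple substring check — both are illustrative assumptions, not Mercari's described method.

```python
import json

def grounded_attributes(model_output: str, listing: str) -> dict:
    """Parse the model's JSON output and drop string values that do not
    appear in the listing text -- a cheap guard against hallucinations."""
    try:
        extracted = json.loads(model_output)
    except json.JSONDecodeError:
        return {}  # malformed output: extract nothing rather than guess
    listing_lower = listing.lower()
    return {
        key: value
        for key, value in extracted.items()
        if isinstance(value, str) and value.lower() in listing_lower
    }

listing = "Levi's 501 jeans, dark blue, size 32, worn twice."
output = '{"brand": "Levi\'s", "color": "dark blue", "material": "denim"}'
print(grounded_attributes(output, listing))
# {'brand': "Levi's", 'color': 'dark blue'}  -- "denim" is dropped
```

A substring check is deliberately conservative: it trades some recall (e.g. paraphrased values) for a hard guarantee that surviving values occur verbatim in the source listing.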
