Company
GoDaddy
Title
Scaling Product Categorization with Batch Inference and Prompt Engineering
Industry
E-commerce
Year
2025
Summary (short)
GoDaddy sought to improve its product categorization system, which used Meta Llama 2 to generate categories for 6 million products but suffered from incomplete or mislabeled categories and high costs. They implemented a new solution using Amazon Bedrock's batch inference capabilities with Claude and Llama 2 models, achieving 97% category coverage (exceeding their 90% target), 80% faster processing time, and an 8% cost reduction while maintaining high-quality categorization as verified by subject matter experts.
GoDaddy, a major domain registrar and web hosting company serving over 21 million customers, needed to enhance their product categorization system to improve customer experience. Their existing system used Meta Llama 2 to generate categories for 6 million products but faced challenges with incomplete or mislabeled categories and high operational costs. This case study demonstrates how they implemented a scalable, cost-effective solution using batch inference capabilities in Amazon Bedrock.

The solution architecture leverages several AWS services in an integrated workflow:

* Amazon Bedrock for batch processing using both Meta Llama 2 and Anthropic's Claude models
* Amazon S3 for storing product data and inference outputs
* AWS Lambda for orchestrating the model operations
* LangChain's PydanticOutputParser for structured output parsing

The team implemented several key LLMOps practices and techniques to optimize the system.

**Batch Processing Implementation**

The solution uses Amazon Bedrock's CreateModelInvocationJob API for batch processing, which allows multiple inference requests to be processed asynchronously. The system maintains job status tracking through GetModelInvocationJob API calls, handling states such as Submitted, InProgress, Failed, Completed, and Stopped. This enables efficient monitoring and management of large-scale inference jobs (a job orchestration sketch follows this section).

**Prompt Engineering Optimization**

The team experimented extensively with prompt engineering techniques to improve output quality and consistency (a packed-prompt sketch follows this section):

* Clear instruction formatting with consistent separator characters
* Few-shot prompting with 0-10 examples tested
* N-packing technique to combine multiple SKUs in a single query
* Model-specific optimizations for both Claude and Llama 2
* JSON format instructions for structured outputs

**Output Parsing and Validation**

They implemented robust parsing using LangChain's PydanticOutputParser with a defined schema for consistent output structure. The system includes output-fixing capabilities to handle parsing failures and maintain high-quality results (a parsing sketch follows this section).
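The case study references the CreateModelInvocationJob and GetModelInvocationJob APIs without showing code. A minimal sketch of that batch workflow using boto3 might look like the following; the job name, role ARN, S3 URIs, and region are placeholders, and in production the orchestration runs from Lambda rather than a local polling loop.

```python
import time
import boto3

# Control-plane client for Amazon Bedrock batch (model invocation) jobs.
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Submit a batch job: the input is a JSONL file of model requests in S3,
# and results are written back to S3 asynchronously.
response = bedrock.create_model_invocation_job(
    jobName="product-categorization-batch-001",  # placeholder job name
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",  # placeholder role
    modelId="anthropic.claude-instant-v1",
    inputDataConfig={
        "s3InputDataConfig": {"s3Uri": "s3://example-bucket/batch-input/products.jsonl"}
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://example-bucket/batch-output/"}
    },
)
job_arn = response["jobArn"]

# Poll the job until it reaches a terminal state.
terminal_states = {"Completed", "Failed", "Stopped"}
while True:
    job = bedrock.get_model_invocation_job(jobIdentifier=job_arn)
    status = job["status"]
    print(f"Job status: {status}")
    if status in terminal_states:
        break
    time.sleep(60)
```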
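The actual prompt templates are not reproduced in the write-up. The sketch below is a hypothetical illustration of the n-packing and few-shot ideas described under Prompt Engineering Optimization; the separator characters, example content, and field names are chosen purely for illustration.

```python
SEPARATOR = "####"  # consistent separator between prompt sections (assumed choice)

FEW_SHOT_EXAMPLES = [
    # Hypothetical few-shot examples; the team tested between 0 and 10 per prompt.
    {"product_name": "example-domain-privacy", "category": "Domain Add-On"},
]


def build_packed_prompt(skus: list, n_pack: int = 5) -> str:
    """Pack up to `n_pack` SKUs into a single categorization request."""
    examples = "\n".join(
        f"Product: {ex['product_name']}\nCategory: {ex['category']}"
        for ex in FEW_SHOT_EXAMPLES
    )
    products = "\n".join(
        f"{i + 1}. {sku['product_name']}" for i, sku in enumerate(skus[:n_pack])
    )
    return (
        f"{SEPARATOR}\nInstructions\n{SEPARATOR}\n"
        "Assign a category to each product below. Respond with a JSON list of "
        'objects with keys "product_name" and "category".\n'
        f"{SEPARATOR}\nExamples\n{SEPARATOR}\n{examples}\n"
        f"{SEPARATOR}\nProducts\n{SEPARATOR}\n{products}\n"
    )


# Usage: one request now covers several SKUs instead of one.
prompt = build_packed_prompt(
    [{"product_name": "example-ssl-certificate"}, {"product_name": "example-email-plan"}]
)
```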
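For the Output Parsing and Validation step, a minimal sketch of schema-based parsing with LangChain's PydanticOutputParser, plus an OutputFixingParser fallback for malformed generations, could look like this. The schema fields are assumptions, since the real schema is not published.

```python
from langchain.output_parsers import OutputFixingParser, PydanticOutputParser
from pydantic import BaseModel, Field


class ProductCategory(BaseModel):
    # Hypothetical schema; GoDaddy's actual field set is not published.
    product_name: str = Field(description="Product name exactly as provided")
    category: str = Field(description="Assigned product category")


parser = PydanticOutputParser(pydantic_object=ProductCategory)

# The format instructions are appended to the prompt so the model emits JSON
# matching the schema.
format_instructions = parser.get_format_instructions()

raw_output = '{"product_name": "example-ssl-certificate", "category": "Website Security"}'
result = parser.parse(raw_output)  # validated ProductCategory instance

# If parsing fails (e.g. the model wraps JSON in extra text), an OutputFixingParser
# can ask an LLM to repair the output before re-parsing:
# fixing_parser = OutputFixingParser.from_llm(parser=parser, llm=some_bedrock_llm)
# result = fixing_parser.parse(raw_output)
```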
**Evaluation Framework**

A comprehensive evaluation framework was developed, measuring (a minimal metrics sketch appears at the end of this case study):

* Content coverage (missing values in generation)
* Parsing coverage (missing samples in format parsing)
* Parsing recall and precision on product names
* Final coverage across both generation and parsing
* Human evaluation for qualitative assessment

**Performance Results**

The implementation achieved significant improvements:

* Processing speed: 5,000 products in 12 minutes (80% faster than the requirement)
* Accuracy: 97% category coverage on both the 5k and 100k test sets
* Cost: 8% more affordable than the previous Llama 2 13B solution
* Quality: high satisfaction in human evaluation by subject matter experts

**Model Selection and Optimization**

The team's testing revealed that Claude Instant with zero-shot prompting provided the best overall performance:

* Superior latency, cost, and accuracy metrics
* Better generalizability with higher packing numbers
* More efficient prompt handling requiring shorter inputs
* Higher output length limits enabling more efficient batching

**Technical Challenges and Solutions**

* JSON parsing optimization through prompt engineering reduced latency by 77%
* Implementation of n-packing techniques to improve throughput
* Model-specific prompt templates for optimal performance
* Careful balancing of few-shot example counts against costs

**Security and Infrastructure**

The solution incorporates AWS best practices for security across all components:

* Secure S3 storage for data and outputs
* IAM roles and permissions for Lambda functions
* Secure API access to Amazon Bedrock
* Monitoring and logging capabilities

**Future Improvements**

The team identified several areas for future enhancement:

* Dataset expansion for better ground truth
* Increased human evaluation coverage
* Potential model fine-tuning when more data is available
* Implementation of automatic prompt engineering
* Integration with knowledge bases to reduce hallucinations

This case study demonstrates a successful implementation of LLMOps practices in a production environment, showing how careful attention to prompt engineering, batch processing, and evaluation can create a scalable, cost-effective solution. The project achieved significant improvements in accuracy and efficiency while maintaining high-quality outputs, providing a blueprint for similar large-scale LLM implementations.
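To make the evaluation framework above concrete, the sketch below shows one plausible way to compute the coverage and precision/recall metrics it lists. The exact definitions GoDaddy used are not given in the case study, so these formulas are illustrative assumptions.

```python
from typing import Optional


def content_coverage(generations: list) -> float:
    """Fraction of requests that produced a non-empty generation (assumed definition)."""
    return sum(1 for g in generations if g) / len(generations)


def parsing_coverage(generations: list, parsed: list) -> float:
    """Fraction of non-empty generations that parsed into the expected schema."""
    generated = [g for g in generations if g]
    return sum(1 for p in parsed if p is not None) / len(generated)


def name_precision_recall(parsed: list, requested_names: list) -> tuple:
    """Precision/recall of product names echoed back by the model vs. names requested."""
    returned = {p["product_name"] for p in parsed if p}
    requested = set(requested_names)
    true_positives = len(returned & requested)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(requested) if requested else 0.0
    return precision, recall


# "Final coverage" combines generation and parsing, e.g.:
# final_coverage = content_coverage(gens) * parsing_coverage(gens, parsed)
```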
