Company
Square
Title
RoBERTa for Large-Scale Merchant Classification
Industry
Finance
Year
2025
Summary (short)
Square developed and deployed a RoBERTa-based merchant classification system to accurately categorize millions of merchants across their platform. The system replaced unreliable self-selection methods with an ML approach that combines business names, self-selected information, and transaction data to achieve a 30% improvement in accuracy. The solution runs daily predictions at scale using distributed GPU infrastructure and has become central to Square's business metrics and strategic decision-making.
This case study examines Square's implementation of a large-scale merchant classification system using the RoBERTa language model. The problem Square faced was significant: with tens of millions of merchants on their platform, accurate business categorization was crucial for everything from product development to compliance and financial forecasting. Their existing approach relied heavily on merchant self-selection during onboarding, which proved highly unreliable due to factors like rushed sign-ups and unclear category definitions.

The technical implementation provides an excellent example of putting large language models into production at scale. The team made several key architectural and operational decisions that are worth examining.

**Data Quality and Preprocessing**

The system is built on a foundation of high-quality training data, with over 20,000 manually reviewed merchant classifications. The preprocessing pipeline includes several sophisticated steps (sketched in code below):

* Removal of auto-created services that weren't modified by merchants, to reduce noise
* Ranking of catalog items by purchase frequency to focus on the most relevant items
* Careful prompt formatting to provide clear context to the model
* Conversion of business categories to numerical encodings for model processing

**Model Architecture and Training**

The team chose RoBERTa (Robustly Optimized BERT Pretraining Approach) as their base model, implemented using Hugging Face's transformers library. The training process demonstrates several production-focused optimizations (see the fine-tuning sketch below):

* Use of Databricks GPU-enabled clusters for training
* Memory optimization techniques, including FP16 precision and gradient checkpointing
* Careful hyperparameter selection, including learning rate, warmup period, and batch size
* Integration with modern ML infrastructure tools like Hugging Face's Trainer class
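To make the preprocessing steps concrete, here is a minimal sketch. The field names (`business_name`, `self_selected_category`, `catalog_items`, `reviewed_category`) and the category list are hypothetical stand-ins, since Square's actual schema and taxonomy are not public:

```python
import pandas as pd

# Illustrative subset of business categories; the real taxonomy is larger.
CATEGORIES = ["food_and_drink", "retail", "beauty_and_wellness", "professional_services"]
LABEL2ID = {name: idx for idx, name in enumerate(CATEGORIES)}

def build_input_text(row: pd.Series, max_items: int = 10) -> str:
    """Format a merchant's signals into a single prompt-style string."""
    # Drop auto-created catalog items the merchant never modified (noise).
    items = [it for it in row["catalog_items"]
             if not (it["auto_created"] and not it["modified"])]
    # Rank remaining items by purchase frequency; keep only the top few.
    items = sorted(items, key=lambda it: it["purchase_count"], reverse=True)
    item_names = ", ".join(it["name"] for it in items[:max_items])
    return (f"Business name: {row['business_name']}. "
            f"Self-selected category: {row['self_selected_category']}. "
            f"Top items: {item_names}.")

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Produce the (text, label) pairs used for fine-tuning."""
    out = pd.DataFrame()
    out["text"] = df.apply(build_input_text, axis=1)
    # Convert category names to integer encodings for the model.
    out["label"] = df["reviewed_category"].map(LABEL2ID)
    return out.dropna(subset=["label"]).astype({"label": int})
```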
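And a corresponding fine-tuning sketch using Hugging Face's `Trainer`, with the FP16 and gradient-checkpointing options mentioned above. The hyperparameter values are illustrative defaults rather than Square's actual settings, and `train_df` is assumed to be the output of `preprocess()` from the previous sketch:

```python
from datasets import Dataset
from transformers import (DataCollatorWithPadding, RobertaForSequenceClassification,
                          RobertaTokenizerFast, Trainer, TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(LABEL2ID))  # LABEL2ID from the preprocessing sketch

# train_df has "text" and "label" columns (output of preprocess() above).
train_ds = (Dataset.from_pandas(train_df)
            .map(lambda batch: tokenizer(batch["text"], truncation=True,
                                         max_length=256),
                 batched=True))

args = TrainingArguments(
    output_dir="merchant-classifier",
    learning_rate=2e-5,               # typical fine-tuning rate for RoBERTa
    warmup_ratio=0.1,                 # the warmup period mentioned above
    per_device_train_batch_size=32,
    num_train_epochs=3,
    fp16=True,                        # half precision to reduce GPU memory
    gradient_checkpointing=True,      # trade recompute for activation memory
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding per batch
)
trainer.train()
trainer.save_model("merchant-classifier")
```

Gradient checkpointing recomputes activations during the backward pass instead of storing them all, cutting memory at the cost of extra compute; combined with FP16 it allows larger batches on a single GPU.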
**Production Inference System**

The production deployment is particularly noteworthy for its focus on scalability and efficiency. The system needs to generate predictions for tens of millions of merchants daily, and the team implemented several sophisticated optimizations (sketched in code after this list):

* Multiple GPU workers for parallel processing
* PySpark integration for distributed computation
* Batch size optimization for GPU utilization
* An incremental prediction system that only processes changed records
* A two-table output system with both historical and latest predictions for different use cases
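A minimal sketch of what such a daily scoring job can look like in PySpark: a pandas UDF loads the fine-tuned model once per worker, scores merchants in GPU-sized batches, and is applied only to records that changed since the last run. The table and column names (`merchant_features`, `model_input_text`, `updated_at`, `last_scored_at`) are hypothetical:

```python
import pandas as pd
import torch
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("string")
def predict_category(texts: pd.Series) -> pd.Series:
    # Load the model once per Python worker and reuse it across batches.
    global _model, _tok
    if "_model" not in globals():
        from transformers import (RobertaForSequenceClassification,
                                  RobertaTokenizerFast)
        _tok = RobertaTokenizerFast.from_pretrained("merchant-classifier")
        _model = RobertaForSequenceClassification.from_pretrained("merchant-classifier")
        _model.eval()
        if torch.cuda.is_available():
            _model.cuda()
    device = next(_model.parameters()).device
    preds = []
    batch_size = 256  # tuned for GPU memory and utilization
    for start in range(0, len(texts), batch_size):
        chunk = texts.iloc[start:start + batch_size].tolist()
        enc = _tok(chunk, truncation=True, padding=True,
                   max_length=256, return_tensors="pt").to(device)
        with torch.no_grad():
            ids = _model(**enc).logits.argmax(dim=-1).tolist()
        preds.extend(_model.config.id2label[i] for i in ids)
    return pd.Series(preds, index=texts.index)

# Incremental scoring: only merchants whose inputs changed since the last run.
changed = spark.table("merchant_features").where("updated_at > last_scored_at")
scored = changed.withColumn("predicted_category",
                            predict_category("model_input_text"))
```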
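The two-table output can then be a plain append into a history table plus an upsert into a latest table. This sketch assumes Delta Lake, the default table format on Databricks; the table names and `merchant_id` key are again hypothetical, and `scored` and `spark` come from the inference sketch above:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

scored_now = scored.withColumn("scored_at", F.current_timestamp())

# 1) History table: append-only, one row per merchant per run,
#    which is what later drift analysis reads.
(scored_now.write.format("delta")
 .mode("append").saveAsTable("merchant_category_history"))

# 2) Latest table: exactly one row per merchant, updated in place.
latest = DeltaTable.forName(spark, "merchant_category_latest")
(latest.alias("t")
 .merge(scored_now.alias("s"), "t.merchant_id = s.merchant_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```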
**Monitoring and Evaluation**

The system includes several production monitoring capabilities:

* Historical prediction storage for tracking model drift
* Regular accuracy evaluations across business categories
* Performance tracking against the previous self-selection system
* Daily refresh of the business metrics powered by the model

**Technical Results and Impact**

The implementation achieved significant technical improvements:

* A 30% overall improvement in classification accuracy
* Category-specific improvements ranging from 13% to 38%
* Successful daily processing of tens of millions of merchants
* Integration with business metrics and strategic planning systems

From an LLMOps perspective, this case study demonstrates several best practices:

* Starting with high-quality, manually verified training data
* Careful attention to data preprocessing and cleaning
* Use of modern ML infrastructure (Databricks, Hugging Face)
* Focus on production-grade inference optimization
* Implementation of comprehensive monitoring systems
* Clear metrics for measuring business impact

The case also illustrates some important limitations and considerations in production LLM systems:

* The need for significant GPU infrastructure for both training and inference
* The importance of memory optimization techniques for large models
* Trade-offs between batch size and processing efficiency
* The complexity of maintaining daily predictions for tens of millions of merchants

The solution shows how traditional classification tasks can be improved using modern LLM architectures, while also highlighting the infrastructure and optimization requirements for running such systems at scale. The team's focus on practical concerns like inference optimization and incremental updates demonstrates a mature understanding of production ML system requirements.

Particularly noteworthy is the system's business impact: it has become a central part of Square's operations, powering everything from product strategy to marketing campaigns. This shows how well-implemented LLM systems can move beyond experimental use cases to become critical business infrastructure. Future developments mentioned include potential applications in payment processing optimization, suggesting that the team sees opportunities to expand the system's use cases once the core implementation has proven stable and reliable.