Faire, a wholesale marketplace, evolved their ML model deployment infrastructure from a monolithic approach to a streamlined platform. Initially struggling with slow deployments, limited testing, and complex workflows across multiple systems, they developed an internal Machine Learning Model Management (MMM) tool that unified model deployment processes. This transformation reduced deployment time from 3+ days to 4 hours, enabled safe deployments with comprehensive testing, and improved observability while supporting various ML workloads including LLMs.
This case study examines how Faire, a wholesale marketplace focused on enabling brick-and-mortar retailers, evolved their machine learning infrastructure over two years to better support various ML workloads including search, discovery, financial products, and user/content safety.
The journey of their ML platform team provides valuable insights into the challenges and solutions for scaling ML operations in a growing e-commerce company. The core focus areas they identified were:
* Enabling new capabilities
* Improving productivity
* Ensuring stability
* Maintaining observability
Initially, Faire's ML deployment process was straightforward but limiting: data scientists would save models to S3, modify code in the core backend monolith, and register features in their feature store (a minimal sketch of this flow follows the list below). While simple, this approach presented several challenges:
* Handling varying workloads and traffic patterns became difficult
* Supporting new architectures and frameworks was complicated
* Developer velocity was slow with limited testing capabilities
* Version management and model lineage tracking were problematic
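For concreteness, a minimal sketch of that original three-step flow might have looked something like the following; the bucket name, artifact paths, feature names, and the `register_features` helper are illustrative placeholders rather than Faire's actual code.

```python
import boto3

# Step 1 (illustrative): persist the trained model artifact to S3.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="ranker-v1.tar.gz",              # local model artifact
    Bucket="example-ml-models",               # placeholder bucket name
    Key="search/ranker/v1/model.tar.gz",
)

# Step 2 (illustrative): hard-code the new artifact location in the backend monolith,
# typically via a separate code change and PR.
MODEL_S3_URI = "s3://example-ml-models/search/ranker/v1/model.tar.gz"

# Step 3 (illustrative): register the features the model expects in the feature store.
# `register_features` is a hypothetical helper standing in for the real client.
def register_features(model_name: str, features: list[str]) -> None:
    ...

register_features("search-ranker", ["retailer_id", "brand_id", "order_count_30d"])
```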
To address these issues, Faire first attempted to solve the problem by adopting external tools:
* Comet for model tracking and experiment management
* Amazon SageMaker for real-time inference
However, simply adding these tools wasn't enough. They created a configuration management layer using YAML files (an illustrative config is sketched after this list) to standardize:
* Model container specifications
* Endpoint configurations
* Deployment parameters
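The talk doesn't publish the exact schema, so the sketch below is an assumption of what one such config might capture and how it could be loaded and validated in Python; every field name and value is a placeholder.

```python
from dataclasses import dataclass

import yaml  # PyYAML

# Illustrative endpoint config; the schema is an assumption, not Faire's actual format.
RAW_CONFIG = """
model:
  name: search-ranker
  image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/xgboost-inference:1.7  # container spec
  artifact: s3://example-ml-models/search/ranker/v1/model.tar.gz
endpoint:
  name: search-ranker-prod
  instance_type: ml.m5.xlarge
  initial_instance_count: 2
deployment:
  strategy: gradual          # gradual | blue_green | shadow
"""

@dataclass
class EndpointConfig:
    model_name: str
    image: str
    artifact: str
    endpoint_name: str
    instance_type: str
    initial_instance_count: int

def load_config(raw: str) -> EndpointConfig:
    doc = yaml.safe_load(raw)
    return EndpointConfig(
        model_name=doc["model"]["name"],
        image=doc["model"]["image"],
        artifact=doc["model"]["artifact"],
        endpoint_name=doc["endpoint"]["name"],
        instance_type=doc["endpoint"]["instance_type"],
        initial_instance_count=doc["endpoint"]["initial_instance_count"],
    )

config = load_config(RAW_CONFIG)
```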
This improved flexibility but introduced new complexity. The deployment process expanded from 3 steps to 5+ steps requiring multiple PRs, making it challenging for data scientists to manage deployments effectively. Clean-up and retraining processes became even more complex, requiring coordination across multiple systems.
The breakthrough came with the development of their internal Machine Learning Model Management (MMM) tool. This unified platform streamlined the entire deployment process through a single UI (a simplified orchestration sketch follows this list), handling:
* Model registration and tracking in Comet
* Feature store integration
* Automated deployment to SageMaker
* Configuration management
* Metrics and monitoring setup
* Backend code generation
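MMM's internals aren't published, but a heavily simplified happy-path orchestration, reusing the illustrative `EndpointConfig` from the config sketch above and assuming boto3 for SageMaker with a placeholder for the Comet registration step, might look like this:

```python
import boto3

sagemaker = boto3.client("sagemaker")

def register_in_comet(model_name: str, artifact: str) -> None:
    # Placeholder for the Comet model-registry call; omitted to avoid guessing at the SDK surface.
    ...

def deploy(config) -> None:
    """Hypothetical happy-path orchestration for one model deployment."""
    # 1. Register the model version for tracking.
    register_in_comet(config.model_name, config.artifact)

    # 2. Create the SageMaker model and endpoint configuration from the YAML-derived config.
    sagemaker.create_model(
        ModelName=config.model_name,
        PrimaryContainer={"Image": config.image, "ModelDataUrl": config.artifact},
        ExecutionRoleArn="arn:aws:iam::123456789012:role/example-sagemaker-role",  # placeholder
    )
    sagemaker.create_endpoint_config(
        EndpointConfigName=f"{config.endpoint_name}-cfg",
        ProductionVariants=[{
            "VariantName": "primary",
            "ModelName": config.model_name,
            "InstanceType": config.instance_type,
            "InitialInstanceCount": config.initial_instance_count,
        }],
    )

    # 3. Create (or later update) the real-time endpoint.
    sagemaker.create_endpoint(
        EndpointName=config.endpoint_name,
        EndpointConfigName=f"{config.endpoint_name}-cfg",
    )

    # 4. Remaining MMM responsibilities -- feature store wiring, dashboards and alerts,
    #    and backend client code generation -- would hang off the same config.
```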
Key improvements delivered by MMM include:
* Zero-code deployments for standardized models
* Automated clean-up processes
* Reduction in deployment time from 3+ days to 4 hours
* Comprehensive testing and safety checks
* Improved tracking and observability
The MMM system supports different deployment strategies (illustrated in the sketch after this list):
* Gradual rollouts
* Expedited deployments (blue-green)
* Shadow deployments for testing
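One way such strategies can map onto SageMaker's built-in deployment guardrails is sketched below: a gradual rollout shifts traffic in steps, while an expedited (blue-green) update shifts it all at once. The alarm name and interval values are placeholders, this is not necessarily how Faire implements it, and shadow deployments would instead mirror traffic to a separate shadow variant.

```python
import boto3

sagemaker = boto3.client("sagemaker")

def update_endpoint(endpoint_name: str, new_config_name: str, expedited: bool = False) -> None:
    """Illustrative endpoint update using SageMaker deployment guardrails."""
    if expedited:
        # Expedited / blue-green: shift all traffic to the new fleet at once.
        routing = {"Type": "ALL_AT_ONCE", "WaitIntervalInSeconds": 0}
    else:
        # Gradual rollout: shift traffic in 25% steps, waiting five minutes between steps.
        routing = {
            "Type": "LINEAR",
            "LinearStepSize": {"Type": "CAPACITY_PERCENT", "Value": 25},
            "WaitIntervalInSeconds": 300,
        }

    sagemaker.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=new_config_name,
        DeploymentConfig={
            "BlueGreenUpdatePolicy": {
                "TrafficRoutingConfiguration": routing,
                "TerminationWaitInSeconds": 600,
            },
            # Roll back automatically if a CloudWatch alarm fires during the rollout.
            "AutoRollbackConfiguration": {
                "Alarms": [{"AlarmName": "example-endpoint-5xx-alarm"}]  # placeholder
            },
        },
    )
```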
The technical implementation uses the following stack (a code-generation sketch follows this list):
* Next.js for the frontend
* DynamoDB for data storage
* GitHub Actions for workflows
* Python scripts for code generation
* Integration with SageMaker and Comet APIs
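The Python code-generation piece is only described at a high level; a toy version of the idea, using plain string templating with entirely hypothetical class names, method names, and helpers, could look like this:

```python
# Toy code-generation script: render a typed backend client for a new endpoint.
# The template, class name, and `invoke_sagemaker_endpoint` helper are illustrative,
# not Faire's actual generated code.
CLIENT_TEMPLATE = '''\
class {class_name}:
    """Auto-generated client for the {endpoint_name} SageMaker endpoint."""

    ENDPOINT_NAME = "{endpoint_name}"
    FEATURES = {features!r}

    def predict(self, payload: dict) -> dict:
        return invoke_sagemaker_endpoint(self.ENDPOINT_NAME, payload)
'''

def generate_client(model_name: str, endpoint_name: str, features: list[str]) -> str:
    class_name = "".join(part.capitalize() for part in model_name.split("-")) + "Client"
    return CLIENT_TEMPLATE.format(
        class_name=class_name,
        endpoint_name=endpoint_name,
        features=features,
    )

if __name__ == "__main__":
    print(generate_client("search-ranker", "search-ranker-prod", ["retailer_id", "brand_id"]))
```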
An interesting aspect of their approach is how they handled the build vs. buy decision. As a small team with a large mandate, they chose to leverage existing enterprise solutions (SageMaker, Comet) while building custom integration layers that matched their specific needs. This balanced approach allowed them to move quickly while maintaining control over their deployment processes.
The system continues to evolve, with planned improvements including:
* Automated shadow testing for retrained models
* Full automation of monolith integration
* Extension to other ML workloads like one-click training and fine-tuning
* Enhanced feature store exploration capabilities
Faire's approach to LLMOps demonstrates the importance of abstracting complexity while maintaining flexibility. Rather than forcing data scientists to manage multiple systems and workflows, they created a unified interface that handles the complexity behind the scenes. This allows their teams to focus on model development while ensuring production deployments remain safe and observable.
Their experience highlights several key lessons for LLMOps:
* The importance of balancing flexibility with standardization
* The value of automated testing and deployment safety checks
* The need for comprehensive observability across the ML lifecycle
* The benefits of incremental improvement in platform capabilities
The case study also shows how ML infrastructure needs evolve as organizations scale, and how solutions must adapt to support increasingly complex requirements while maintaining usability for data scientists and ML engineers. Their success in deploying various model types, including LLMs and large vision models, demonstrates the platform's versatility and robustness.