Company
Roblox
Title
Building a Hybrid Cloud AI Infrastructure for Large-Scale ML Inference
Industry
Media & Entertainment
Year
2024
Summary (short)
Roblox underwent a three-phase transformation of their AI infrastructure to support rapidly growing ML inference needs across 250+ production models. They built a comprehensive ML platform using Kubeflow, implemented a custom feature store, and developed an ML gateway with vLLM for efficient large language model operations. The system now processes 1.5 billion tokens weekly for their AI Assistant, handles 1 billion daily personalization requests, and manages tens of thousands of CPUs and over a thousand GPUs across hybrid cloud infrastructure.
Roblox's journey in scaling their AI infrastructure presents a comprehensive case study in building enterprise-grade LLMOps capabilities. The company faced the challenge of transitioning from a fragmented approach, where individual teams built their own ML solutions, to a unified, scalable platform capable of supporting hundreds of ML models in production. This case study is particularly noteworthy because it demonstrates the evolution of an ML platform through three distinct phases, each addressing different aspects of the MLOps lifecycle.

### Initial Context and Challenges

In late 2021, Roblox was dealing with a fragmented ML infrastructure in which different teams built their own solutions for critical components such as the Avatar Marketplace, homepage, and search. This led to duplicated effort in feature engineering and inference scaling, highlighting the need for a centralized platform.

### Phase One: Foundation Building

The first phase focused on establishing core ML infrastructure components:

* Adopted Kubeflow as the primary ML platform, providing essential capabilities for notebooks, pipelines, experimentation, and model serving
* Developed `roblox-ml`, a Python library to simplify model deployment to production
* Implemented a third-party feature store solution that grew to support over 900 features across 100+ feature services
* Utilized KServe with Triton Inference Server for model serving, supporting multiple ML frameworks on both GPUs and CPUs
* Established comprehensive testing protocols including offline experiments, shadow testing, and A/B testing
* Implemented continuous monitoring of both operational metrics and model accuracy

### Phase Two: Scaling Infrastructure

The second phase focused on optimization and scaling:

* Expanded distributed training capabilities to handle models with billions of parameters across multiple nodes
* Integrated Ray for batch inference, improving resource utilization and enabling better task parallelism
* Moved CPU inference workloads to their own data centers for better latency control and privacy
* Developed a custom feature store built on Feast and Flink, processing 30 billion records daily with 50 ms P99 latency
* Implemented a vector database for efficient embedding storage and retrieval
* Established a ground truth team to standardize dataset production and validation

### Phase Three: Advanced LLMOps

The final phase focused on operationalizing massive-scale inference:

* Built a unified ML gateway to centralize access to all large models across different environments
* Adopted vLLM as their primary LLM inference engine, achieving 2x improvements in latency and throughput (see the sketch after this list)
* Currently processing 4 billion tokens weekly through their LLM infrastructure
* Implemented speculative decoding for improved inference performance
* Added multimodal support to vLLM through open-source contributions
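The case study names vLLM as the serving engine but does not show how it is invoked. As a point of reference, the snippet below is a minimal offline-inference sketch using vLLM's public Python API; the model name and prompt are illustrative assumptions, not Roblox's actual configuration.

```python
# Minimal vLLM offline-inference sketch; model and prompt are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # hypothetical model choice
sampling = SamplingParams(temperature=0.2, max_tokens=256)

prompts = ["Write a Luau function that teleports a player back to spawn."]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```

In a production setting like the one described here, the same engine would more likely run as a long-lived server process behind the ML gateway rather than being called in-process as above.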
### Technical Infrastructure Details

The platform now supports approximately 250 ML inference pipelines, utilizing:

* Tens of thousands of CPUs and over a thousand GPUs across hybrid cloud infrastructure
* Processing 1.5 billion tokens weekly for their AI Assistant
* Handling 1 billion daily personalization requests for 79.5 million daily active users
* Managing over 20 parallel A/B tests for personalization models

### Practical Applications

The infrastructure supports a range of AI-powered features:

* Recommendation and search systems for experience discovery
* Matchmaking algorithms for server allocation
* AI Assistant for script generation
* Texture and Material Generator tools
* Avatar Auto Setup (accounting for 8% of UGC avatar body publications)
* Real-time AI chat translation
* Voice safety moderation

### Monitoring and Observability

The platform includes robust monitoring capabilities:

* Centralized throttling by token count for generative AI workloads (a sketch of one possible approach appears at the end of this case study)
* Latency-aware load balancing between regions
* Comprehensive usage tracking and monitoring tools
* Human moderator evaluation of reported inference disagreements

### Open Source Commitment

Roblox has demonstrated commitment to the open-source community:

* Released their voice safety classifier as open source
* Plans to open source their ML gateway
* Active contributions to vLLM, particularly for multimodal support
* Built their stack on a range of open-source technologies

### Key Learning Points

The case study highlights several important lessons for organizations building large-scale ML infrastructure:

* The importance of progressive evolution when building ML infrastructure
* The benefits of centralizing ML operations versus fragmented, team-specific solutions
* The value of hybrid approaches combining cloud and on-premises resources
* The importance of building flexible infrastructure that can scale with growing demands
* The benefits of contributing to and leveraging open-source technologies

The case study demonstrates how a major platform can successfully scale its AI infrastructure while maintaining performance, efficiency, and reliability. It shows the importance of careful planning and phased implementation when building large-scale ML systems, as well as the value of combining multiple technologies and approaches to meet specific needs.
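The monitoring section above mentions centralized throttling by token count, but the case study does not describe the mechanism. One common approach at a gateway is a per-client token bucket denominated in LLM tokens rather than requests. The sketch below is a hypothetical illustration of that idea, not Roblox's implementation; all names and numbers are invented.

```python
import time


class TokenBudget:
    """Per-client budget that refills at a fixed rate, denominated in LLM tokens."""

    def __init__(self, capacity: int, refill_per_sec: float) -> None:
        self.capacity = capacity              # maximum burst size, in tokens
        self.refill_per_sec = refill_per_sec  # sustained rate, in tokens per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def try_consume(self, token_count: int) -> bool:
        """Return True if the request fits in the remaining budget, otherwise throttle."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_per_sec)
        self.last_refill = now
        if token_count <= self.tokens:
            self.tokens -= token_count
            return True
        return False


# Hypothetical usage: the gateway estimates prompt + expected completion tokens per request.
budget = TokenBudget(capacity=100_000, refill_per_sec=2_000)
if budget.try_consume(token_count=1_500):
    ...  # forward the request to the inference backend
else:
    ...  # reject or queue the request (e.g. respond with HTTP 429)
```

Keying such budgets by team or application, as a centralized gateway can do, is what makes fair sharing of a fixed GPU fleet possible across many generative AI workloads.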
