Autodesk built a machine learning platform from scratch using Metaflow as the foundation for their managed training infrastructure. The platform enables data scientists to construct end-to-end ML pipelines, with a particular focus on distributed training of large language models. The team integrated AWS services, applied security hardening throughout, and created a user-friendly interface that supports both experimental and production workflows. The platform has been rolled out to 50 users and has demonstrated successful fine-tuning of large language models, including a 6B-parameter model fine-tuned in 50 minutes on 16 A10 GPUs.
This case study details Autodesk's journey in building their Autodesk Machine Learning Platform (AMP) from the ground up, with a particular focus on handling large language models and distributed training workflows. The presentation was given by Riley, a senior software engineer in the machine learning platform team at Autodesk.
# Platform Overview and Architecture
The platform was built with Metaflow as its core orchestration tool, chosen for its versatility in handling data, compute, orchestration, and versioning. A key consideration was Autodesk's commitment to AWS infrastructure, and Metaflow's strong integration with AWS managed services made it an ideal choice.
The architecture consists of several key components:
* SageMaker Studio serves as the managed IDE (called AM Studio internally)
* A custom UI for authentication and authorization
* Metaflow for pipeline orchestration and experiment tracking
* AWS Batch for compute orchestration (see the flow sketch after this list)
* Step Functions for production workflow scheduling
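To make the division of labor concrete, here is a minimal sketch of the kind of flow a data scientist might write on the platform: Metaflow handles versioning and artifact tracking, while the `@batch` decorator ships the heavy step to AWS Batch. The flow name, project name, container image, and resource sizes are assumptions for illustration, not Autodesk's actual code.

```python
from metaflow import FlowSpec, step, batch, project


@project(name="amp_example")  # hypothetical project name
class TrainingFlow(FlowSpec):
    """Minimal sketch of a flow authored in the managed Studio environment."""

    @step
    def start(self):
        # Lightweight setup runs wherever the flow is launched from.
        self.dataset_path = "s3://example-bucket/dataset"  # hypothetical path
        self.next(self.train)

    @batch(gpu=1, memory=64000, image="internal-registry/hardened-pytorch:latest")
    @step
    def train(self):
        # The heavy step is dispatched to AWS Batch; Metaflow versions the
        # code, parameters, and artifacts of every run automatically.
        self.model_metrics = {"loss": 0.0}  # placeholder for real training
        self.next(self.end)

    @step
    def end(self):
        print("metrics:", self.model_metrics)


if __name__ == "__main__":
    TrainingFlow()
```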
Security was a paramount concern, with every component undergoing thorough security hardening:
* All Docker images are based on security-hardened in-house images
* Regular patching pipelines scan for vulnerabilities
* Infrastructure is deployed using security-hardened Terraform modules
* The entire system underwent formal security review and approval
# MLOps Workflow and CI/CD
The platform implements a comprehensive MLOps workflow that covers both experimental and production scenarios. For production deployments, they implemented a GitOps pattern:
* Users perform model evaluation and tag successful runs with "prod_ready" (see the sketch after this list)
* Code is committed and PRs are created in Enterprise GitHub
* Jenkins pipelines handle testing and validation
* Lambda functions manage deployment to Step Functions
* Consistency is verified between the code committed to Git and the code snapshot stored as Metaflow artifacts
* Production deployments are monitored with Slack alerts for failures
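The promotion step in this pattern can be sketched with Metaflow's client API. The flow name and acceptance check below are hypothetical; only the "prod_ready" tag comes from the talk, and mutable run tags require a reasonably recent Metaflow release.

```python
from metaflow import Flow

# After evaluation, a data scientist (or an automated check in CI) marks the
# candidate run as ready for promotion to production.
run = Flow("TrainingFlow").latest_successful_run      # hypothetical flow name
if run.data.model_metrics["loss"] < 0.1:              # hypothetical acceptance gate
    run.add_tag("prod_ready")
```

From there, deploying the flow as a scheduled production workflow is a single Metaflow command (`python training_flow.py step-functions create`), which fits naturally behind the Jenkins and Lambda automation described above.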
To facilitate adoption, they created:
* A comprehensive example repository showcasing various use cases
* Cookie-cutter templates for standardized project structure
* Standardized naming conventions for flows, projects, and tags
# Distributed Training Capabilities
The platform shows particular strength in handling distributed training for large language models:
* Supports multiple distributed computing frameworks (Hugging Face Accelerate, PyTorch Lightning, DeepSpeed)
* Utilizes AWS Batch multi-node parallel jobs
* Implements GPU monitoring and profiling through custom decorators (a simplified sketch follows this list)
* Provides CloudWatch dashboards for resource utilization
* Supports various instance types and spot instances for cost optimization
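Autodesk's monitoring decorators were not shown in detail, so the following is only a simplified stand-in: a plain Python decorator that samples `nvidia-smi` in a background thread while the wrapped step runs. A production version would be implemented as a proper Metaflow step decorator and push the samples to CloudWatch rather than printing a count.

```python
import subprocess
import threading
import time
from functools import wraps


def gpu_monitor(interval=30):
    """Illustrative decorator: sample GPU utilization while a step runs."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            samples, stop = [], threading.Event()

            def poll():
                while not stop.is_set():
                    out = subprocess.run(
                        ["nvidia-smi",
                         "--query-gpu=utilization.gpu,memory.used",
                         "--format=csv,noheader,nounits"],
                        capture_output=True, text=True,
                    )
                    samples.append(out.stdout.strip())
                    time.sleep(interval)

            poller = threading.Thread(target=poll, daemon=True)
            poller.start()
            try:
                return func(*args, **kwargs)
            finally:
                stop.set()
                poller.join(timeout=interval + 1)
                # A real implementation would publish these samples to
                # CloudWatch or attach them to the run as Metaflow artifacts.
                print(f"collected {len(samples)} GPU utilization samples")
        return wrapper
    return decorator
```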
A notable achievement was the integration of Ray with Metaflow:
* Created a custom Ray parallel decorator (a hypothetical usage sketch follows this list)
* Successfully fine-tuned a 6B-parameter model in 50 minutes using 16 A10 GPUs
* Enabled seamless monitoring of Ray applications through the Metaflow UI
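The decorator itself was not shown in the talk, so the snippet below only sketches the user-facing pattern under stated assumptions: a hypothetical `@ray_parallel` decorator (and the `amp_extensions` package it lives in) stands in for Autodesk's custom integration and is assumed to start a Ray head on the first node of a multi-node Batch job and join the remaining nodes as workers before the step body runs.

```python
from metaflow import FlowSpec, step, batch

# Hypothetical internal package and decorator, standing in for the custom
# Ray integration described in the talk.
from amp_extensions import ray_parallel


class RayFineTuneFlow(FlowSpec):

    @step
    def start(self):
        # Gang-schedule the training step across 4 nodes (4 A10 GPUs each).
        self.next(self.train, num_parallel=4)

    @ray_parallel  # assumed to form the Ray cluster across the gang
    @batch(gpu=4, memory=192000, image="internal-registry/ray-train:latest")
    @step
    def train(self):
        import ray

        # With the cluster already formed by the decorator, the step body
        # connects and runs an ordinary Ray training job.
        ray.init(address="auto")
        print(ray.cluster_resources())  # e.g. 16 GPUs across 4 nodes
        # ... launch fine-tuning with Ray Train / HF Accelerate here ...
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    RayFineTuneFlow()
```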
Performance benchmarking showed impressive results:
* Tested configurations with 2-4 nodes, each with 4 A10 GPUs
* Implemented optimization techniques like activation checkpointing and CPU optimizer offloading (sketched after this list)
* Demonstrated scaling efficiency with larger models (T5-3B and GPT-J 6B)
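The exact training configurations were not published, but the two techniques named above map onto standard knobs in Hugging Face Transformers and DeepSpeed. The sketch below shows roughly where they live; the model choice, ZeRO stage, and batch size are assumptions, not Autodesk's settings.

```python
from transformers import AutoModelForSeq2SeqLM

# Activation checkpointing: recompute activations in the backward pass
# instead of storing them, trading compute for memory.
model = AutoModelForSeq2SeqLM.from_pretrained("t5-3b")
model.gradient_checkpointing_enable()

# CPU optimizer offloading via DeepSpeed ZeRO: keep optimizer state in host
# memory so larger models fit on the available GPUs.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
}
# The dict can be handed to the Hugging Face Trainer via
# TrainingArguments(deepspeed=ds_config) or to Accelerate's DeepSpeedPlugin.
```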
# High-Performance Computing Integration
To handle larger models and datasets, they implemented several HPC features:
* Elastic Fabric Adapter (EFA) integration for improved inter-node communication (see the configuration sketch after this list)
* FSx for Lustre integration for high-performance file system access
* Custom security-hardened Deep Learning AMIs with EFA support
* Support for attaching multiple EFA devices to A100-based instances
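Inside the training container, much of the EFA integration surfaces as environment configuration for libfabric and NCCL. The variables below are the commonly documented ones for EFA, shown here as an illustrative sanity check rather than Autodesk's exact setup.

```python
import os
import subprocess

# Point libfabric/NCCL at the EFA provider (commonly documented settings).
os.environ["FI_PROVIDER"] = "efa"
os.environ["FI_EFA_USE_DEVICE_RDMA"] = "1"   # relevant on A100 (p4d) instances
os.environ["NCCL_DEBUG"] = "INFO"            # log which transport NCCL selects

# Quick check that EFA devices are visible inside the container.
subprocess.run(["fi_info", "-p", "efa"], check=False)
```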
# User Experience and Adoption
The platform was designed with user experience as a primary concern:
* Intuitive UI for notebook instance management
* Direct integration of Metaflow UI within SageMaker Studio
* Comprehensive documentation and examples
* Support for multiple teams through AWS account isolation
# Challenges and Solutions
Several challenges were addressed during implementation:
* Multi-tenancy handling through separate AWS accounts per team
* Integration of existing Ray workflows with Metaflow
* Security compliance while maintaining usability
* Migration of users from existing bespoke ML platforms
# Current Status and Future Plans
The platform has been successfully rolled out to 50 users and continues to evolve:
* Regular refinements based on user feedback
* Ongoing improvements to the CI/CD pipeline
* Exploration of additional HPC capabilities
* Enhanced monitoring and observability features
The case study demonstrates a thoughtful approach to building a production-grade ML platform that balances security, usability, and performance. The focus on distributed training capabilities and integration with modern MLOps tools makes it particularly relevant for organizations working with large language models and other compute-intensive ML workloads.