Adept.ai, building an AI model for computer interaction, faced challenges with complex fine-tuning pipelines running on Slurm. They implemented a migration strategy to Kubernetes using Metaflow and Argo for workflow orchestration, while maintaining existing Slurm workloads through a hybrid approach. This allowed them to improve pipeline management, enable self-service capabilities for data scientists, and establish robust monitoring infrastructure, though complete migration to Kubernetes remains a work in progress.
Adept.ai is developing an innovative AI model designed to interact with and operate computer interfaces, as demonstrated through their browser extension that can interpret natural language instructions and perform complex tasks across different websites. This case study explores their journey in modernizing their LLM operations infrastructure, particularly focusing on the transition from Slurm to Kubernetes for managing fine-tuning pipelines and associated workflows.
**Initial Infrastructure and Challenges**
The company initially relied on Slurm, a well-established cluster management and job scheduling system commonly used in HPC environments. While Slurm proved effective for their initial needs, particularly in handling distributed training workloads across multiple nodes with GPUs, several challenges emerged as the company grew:
* The fine-tuning pipeline codebase became increasingly complex with multiple entry points, various config files, and different processes for training, fine-tuning, and evaluation
* New team members struggled to navigate and modify the existing system
* Developers and data scientists desired a more modern, flexible alternative with better documentation and community support
**Migration Strategy**
Rather than pursuing an immediate complete migration, Adept.ai adopted a pragmatic hybrid approach:
* Selected Metaflow with Argo as their workflow orchestration solution on Kubernetes
* Maintained existing Slurm workloads while gradually transitioning to containerized operations
* Implemented an interim solution where Metaflow workflows would SSH into Slurm nodes to execute training commands
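The case study does not show the exact bridge mechanism, but a workflow step shelling out over SSH to a Slurm login node can be sketched with the standard library alone. The hostnames, script names, and `sbatch` flags below are illustrative assumptions, not Adept's actual code:

```python
import shlex
import subprocess


def build_ssh_command(host: str, remote_cmd: list[str]) -> list[str]:
    """Build an ssh invocation that runs a command on a Slurm login node."""
    # BatchMode avoids interactive password prompts inside a container,
    # so a failed key exchange fails fast instead of hanging the step.
    return ["ssh", "-o", "BatchMode=yes", host] + [shlex.quote(a) for a in remote_cmd]


def launch_on_slurm(host: str, script: str, nodes: int) -> list[str]:
    # sbatch queues the training job on the Slurm cluster; the workflow
    # step only needs the submission to succeed, not the job to finish.
    cmd = build_ssh_command(host, ["sbatch", f"--nodes={nodes}", script])
    # subprocess.run(cmd, check=True)  # uncomment on a host with SSH access
    return cmd
```

A Metaflow step running inside a Kubernetes pod could call `launch_on_slurm(...)` and then poll job status, keeping the orchestration in Argo while the GPUs stay on Slurm.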
The migration process encountered several significant challenges:
1. **Code Complexity and Configuration Management**
* Required multiple sprints to restructure code into logical workflow steps
* Needed to establish consistent config loading mechanisms across containerized environments
* Used Python path environment variables to ensure proper code discovery
2. **Legacy System Support**
* Maintained support for existing Slurm-based workflows while building new infrastructure
* Created a bridge solution where Metaflow steps would SSH into Slurm nodes
* Focused on maintaining productivity while gradually moving toward fully containerized workloads
3. **Containerization Challenges**
* Dealt with large repository sizes (approaching 1GB) due to Git LFS usage and model artifacts
* Implemented GitOps practices with automatic container builds for each commit
* Created custom container building solutions to handle complex dependency requirements
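One way to make config loading behave identically on Slurm nodes and inside containers is to resolve paths through a single environment variable. The `CONFIG_ROOT` variable and JSON layout below are hypothetical; the case study only states that Python path environment variables were used for code discovery:

```python
import json
import os
from pathlib import Path


def resolve_config(name: str) -> Path:
    """Locate a config file the same way in every execution environment.

    CONFIG_ROOT is an assumed env var; the original setup reportedly
    relied on PYTHONPATH so imports and lookups resolved consistently.
    """
    root = Path(os.environ.get("CONFIG_ROOT", "configs"))
    path = root / f"{name}.json"
    if not path.exists():
        raise FileNotFoundError(f"config {name!r} not found under {root}")
    return path


def load_config(name: str) -> dict:
    return json.loads(resolve_config(name).read_text())
```

Setting `CONFIG_ROOT` in both the container image and the Slurm job environment means the same `load_config("finetune")` call works unchanged in either place.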
**Workflow Implementation**
The team successfully implemented several types of workflows:
* **Fine-tuning and Evaluation Pipelines**
* Created structured DAGs for fine-tuning processes
* Implemented comprehensive evaluation across different action types
* Built-in observability with Slack notifications and dashboard integration
* **Automated Scheduling**
* Set up cron jobs for recurring tasks
* Implemented nightly training jobs handling large-scale operations (up to 32 nodes/240 GPUs)
* Enabled power users to programmatically trigger new workflows
* **CI/CD Integration**
* Automated workflow updates on main branch merges
* Implemented container building and deployment pipelines
* Created systems for automatic nightly workflow updates
* **Infrastructure Monitoring**
* Developed workflows for tracking Slurm job status
* Implemented detection systems for long-running and zombie workflows
* Created automated housekeeping tasks for workflow cleanup
**Results and Current Status**
The migration has delivered several key benefits:
* Enabled self-service capabilities for data scientists to launch fine-tuning jobs
* Provided robust workflow management through Argo on Kubernetes
* Established automated monitoring and maintenance systems
However, some challenges remain:
* Complete migration to Kubernetes is still in progress
* Container size management continues to be a concern
* Some desired features are still missing, such as:
* Step-level exception handling
* Queue management for resource-constrained scenarios
* More sophisticated inter-workflow communication
**Technical Infrastructure Details**
The solution leverages several key technologies:
* Metaflow for workflow definition and management
* Argo for Kubernetes-based workflow orchestration
* Custom CLI tools for simplified job launching
* Integration with monitoring and observability tools
* CircleCI for continuous integration and deployment
**Lessons Learned**
This case study highlights several important lessons for organizations undertaking similar migrations:
* The value of gradual migration strategies that maintain existing capabilities
* The importance of building bridges between old and new infrastructure
* The need to balance ideal architectural solutions with practical constraints
* The benefits of investing in automation and self-service capabilities
While the migration is still ongoing, Adept.ai's approach demonstrates how organizations can modernize their LLM operations infrastructure while maintaining productivity and gradually moving toward a more containerized, Kubernetes-based future.