At ZenML, we're constantly striving to improve our infrastructure and processes. Recently, we embarked on an exciting journey to migrate our ZenML Pro tenant lifecycle management from Spacelift to a more robust and flexible solution using Kubernetes (ArgoCD) and native Terraform capabilities. This blog post details our motivation, the challenges we faced, and the benefits we've reaped from this migration.
Why We Made the Switch
Our decision to move away from Spacelift was driven by several factors:
1. Scalability Constraints: Spacelift's paid plan limited the number of parallel runners, which became a bottleneck as we grew.
2. Data Ownership: All metadata and Terraform state were stored on Spacelift's side, reducing our control over critical information.
3. Limited Control: We needed more low-level control over the lifecycle processes of our tenants.
4. Performance Issues: Long start times due to repeated provider downloads during initialization were slowing us down.
5. User Experience: New user onboarding was hampered by lengthy tenant provisioning times, often taking around 3 minutes.
What We're Deploying
Our new infrastructure setup deploys a variety of resources to support each ZenML Pro tenant. Here's a summary of the key components:
1. AWS IAM Roles and Policies
2. Kubernetes Resources:
- A dedicated namespace for each tenant
- A Role and RoleBinding for the ZenML server
- Service accounts for the ZenML server and its jobs
3. Helm Release: Deploys the ZenML chart with tenant-specific configurations.
4. Authentication Secrets
This infrastructure ensures each tenant has isolated, secure, and efficiently managed resources within our AWS and Kubernetes environment.
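As a rough illustration, the Kubernetes side of a single tenant boils down to objects like the following. This is a minimal hand-written sketch, not our actual Terraform output; all names (and the Role's rules) are hypothetical placeholders.

```yaml
# Minimal sketch of the per-tenant Kubernetes objects.
# Names and the Role's rules are hypothetical placeholders;
# the real resources are created by our Terraform modules.
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-abc123          # one namespace per tenant
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: zenml-server
  namespace: tenant-abc123
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: zenml-server
  namespace: tenant-abc123
rules:
  # e.g. let the server manage job pods inside its own namespace
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: zenml-server
  namespace: tenant-abc123
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: zenml-server
subjects:
  - kind: ServiceAccount
    name: zenml-server
    namespace: tenant-abc123
```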
Embracing Kubernetes with ArgoCD, Terraform, and GitOps
By leveraging Kubernetes with ArgoCD, Terraform's native lifecycle management, and GitOps principles, we've gained:
Improved Scalability
Our move to Kubernetes significantly enhances scalability:
- Dynamically scale ArgoCD runners for more concurrent deployments
- Automatically adjust Kubernetes nodes based on demand
- Optimize compute resources with auto-scaling pods
- Efficiently distribute network traffic across the cluster
GitOps-Driven Configuration Management
Our migration to ArgoCD has allowed us to fully embrace GitOps principles, significantly improving our configuration management and deployment processes:
- Version-Controlled Configurations: All infrastructure configurations are stored in our GitHub repository, providing a single source of truth for our entire system state.
- Automated Synchronization: ArgoCD Applications can watch specific paths in our GitHub repository and automatically apply any changes pushed there. We intentionally keep this automatic sync disabled: ZenML Pro itself decides which changes each tenant needs, so users are never interrupted by unexpected updates being applied.
- Push-Based Updates: ZenML Pro pushes updated configurations directly to GitHub, triggering the GitOps workflow.
- Reproducible Deployments: To ensure reproducibility and stability, our ArgoCD Applications are configured to point to exact commit SHAs rather than branches. This practice guarantees that we always know precisely which version of our configuration is deployed (see the Application sketch after this list).
- Audit Trail: By using Git as the backend for our configurations, we maintain a comprehensive audit trail of all changes, including who made them and when.
- Easy Rollbacks: If issues arise, we can quickly roll back to a previous known-good state by reverting commits in our Git repository.
- Infrastructure as Code: Our GitOps approach extends the "Infrastructure as Code" paradigm, treating our entire infrastructure configuration as a versioned codebase.
- Increased Transparency: Team members can review proposed changes through pull requests before they're applied to the live environment.
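To illustrate, a tenant Application might look roughly like the sketch below. The repository URL, path, and commit SHA are hypothetical placeholders; the two details to note are the `targetRevision` pinned to an exact commit and the absence of an automated `syncPolicy`, which matches the push-based flow described above.

```yaml
# Hypothetical ArgoCD Application for a single tenant.
# Repo URL, path, and commit SHA are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenant-abc123
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/zenml-io/tenant-configs.git
    path: tenants/tenant-abc123
    # Pinned to an exact commit SHA, not a branch, so the
    # deployed configuration is always reproducible.
    targetRevision: 3f2a9c1e8b7d6a5f4e3d2c1b0a9f8e7d6c5b4a39
  destination:
    server: https://kubernetes.default.svc
    namespace: tenant-abc123
  # No automated syncPolicy: ZenML Pro triggers syncs explicitly,
  # so tenants are never surprised by unattended changes.
```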
This GitOps-driven approach has significantly enhanced our ability to manage and deploy ZenML Pro tenants efficiently and reliably. It provides us with a clear, auditable, and reproducible process for managing our infrastructure, aligning perfectly with our goals for scalability, control, and performance.
Faster Provisioning
Optimized processes slash tenant creation times:
- Reduced from 3 minutes to under 1 minute 🚀
- Parallel resource creation where possible
- Efficient use of Kubernetes' declarative model
- Streamlined Terraform execution
- Caching of common resources and configurations
Smoother Onboarding
New users can start in less than a minute:
- Faster access to ZenML Pro environment
- Reduced waiting time improves first impression
- Allows for quicker iterative testing and setup
- Enables rapid proof-of-concept deployments
- Increases overall user satisfaction and adoption rates
Enhanced Data Control
We now manage all Terraform state and metadata in-house:
- Full ownership of critical data
- Improved security and compliance
- Easier backups and disaster recovery
- Ability to perform advanced analytics on infrastructure data
- Seamless integration with our existing systems
Granular Process Control
Direct control over the entire tenant lifecycle:
- Customize provisioning steps for specific needs
- Implement fine-tuned security policies
- Easily add or modify lifecycle stages
- Rapid troubleshooting and issue resolution
- Flexibility to integrate with other tools and services
Optimizing Tenant Provisioning Time
In our quest for efficiency, we tackled one of the most time-consuming aspects of our deployment process: the spin-up time for Terraform pods. Initially, these pods took approximately 30-40 seconds to become operational, which was a significant bottleneck in our deployment pipeline. Through careful optimization, we managed to reduce this time to around 5 seconds, marking a substantial improvement in our overall deployment speed.
Key Optimizations
1. Image Puller DaemonSet
We implemented an image puller DaemonSet to ensure that the Terraform images are always "warm" and readily available on all nodes in our Kubernetes cluster. This DaemonSet:
- Preloads the Terraform image on every node
- Keeps the image updated to the latest version
- Eliminates the need to pull the image at pod creation time
By having the Terraform image pre-pulled and cached on each node, we significantly reduced the time required to start new Terraform pods.
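A common way to build such a puller is a DaemonSet whose only job is to reference the runner image, so the kubelet pulls and caches it on every node. The sketch below follows that pattern; the image name is a hypothetical placeholder.

```yaml
# Hypothetical image puller DaemonSet: one pod per node that
# references the Terraform runner image, so the kubelet pulls
# and caches it before any real Terraform pod needs it.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: terraform-image-puller
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: terraform-image-puller
  template:
    metadata:
      labels:
        app: terraform-image-puller
    spec:
      containers:
        - name: puller
          # Placeholder name for the Terraform runner image
          image: registry.example.com/zenml/terraform-runner:latest
          # Re-check the registry whenever the pod restarts, so
          # nodes pick up newly pushed image versions.
          imagePullPolicy: Always
          # The container does no work; it only keeps the image
          # resident in the node's local cache.
          command: ["sh", "-c", "while true; do sleep 3600; done"]
          resources:
            requests:
              cpu: 1m
              memory: 8Mi
```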
2. Custom Terraform Image with Pre-initialized Providers
We created a specialized Terraform image that includes pre-initialized providers. This custom image:
- Runs `terraform init` during the image build process
- Pre-downloads all necessary providers and modules
- Caches these components within the image
This approach eliminates the need to download providers during Terraform initialization every time a new pod starts, which was previously a major contributor to long spin-up times.
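In Dockerfile terms, the build might look roughly like this. The base image tag, file names, and cache path are hypothetical placeholders; the essential idea is that `terraform init` runs at build time, so the providers ship inside the image.

```dockerfile
# Hypothetical sketch of a Terraform runner image with
# pre-initialized providers; tags and paths are placeholders.
FROM hashicorp/terraform:1.5

# Providers downloaded during `terraform init` land in this cache,
# where runtime workspaces can reuse them without re-downloading.
ENV TF_PLUGIN_CACHE_DIR=/opt/terraform/plugin-cache
RUN mkdir -p "$TF_PLUGIN_CACHE_DIR"

# Copy only the configuration that declares required providers
# and modules, so the cache can be populated at build time.
COPY versions.tf /bootstrap/
WORKDIR /bootstrap

# -backend=false: no backend credentials are needed (or available)
# during the image build; we only want providers and modules.
RUN terraform init -backend=false
```

With the cache baked in, `terraform init` at pod startup resolves providers from the local filesystem instead of over the network.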
Results and Benefits
The combination of these optimizations yielded impressive results:
- Reduced Spin-up Time for Terraform Pods: From 30-40 seconds down to approximately 5 seconds
- Faster Deployments: Quicker pod initialization leads to faster overall deployments (from 3 minutes to less than a minute)
- Improved Resource Efficiency: Less time spent waiting for pods to become ready
- Enhanced User Experience: Faster response times for tenant-related operations
Looking Ahead
This migration significantly improves our ability to provide a robust, scalable, and efficient ZenML Pro experience. By embracing open-source tools and cloud-native technologies, we're not just solving today's challenges – we're building a foundation for the future of ZenML Pro.
We're excited about the possibilities this new setup brings and are constantly looking for ways to further improve our service. Stay tuned for more updates as we continue to evolve our infrastructure!