
Empowering ZenML Pro Infrastructure Management: Our Journey from Spacelift to ArgoCD

Andrei Vishniakov
Oct 11, 2024
4 mins

At ZenML, we're constantly working to improve our infrastructure and processes. Recently, we migrated ZenML Pro tenant lifecycle management from Spacelift to a more flexible setup built on Kubernetes, ArgoCD, and native Terraform capabilities. This blog post details our motivation, the challenges we faced, and the benefits we've seen from the migration.

Why We Made the Switch

Our decision to move away from Spacelift was driven by several factors:

1. Scalability Constraints: Spacelift's paid plan limited the number of parallel runners, which became a bottleneck as we grew.

2. Data Ownership: All metadata and states were stored on Spacelift's end, reducing our control over critical information.

3. Limited Control: We needed more low-level control over the lifecycle processes of our tenants.

4. Performance Issues: Long start times due to repeated provider downloads during initialization were slowing us down.

5. User Experience: New user onboarding was hampered by lengthy tenant provisioning times, often taking around 3 minutes.

What We're Deploying

Our new infrastructure setup deploys a variety of resources to support each ZenML Pro tenant. Here's a summary of the key components:

1. AWS IAM Roles and Policies

2. Kubernetes Resources:

  • A dedicated namespace for each tenant.
  • Role and RoleBinding for the ZenML server.
  • Service accounts for the ZenML server and jobs.

3. Helm Release: Deploys the ZenML chart with tenant-specific configurations.

4. Authentication secrets

This infrastructure ensures each tenant has isolated, secure, and efficiently managed resources within our AWS and Kubernetes environment.
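To make the Kubernetes part of that list concrete, here is a minimal sketch of the per-tenant objects written as plain Python dicts. The naming scheme (`zenml-tenant-<id>`, `zenml-server`, `zenml-jobs`) and the Role's rule are assumptions for illustration only; the actual resources are created through Terraform and the Helm chart, and their exact names and permissions are not given in this post.

```python
from __future__ import annotations

import json


def tenant_manifests(tenant_id: str) -> list[dict]:
    """Return the per-tenant Kubernetes objects as plain dicts."""
    namespace = f"zenml-tenant-{tenant_id}"  # hypothetical naming scheme
    server_sa = "zenml-server"               # hypothetical service account name
    jobs_sa = "zenml-jobs"                   # hypothetical service account name

    def meta(name: str) -> dict:
        return {"name": name, "namespace": namespace}

    return [
        {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": namespace}},
        {"apiVersion": "v1", "kind": "ServiceAccount", "metadata": meta(server_sa)},
        {"apiVersion": "v1", "kind": "ServiceAccount", "metadata": meta(jobs_sa)},
        {
            "apiVersion": "rbac.authorization.k8s.io/v1",
            "kind": "Role",
            "metadata": meta(server_sa),
            # Illustrative rule only: the real permissions are not stated in the post.
            "rules": [
                {
                    "apiGroups": ["batch"],
                    "resources": ["jobs"],
                    "verbs": ["create", "get", "list", "watch", "delete"],
                }
            ],
        },
        {
            "apiVersion": "rbac.authorization.k8s.io/v1",
            "kind": "RoleBinding",
            "metadata": meta(server_sa),
            "roleRef": {
                "apiGroup": "rbac.authorization.k8s.io",
                "kind": "Role",
                "name": server_sa,
            },
            "subjects": [
                {"kind": "ServiceAccount", "name": server_sa, "namespace": namespace}
            ],
        },
    ]


if __name__ == "__main__":
    for manifest in tenant_manifests("acme"):
        print(json.dumps(manifest))
```

Because every namespaced object carries the tenant's own namespace, the isolation described above falls out of the structure: nothing in one tenant's set refers to another tenant's namespace.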

Embracing Kubernetes with ArgoCD, Terraform, and GitOps

By leveraging Kubernetes with ArgoCD, Terraform's native lifecycle management, and GitOps principles, we've gained:

Improved Scalability

Our move to Kubernetes significantly enhances scalability:

  • Dynamically scale ArgoCD runners for more concurrent deployments
  • Automatically adjust Kubernetes nodes based on demand
  • Optimize compute resources with auto-scaling pods
  • Efficiently distribute network traffic across the cluster

GitOps-Driven Configuration Management

Our migration to ArgoCD has allowed us to fully embrace GitOps principles, significantly improving our configuration management and deployment processes:

  • Version-Controlled Configurations: All infrastructure configurations are stored in our GitHub repository, providing a single source of truth for our entire system state.
  • Automated Synchronization: ArgoCD Applications can be configured to watch specific paths in a GitHub repository and to detect and apply changes automatically as soon as they are pushed. We intentionally don't rely on this behavior: ZenML Pro decides when a tenant needs to change, so users are never interrupted by unexpected updates.
  • Push-Based Updates: ZenML Pro pushes updated configurations directly to GitHub, triggering the GitOps workflow.
  • Reproducible Deployments: To ensure reproducibility and stability, our ArgoCD Applications are configured to point to exact commit SHAs rather than branches. This practice guarantees that we always know precisely which version of our configuration is deployed.
  • Audit Trail: By using Git as the backend for our configurations, we maintain a comprehensive audit trail of all changes, including who made them and when.
  • Easy Rollbacks: If issues arise, we can quickly roll back to a previous known-good state by reverting commits in our Git repository.
  • Infrastructure as Code: Our GitOps approach extends the "Infrastructure as Code" paradigm, treating our entire infrastructure configuration as a versioned codebase.
  • Increased Transparency: Team members can review proposed changes through pull requests before they're applied to the live environment.

This GitOps-driven approach has significantly enhanced our ability to manage and deploy ZenML Pro tenants efficiently and reliably. It provides us with a clear, auditable, and reproducible process for managing our infrastructure, aligning perfectly with our goals for scalability, control, and performance.
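The commit-pinning idea can be sketched as a small Python helper that builds an ArgoCD `Application` manifest. The field names follow the ArgoCD `Application` resource (`argoproj.io/v1alpha1`), but the repository URL, the `tenants/<id>` path layout, the `argocd` namespace, and the application naming are assumptions made for illustration, not our actual configuration. The sketch leaves `syncPolicy` out entirely, since the post does not spell out how syncs are triggered.

```python
import json


def argocd_application(tenant_id: str, repo_url: str, commit_sha: str) -> dict:
    """Build an ArgoCD Application manifest pinned to one exact commit."""
    # Reject branch names and short SHAs so the deployed version is unambiguous.
    if len(commit_sha) != 40 or any(c not in "0123456789abcdef" for c in commit_sha):
        raise ValueError("expected a full 40-character lowercase commit SHA")
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"tenant-{tenant_id}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,                # hypothetical repository
                "path": f"tenants/{tenant_id}",     # hypothetical path layout
                "targetRevision": commit_sha,       # exact commit, not a branch
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": f"tenant-{tenant_id}",
            },
        },
    }


if __name__ == "__main__":
    app = argocd_application(
        "acme",
        "https://github.com/example/infra.git",
        "0123456789abcdef0123456789abcdef01234567",
    )
    print(json.dumps(app, indent=2))
```

With `targetRevision` set to a commit SHA, a new push to the repository changes nothing for the tenant until the Application itself is updated to point at the new commit, which is what keeps deployments reproducible.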

Faster Provisioning

Optimized processes slash tenant creation times:

  • Reduced from 3 minutes to under 1 minute 🚀
  • Parallel resource creation where possible
  • Efficient use of Kubernetes' declarative model
  • Streamlined Terraform execution
  • Caching of common resources and configurations

Smoother Onboarding

New users can start in less than a minute:

  • Faster access to ZenML Pro environment
  • Reduced waiting time improves first impression
  • Allows for quicker iterative testing and setup
  • Enables rapid proof-of-concept deployments
  • Increases overall user satisfaction and adoption rates

Enhanced Data Control

We now manage all states and metadata in-house:

  • Full ownership of critical data
  • Improved security and compliance
  • Easier backups and disaster recovery
  • Ability to perform advanced analytics on infrastructure data
  • Seamless integration with our existing systems

Granular Process Control

Direct control over the entire tenant lifecycle:

  • Customize provisioning steps for specific needs
  • Implement fine-tuned security policies
  • Easily add or modify lifecycle stages
  • Rapid troubleshooting and issue resolution
  • Flexibility to integrate with other tools and services

Optimizing Tenant Provisioning Time

In our quest for efficiency, we tackled one of the most time-consuming aspects of our deployment process: the spin-up time for Terraform pods. Initially, these pods took approximately 30-40 seconds to become operational, which was a significant bottleneck in our deployment pipeline. Through careful optimization, we managed to reduce this time to around 5 seconds, marking a substantial improvement in our overall deployment speed.

Key Optimizations

1. Image Puller DaemonSet

  We implemented an image puller DaemonSet to ensure that the Terraform images are always "warm" and readily available on all nodes in our Kubernetes cluster. This DaemonSet:
 
  - Preloads the Terraform image on every node
  - Keeps the image updated to the latest version
  - Eliminates the need to pull the image at pod creation time

  By having the Terraform image pre-pulled and cached on each node, we significantly reduced the time required to start new Terraform pods.
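A common way to build such an image puller is a DaemonSet whose init container references the target image and exits immediately, followed by a tiny long-running container that keeps the pod alive. The sketch below builds that manifest as a Python dict; it illustrates the general pattern rather than our exact manifest. The image name, the DaemonSet name, and the use of `registry.k8s.io/pause` are assumptions, and the no-op command assumes the image ships a shell.

```python
import json


def image_puller_daemonset(image: str, name: str = "terraform-image-puller") -> dict:
    """Build a DaemonSet that forces every node to pull `image`."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # The init container exists only so the kubelet pulls the
                    # image onto the node; it exits right away.
                    "initContainers": [
                        {
                            "name": "pull-terraform",
                            "image": image,
                            "imagePullPolicy": "Always",
                            "command": ["sh", "-c", "true"],
                        }
                    ],
                    # A minimal container keeps the pod running on each node.
                    "containers": [
                        {"name": "pause", "image": "registry.k8s.io/pause:3.9"}
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = image_puller_daemonset("registry.example.com/terraform-runner:1.0.0")
    print(json.dumps(manifest, indent=2))
```

In this pattern the pull happens when the puller pod starts on a node, so rolling out a new image tag through the DaemonSet is what refreshes the cached copy.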

2. Custom Terraform Image with Pre-initialized Providers

  We created a specialized Terraform image that includes pre-initialized providers. This custom image:
 
  - Runs `terraform init` during the image build process
  - Pre-downloads all necessary providers and modules
  - Caches these components within the image

  This approach eliminates the need to download providers on Terraform initialization every time a new pod starts, which was previously a major contributor to the long spin-up times.
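As a rough sketch of the idea, the helper below renders a Dockerfile that copies a Terraform configuration into the image and runs `terraform init` at build time, so the `.terraform` directory with the downloaded providers and modules is baked into an image layer. The base image tag, the `terraform/` directory name, and the `/workspace` path are assumptions for illustration; this is not our actual build file.

```python
def render_dockerfile(terraform_version: str, config_dir: str = "terraform") -> str:
    """Render a Dockerfile that pre-initializes Terraform providers and modules."""
    lines = [
        f"FROM hashicorp/terraform:{terraform_version}",
        "WORKDIR /workspace",
        # Copy the Terraform configuration that declares providers and modules.
        f"COPY {config_dir}/ ./",
        # Download providers and modules at build time; the backend is
        # configured later, when a pod runs against a real tenant.
        "RUN terraform init -backend=false -input=false",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile("1.9.5"))
```

When a pod later runs `terraform init` in the same working directory, Terraform finds the providers already installed under `.terraform` and has nothing left to download, as long as the required versions have not changed.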

Results and Benefits

The combination of these optimizations yielded impressive results:

- Reduced Spin-up Time for Terraform Pods: From 30-40 seconds down to approximately 5 seconds
- Faster Deployments: Quicker pod initialization leads to faster overall deployments (from 3 minutes to less than a minute)
- Improved Resource Efficiency: Less time spent waiting for pods to become ready
- Enhanced User Experience: Faster response times for tenant-related operations

Looking Ahead

This migration significantly improves our ability to provide a robust, scalable, and efficient ZenML Pro experience. By embracing open-source tools and cloud-native technologies, we're not just solving today's challenges – we're building a foundation for the future of ZenML Pro.

We're excited about the possibilities this new setup brings and are constantly looking for ways to further improve our service. Stay tuned for more updates as we continue to evolve our infrastructure!
