Software Engineering

Infrastructure as Code (IaC) for MLOps with Terraform & ZenML

ZenML Team
Jul 31, 2024
6 min

🧑‍💻 What is Infrastructure as Code?

Infrastructure as Code (IaC) is the practice of provisioning and managing infrastructure through version-controlled code rather than manual processes. Before IaC became a mainstay of the DevOps world, there was no standard way to automate the management of infrastructure resources, especially on hyperscalers like AWS, GCP, and Azure.

Popular tools for IaC

There are many tools in the IaC space by now; however, these few remain the most widely used:

  • Terraform: Arguably the tool that popularized IaC. It remains one of the standards in the space and has its own configuration language, HCL (HashiCorp Configuration Language).
  • Pulumi: A competitor to Terraform, with the key difference that it lets you define infrastructure in general-purpose programming languages such as Python, TypeScript, and Go, rather than in a dedicated configuration language like HCL.
  • AWS CloudFormation, Azure Resource Manager, and Google Cloud Deployment Manager: These are services offered by the cloud providers themselves, each with its own syntax and methodology. They have the advantage of being native to the underlying cloud provider, but the disadvantage of usually constraining your code to that one provider.

For this blog post, I will be focusing on Terraform as it is by far the most popular tool. However, ZenML does have a 1-click deployment feature in the browser that leverages native cloud services like AWS CloudFormation.
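To give a sense of what HCL looks like, here is a minimal Terraform configuration that provisions a single S3 bucket (the bucket name and region are illustrative):

```hcl
# Minimal Terraform configuration: one AWS provider and one S3 bucket.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-west-2"
}

# Bucket names are globally unique; replace with your own.
resource "aws_s3_bucket" "ml_artifacts" {
  bucket = "my-unique-ml-artifacts-bucket"
}
```

Running terraform init followed by terraform apply against this file creates the bucket; terraform destroy removes it again.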

🥘 Infrastructure as Code in MLOps

In many ways, MLOps is an extension of DevOps, so it is only natural that IaC plays as critical a role in MLOps as it does in DevOps. Like any other computing paradigm, MLOps requires users to provision and manage resources on the cloud. ML practitioners can therefore simply reuse the tried and tested IaC practices of DevOps and reap the same rewards. Specifically, for machine learning teams, adopting IaC can lead to the following benefits:

  • Consistent and reproducible rollouts of experiments, training, and deployments
  • A consistent security baseline for machine learning workloads
  • Visibility into spending and potential for cost optimization
  • A link between ML and the rest of the organization's infrastructure deployments
  • Easy onboarding of new teams/projects

📦 MLOps Stacks: Modularized configuration for your infrastructure

ZenML is an MLOps framework that acts as a bridge between machine learning teams and production infrastructure. A ZenML stack is the configuration of tools and infrastructure that your pipelines can run on: a machine learning pipeline written in ZenML executes on whatever stack you have configured.

Figure: A machine learning pipeline loading production data (left) and a ZenML stack (right). The stack consists of a Kubernetes orchestrator, an S3 artifact store, an MLflow experiment tracker, and a Deepchecks data validator.

Out of the box, ZenML’s API assumes that the infrastructure is already provisioned and expects you to register the corresponding components in its database. This can be a bit clunky: it involves enabling the right permissions, executing the correct commands, and ensuring things work as expected across environments. All in all, automating this process by hand is cumbersome!

However, if you look closely, there is a natural link between the infrastructure deployed via IaC and the ZenML stack. What if there were a way to do the provisioning and the registration back to ZenML in one go? This is where the Terraform modules for ZenML stacks come in.

🛫 Terraform Modules for ZenML Stacks

Recently, we published new modules to the HashiCorp registry for provisioning an MLOps stack on each of the popular cloud providers. These Terraform modules set up the necessary infrastructure for a ZenML stack and register the resulting configuration back to a ZenML server. This allows you to easily integrate MLOps into your existing cloud infrastructure without reinventing the wheel!

Here is how to do it:

🛠 Prerequisites

🏗 Resources Created

The Terraform modules create the following resources in your cloud account:

AWS

Resources provisioned:
  1. an S3 bucket
  2. an ECR repository
  3. an IAM user and an access key for it
  4. an IAM role with the minimum necessary permissions to access the S3 bucket, the ECR repository, and the SageMaker service to build and push container images, store artifacts, and run pipelines

Stack components created:
  1. an S3 Artifact Store linked to the S3 bucket
  2. an ECR Container Registry linked to the ECR repository
  3. a SageMaker Orchestrator linked to the AWS account
  4. an AWS Service Connector configured with the IAM role credentials and used to authenticate all ZenML components with the AWS account

GCP

Resources provisioned:
  1. a GCS bucket
  2. a Google Artifact Registry
  3. a Service Account with a Service Account Key and the minimum necessary permissions to access the GCS bucket, the Google Artifact Registry, and the GCP project to build and push container images with Google Cloud Build, store artifacts, and run pipelines with Vertex AI

Stack components created:
  1. a GCP Artifact Store linked to the GCS bucket
  2. a GCP Container Registry linked to the Google Artifact Registry
  3. a Vertex AI Orchestrator linked to the GCP project
  4. a Google Cloud Build Image Builder linked to the GCP project
  5. a GCP Service Connector configured with the GCP service account credentials and used to authenticate all ZenML components with the GCP resources

Azure

Resources provisioned:
  1. an Azure Resource Group with the following child resources:
    1. an Azure Storage Account and a Blob Container
    2. an Azure Container Registry
  2. an Azure Service Principal with a Service Principal Password and the minimum necessary permissions to access the Blob Container, the ACR container registry, and the Azure subscription to build and push container images, store artifacts, and run pipelines with SkyPilot

Stack components created:
  1. an Azure Artifact Store linked to the Azure Storage Account and Blob Container
  2. an ACR Container Registry linked to the Azure Container Registry
  3. an Azure SkyPilot Orchestrator linked to the Azure subscription
  4. an Azure Service Connector configured with the Azure Service Principal credentials and used to authenticate all ZenML components with the Azure resources

🚀 How to use the Terraform modules

Aside from the prerequisites mentioned above, you also need to create a ZenML Service Account API key for your ZenML Server. You can do this by running the following command in a terminal where you have the ZenML CLI installed:

zenml service-account create <service-account-name>

After that, it’s a matter of putting a snippet like the following into a .tf file. Here is an example configuration for AWS (GCP and Azure are similar):

module "zenml_stack" {
  source = "zenml-io/zenml-stack/aws"

  region           = "us-west-2"
  zenml_server_url = "https://your-zenml-server-url.com"
  zenml_api_key    = "ZENKEY_1234567890..."
}
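Hardcoding the API key is fine for a quick test, but for anything you commit to version control you will likely want to pass it in as a sensitive Terraform variable instead. A minimal sketch (the variable name is illustrative):

```hcl
variable "zenml_api_key" {
  description = "API key of the ZenML service account"
  type        = string
  sensitive   = true
}

module "zenml_stack" {
  source = "zenml-io/zenml-stack/aws"

  region           = "us-west-2"
  zenml_server_url = "https://your-zenml-server-url.com"
  zenml_api_key    = var.zenml_api_key
}
```

You can then supply the key through an environment variable, e.g. export TF_VAR_zenml_api_key=..., so it never lands in your repository or in plain-text configuration.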

You can then execute:

terraform init
terraform apply

If you would like to destroy the resources and delete the stack, you can run:

terraform destroy

Wait a few minutes, and you will see a fully configured stack in your ZenML dashboard!

📢 Note that, currently, the Terraform modules only deploy a basic cloud stack with an orchestrator, container registry, image builder, and artifact store. To add more components, please read the docs.

💨 Try it yourself

You can try it yourself by following the instructions above, or by following the guides in the HashiCorp registry for AWS (Source), GCP (Source), or Azure (Source). The corresponding open-source GitHub repositories are also available.

If you need assistance, join our Slack Community and drop a message in the #general channel. We’re known for our quick and helpful responses!

⭐️ Show Your Support

If you find this project helpful, please consider giving ZenML a star on GitHub. Your support helps promote the project and lets others know it's worth checking out.

Thank you for your support! 🌟
