Software Engineering

AWS MLOps Made Easy: Integrating ZenML for Seamless Workflows

Jayesh Sharma
Sep 11, 2024
17 mins

In today’s rapidly evolving tech landscape, Machine Learning Operations (MLOps) remains crucial, even with the rise of Large Language Models (LLMs). Following MLOps best-practices helps in tasks like fine-tuning models and ensuring reproducible machine learning workflows. This blog delves into the intricacies of implementing MLOps on AWS, leveraging services like SageMaker, ECR, S3, EC2, and EKS. We’ll explore how ZenML, an open-source MLOps framework, simplifies the integration and management of these services, enabling seamless transitions between different AWS components and enhancing productivity. Whether you’re a seasoned data scientist or new to MLOps, this guide will provide valuable insights and practical steps to optimize your ML workflows on AWS.

What is MLOps?

Similar to how DevOps benefits software development, MLOps is a set of practices developed to benefit the development of ML systems. MLOps considers every stage of the ML lifecycle, from building, deploying, and serving to monitoring ML models, helping businesses get models to production faster and with higher success rates through the right platform, processes, and people.

A diagram explaining what MLOps is
from Neptune AI

Executed well, these capabilities combine to manage the additional complexity introduced when managing ML systems and reduce model performance degradation. This decreases overhead and operating costs for the business, while enabling the use of advanced analytics, ML-powered decisioning, and unlocking new revenue streams.

Components of an MLOps pipeline

If you are new to the world of MLOps, it is often daunting to be immediately faced with a sea of tools that seemingly all promise and do the same things. It is useful in this case to try to categorize tools in various groups in order to understand their value in your toolchain in a more precise manner.

The most important tool categories are the following:

  • Orchestrator: runs your pipeline code.
  • Artifact Store: stores your artifacts like training data and models.
  • Container Registry: stores Docker images that you might build to containerize your ML pipeline.
  • Model Deployer: helps you deploy your trained models.
  • Step Operator: allows you to run certain steps in your pipeline on specialized compute like GPUs.

and many more.

ZenML defines the concepts of Stacks and Stack Components to represent these categories, each serving a particular function in your MLOps pipeline.

ZenML has Stack Components to represent the tools used in an MLOps pipeline.

MLOps on AWS

AWS offers a comprehensive suite of managed services that can be directly utilized to build your MLOps solution without the need for manual deployment. These services cover various aspects of the MLOps pipeline, providing ready-to-use components. However, it's important to note that while AWS provides these individual services, integrating them into a cohesive MLOps workflow still requires careful planning and configuration on your part.

An example pipeline in AWS to train and deploy ML models.

ECR (Elastic Container Registry)

ECR is AWS's fully-managed Docker container registry. In an MLOps pipeline, ECR can be used to store, manage, and deploy Docker images containing your pipeline code and associated dependencies. This ensures consistency across development, testing, and production environments. Read more about ECR on the official User Guide.

ECR is one of the options you can choose for the container registry stack component of your stack.
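As a rough sketch (the component name and registry URI below are placeholders, not values from this article), registering an existing ECR repository as a ZenML container registry could look like this:

# Register an existing ECR repository as the container registry component.
# The component name and registry URI are placeholders for your own values.
zenml integration install aws -y
zenml container-registry register ecr_registry \
    --flavor=aws \
    --uri=<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com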

S3 (Simple Storage Service):

S3 is AWS's object storage service. In MLOps, S3 can serve as a central repository for storing training data, model artifacts, and other large datasets. It provides durability, scalability, and easy integration with other AWS services, making it an essential component for data management in ML workflows. Read more about S3 on the official User Guide.

S3 buckets fit into the artifact store stack component bracket.
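As a minimal sketch (the bucket path and component name are placeholders), pointing ZenML at an existing bucket looks something like this:

# Register an S3 bucket as the artifact store component.
# The bucket path is a placeholder for your own bucket.
zenml artifact-store register s3_store \
    --flavor=s3 \
    --path=s3://your-mlops-bucket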

EC2 (Elastic Compute Cloud):

EC2 offers resizable compute capacity in the cloud. In MLOps, EC2 instances can be used for various tasks such as data preprocessing, model training, and inference. Its flexibility allows you to choose the right instance type based on your computational needs, from CPU-optimized to GPU-accelerated instances. Read more about EC2 on the official User Guide.

An EC2 instance can serve as an orchestrator stack component, which is where you’d run your pipelines and their steps. You can also host other services on the VM, but that would involve a fair bit of setup on your end.
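One way ZenML runs pipelines on EC2 VMs is through its SkyPilot (AWS) integration, which provisions instances for you. A rough sketch, with a placeholder component name (instance settings are supplied later through pipeline configuration):

# Use EC2 VMs as the orchestrator via ZenML's SkyPilot AWS integration.
# The component name is a placeholder.
zenml integration install skypilot_aws -y
zenml orchestrator register skypilot_aws_orchestrator --flavor=vm_aws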

EKS (Elastic Kubernetes Service):

EKS is a managed Kubernetes service that simplifies the deployment, management, and scaling of containerized applications. In MLOps, EKS can be used to orchestrate your pipelines, deploy ML models, manage scaling based on demand, and ensure high availability of your ML services. A cluster can also be used to host many other services, like an image builder that you might want to use in your workflow. Read more about EKS on the official User Guide.

An EKS cluster can serve as the orchestrator stack component for your pipelines, and it can also host other stack components, such as an image builder or a model deployer.
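As a sketch, assuming you already have an EKS cluster and kubectl access to it (the cluster name, region, and component name are placeholders), registering it as a Kubernetes orchestrator could look like this:

# Make the EKS cluster reachable from kubectl, then register it as an orchestrator.
# Cluster name, region, and context are placeholders for your own setup.
aws eks update-kubeconfig --name my-eks-cluster --region eu-central-1
zenml integration install kubernetes -y
zenml orchestrator register eks_orchestrator \
    --flavor=kubernetes \
    --kubernetes_context=<CONTEXT_NAME_FROM_KUBECONFIG>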

Amazon SageMaker:

Amazon SageMaker is AWS's fully-managed machine learning platform. It covers the entire ML lifecycle, from data preparation and model training to deployment and monitoring. In MLOps, SageMaker can be used for collaborative notebook environments, automated model training, easy deployment, and continuous monitoring of model performance. Learn how you can implement MLOps on the official guide.

You can use SageMaker as:

  • an orchestrator to run your pipelines
  • a step operator when you want to run only specific steps on its compute and run the rest of the pipeline elsewhere, even locally, as sketched below.
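A sketch of both options (component names and the role ARN are placeholders; exact parameter names can vary between ZenML versions):

# SageMaker as an orchestrator for whole pipelines:
zenml orchestrator register sagemaker_orchestrator \
    --flavor=sagemaker \
    --execution_role=arn:aws:iam::<ACCOUNT_ID>:role/<SAGEMAKER_ROLE>

# SageMaker as a step operator for individual, compute-heavy steps:
zenml step-operator register sagemaker_step_operator \
    --flavor=sagemaker \
    --role=arn:aws:iam::<ACCOUNT_ID>:role/<SAGEMAKER_ROLE>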

MLOps on AWS: Real-life use-cases

There are many companies that use AWS extensively for their ML solutions. Here are a few that we liked:

Coca-Cola

Coca-Cola Andina leveraged AWS to enhance its data analytics and machine learning capabilities significantly. They built a data lake on Amazon S3 to store vast amounts of data from various sources, ensuring scalability and accessibility. AWS Lambda was used for serverless computing to process data in real-time, while Amazon RDS managed relational databases efficiently. For machine learning, Amazon SageMaker played a crucial role by enabling the development, training, and deployment of ML models.

The integration of these AWS services facilitated a robust MLOps pipeline, ensuring continuous improvement and deployment of machine learning models to meet business needs. But that said, ensuring seamless collaboration between data scientists, engineers, and business stakeholders was essential. They needed to establish robust workflows and communication channels to align their efforts and achieve the desired outcomes.

Cisco

Cisco leverages AWS for its MLOps by migrating its large language models (LLMs) to Amazon SageMaker and using NVIDIA Triton Inference Server. This migration allows Cisco to separate ML models from applications, which are hosted on Amazon EKS, enhancing efficiency and reducing costs. By using SageMaker endpoints, Cisco can scale models independently, improving development and deployment cycles. Additionally, Cisco employs asynchronous inference to save costs by scaling resources based on demand. This setup enables faster application startup, quicker experiments, and better resource management, ultimately optimizing inference costs and improving operational efficiency.

Fox

FOX Corporation leverages AWS to enhance its media and advertising insights by building a unified data solution. This solution processes up to 30 million data points per second. Key AWS services used include Amazon S3 for data storage, AWS Lambda for serverless computing, Amazon Data Firehose for real-time data streaming, and Amazon Rekognition for image and video analysis. FOX’s use cases include enriching media insights, providing real-time forecasts for NFL games, and developing ATLAS, an AI-driven solution for contextual ad opportunities.

How to decide between EC2, EKS, and SageMaker pipelines

A GIF of cookie monster choosing between options

You must have noticed from the discussion above that you have multiple options for the orchestrator stack component. This section aims to help you choose from them, based on your needs and the strengths and weaknesses of the three services.

It's important to stress that as your ML project evolves, you may need to move between these services to accommodate changing requirements or to experiment with different approaches. This transition can come at a significant cost, including rewriting your MLOps pipelines, reconfiguring credentials and access permissions, and adapting to an entirely new interface.

EC2

Pros:
  • Flexibility to customize ML environments with specific libraries and tools
  • Cost-effective for long-running, resource-intensive ML training jobs
  • Suitable for ML workflows requiring specialized hardware (e.g., specific GPUs)

Cons:
  • Lack of built-in MLOps features, requiring manual implementation
  • Scaling ML workloads can be complex and time-consuming
  • More effort required for reproducibility and version control of ML environments
  • Security and updates need to be managed manually

EKS

Pros:
  • Excellent for orchestrating distributed ML training and serving
  • Automatic scaling and load balancing
  • Enables consistent ML environments across development and production
  • Portable across different cloud providers or on-premises

Cons:
  • Steeper learning curve, especially for those new to Kubernetes
  • Can be overkill for small ML projects
  • Potential for higher costs due to cluster management overhead

SageMaker

Pros:
  • Fully managed ML platform with integrated MLOps capabilities
  • Simplified model development, training, and deployment
  • Built-in support for popular ML frameworks and algorithms
  • Automatic scaling for training and inference
  • Integrated monitoring and model performance tracking

Cons:
  • Can be more expensive than managing your own infrastructure
  • Less flexibility compared to custom solutions on EC2 or EKS
  • Potential vendor lock-in to AWS ecosystem*
  • May have limitations for highly customized ML workflows

* You can avoid vendor lock-in if you use a flexible framework like ZenML that allows you to write your pipelines once and run them anywhere.

To summarize,

  • EC2 is best suited for teams that require maximum flexibility in their ML environments or have specific hardware needs. It's ideal for organizations with the resources to manage infrastructure and implement custom MLOps solutions, particularly for resource-intensive or long-running ML training jobs.
  • EKS is optimal for teams embracing containerization and microservices architecture in their ML workflows. It excels in orchestrating distributed ML training and serving, making it suitable for complex, scalable ML pipelines. However, it requires expertise in Kubernetes and may introduce unnecessary complexity for smaller projects.
  • SageMaker is the go-to choice for teams seeking a fully managed ML platform with integrated MLOps features. It simplifies model development, training, and deployment, making it ideal for organizations wanting to rapidly implement ML pipelines without managing infrastructure. However, it may be less flexible for highly customized workflows and could lead to higher costs for large-scale operations.

Challenges in Adopting MLOps with AWS

  • Adoption within data science department
    Some training might be needed to bring your data science department up to speed with how AWS works, what the different terms mean, and how to manage the myriad of services it offers.
  • Credentials management
    To allow data scientists to run pipelines on AWS, you would have to configure their machines with the right credentials and update them whenever new services are added or when permissions need to change (say when you are moving from one AWS account to another). This process eats up a significant chunk of your time that could have gone into ML development instead.
  • Hard to integrate the different services
    Connecting multiple AWS services into a cohesive MLOps pipeline can be challenging. While writing your ML pipelines, you also have to design modules that deal with tasks like storing/retrieving artifacts from S3 and building/pushing images to ECR, among other things: tasks that shouldn’t fall to data scientists.
  • Hard to write pipelines
    Writing pipeline code that adapts to your usage is challenging. For instance, if you want to make full use of your orchestration environments, you might have to add additional configurations like special instance types, memory and CPU requirements and more.
  • Pipelines are not portable
    The pipelines you write for an EC2 instance won’t work directly on SageMaker; you will have to refactor the code to fit SageMaker’s SDK, defining a ProcessingStep and a Pipeline to begin with. In addition to being a tedious job in itself, this also requires you to study the SDK and learn what options would be best for your use case before you can run the pipeline.
  • Cross-account, cross-team, cross-region collaboration
    Collaborating with people across teams and accounts is challenging, as it involves:
    • careful delegation of permissions to the right people. Different accounts will have their own set of IAM roles and policies and need to be configured separately on the user machine.
    • sharing of information about the services being used. Questions like what bucket do I use for the model artifacts, what configuration should be set for the SageMaker instances, or what nodes to use in the EKS cluster need to be answered through a central control plane or something similar.
  • It is expensive and not very flexible
    Running workloads on the cloud comes at a cost. If only some steps of your pipelines need specialized compute, you will want to run all other steps locally to save costs. Building such a solution adds more complexity to an already complex pipeline, and it also means that every user machine that runs pipelines locally needs to be configured with the right permissions and dependencies.

Real-life examples

In the real-life use cases section above, we saw that although teams worldwide have made great progress implementing MLOps on AWS, they also had to overcome many barriers similar to the ones discussed in this section.

A GIF of a guy walking into a room on fire, holding a pizza

Some concrete examples from the internet worth looking at:

  • “It is so hard to understand how to do something from their docs. Their blogs are good and are the only thing helping me implement at least something.” from this post, showing how confusing it can be at times to adopt new tools and switch your stacks.
  • “The company migrated its two pipelines over to Amazon SageMaker within 8 months.” from this study on a company, EagleView, from AWS Customer Stories, reiterating how most pipelines you write are not portable and you need to learn new tooling in order to migrate to another tool.
  • “SageMaker hides a lot of stuff from you in an attempt to be ‘easy’, but can end up being a pain to figure out if anything goes wrong.” and more from this post on Reddit.
  • “Bloody expensive… Hobby-scale instances are $1000/mo. Which, is crazy” from this post, highlighting how expensive it can be to experiment with your projects on managed AWS services and underlining the need for reliable local testing.
  • "Mistakes that eat AWS credits aren’t able to be granted concessions (read as: refunds), so my $260 in credits are now forever lost to me." from this blog talking about complex pricing for resources that you manage yourself, on AWS.

Step-by-step Tutorial: Using an open-source framework to optimize your MLOps on AWS

What is ZenML?

ZenML is an open-source MLOps framework designed for creating portable, production-ready machine learning pipelines. It decouples infrastructure from code, enabling seamless collaboration among developers, and supports various cloud providers and orchestration tools.

Key features include:

  • Infrastructure Agnostic: Run pipelines on AWS, GCP, Azure, Kubernetes, and more without changing code.
  • Automatic Logging: Track code, data, and model metadata automatically.
  • Version Control: Ensure reproducibility with built-in version control for ML workflows.
  • Flexibility: Switch between different backends and adapt your stack as needs evolve.
  • Security: SOC2 compliant, ensuring data security and confidentiality.
  • Seamless Transition: Move from local to cloud and from Jupyter notebooks to production pipelines effortlessly.
  • Rapid Iteration: Smart caching accelerates iterations, allowing for quick experimentation with ML and GenAI models.
  • Scalability: Easily scale to major clouds or Kubernetes, supporting both small models and large language models.

ZenML simplifies the MLOps process, allowing data scientists and ML engineers to focus on innovation rather than infrastructure management.

How ZenML uses various AWS services

Let’s see how ZenML can help alleviate most of the challenges that we have listed above.

ZenML helps you do MLOps with AWS

Service Integration

ZenML brings together all of the AWS services we have discussed so far as stack components that you can compose into ZenML Stacks. You can build a stack by choosing at least these three stack components, among many others: an orchestrator (EC2, EKS, or SageMaker), an artifact store (S3), and a container registry (ECR).

The graphic below shows how the pipeline code is separate from the stack that it runs on, making it very simple to switch stacks, and consequently services, without having to change your code at all. For example, you may realize at some point in your MLOps journey that running workloads on SageMaker might work better than running them on EC2. Switching this environment is literally a single command away!

ZenML integrates all the AWS services that you need for MLOps into a ZenML Stack that you can use in a pipeline.
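As a sketch that reuses the placeholder component names from the sections above, composing a stack and making it active takes only a couple of commands; switching orchestrators later just means setting a different stack:

# Compose the registered components into a stack and activate it.
# All names are placeholders from the earlier sketches.
zenml stack register aws_mlops_stack \
    -o sagemaker_orchestrator \
    -a s3_store \
    -c ecr_registry
zenml stack set aws_mlops_stack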

Managing Credentials

Here’s a situation: you have developed a great MLOps pipeline that trains your model by reading data from an S3 bucket, loading the model, training it, and storing it back in the bucket. Now, you want others in your team to also be able to make changes to this pipeline and run it themselves. A common impediment to speedy execution here is the local setup of credentials. In order to run this pipeline successfully, all team members would need to set up their AWS config locally with the right role assumptions and secret keys. In addition to being very time-consuming, this also exposes your secrets to a greater attack surface.

Using your cloud infrastructure with ZenML as a Stack

With ZenML, you can leave credential management to your ZenML Server. ZenML has a concept of Service Connectors, which are a powerful way to store and manage credentials centrally. Users wanting to run a pipeline on some stack don’t need any credentials locally, or even any additional libraries installed. ZenML knows how to connect to any stack components they might need, for example the container registry, through the service connectors you configure once on your server.
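A minimal sketch of how this might look (the connector name is a placeholder; --auto-configure picks up whatever credentials your local AWS profile currently has and stores them on the server):

# Register an AWS service connector on the ZenML Server, then attach a
# component to it so users never need local AWS credentials.
zenml service-connector register aws_dev_connector --type aws --auto-configure
zenml artifact-store connect s3_store --connector aws_dev_connector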

Use Case: Managing different AWS accounts using ZenML

The number of AWS accounts you use can range anywhere from one to hundreds, depending on your use case. As such, the following become important:

  • Distributing access to the right accounts to the right people: if managing one set of credentials were not hard enough, you need to set up the local environment of every team member to include the various accounts and the corresponding roles that you want them to use.
  • Knowing which resources to use for a specific project or environment: once the setup is done, users need to be aware of which AWS profiles to use when running certain pipelines. For example, you want to make sure that staging pipelines only run on the development account and don’t use resources from the production account.

Doing this work manually leaves room for errors, both during setup and also each time someone runs a pipeline. ZenML takes care of this intrinsically, through the use of ZenML Stacks, configured with the right service connectors.

Here’s what the workflow for using multiple accounts with ZenML looks like:

  • You create a role in the development account that ZenML would use to talk to your services. This role might have access to S3, ECR, EKS and any other services you need.
  • Then, you register a service connector, let’s call it aws_dev_connector, using this IAM role as the authentication method.
  • You can now register your stack components on the ZenML Server using this connector. This ensures that these stack component objects are now firmly linked to your development account.
  • Any stack that you create with these components will now use resources only in the development account, and any user who wants to use them doesn’t need to know the details of the IAM role that you configured in the first step.
  • You can repeat the process for the production account (see the sketch after this list).
Using multiple AWS accounts for MLOps with ZenML
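Here is that sketch, assuming one IAM role per account (connector, component, and stack names are placeholders; the -i flag walks you through the authentication details interactively):

# One service connector per AWS account.
zenml service-connector register aws_dev_connector --type aws --auth-method iam-role -i
zenml service-connector register aws_prod_connector --type aws --auth-method iam-role -i

# Components connected to aws_dev_connector can only reach the development account.
# (dev_s3_store, dev_ecr_registry, and dev_orchestrator are registered beforehand.)
zenml artifact-store connect dev_s3_store --connector aws_dev_connector
zenml container-registry connect dev_ecr_registry --connector aws_dev_connector
zenml stack register dev_stack -o dev_orchestrator -a dev_s3_store -c dev_ecr_registry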

You can also restrict access to stacks based on the users. For example, some users might only need access to the stacks that utilize the development account. You can accomplish this using ZenML’s RBAC features. Note that this is only available on ZenML Pro.

📝 Note that this example uses the IAM role method when talking about service connectors. You can choose any other method too, including STS tokens or a secret key directly. Learn more here.

Connect your AWS to ZenML in one click

ZenML makes deploying your first cloud stack as easy as pressing a few buttons. This feature lets you deploy all the necessary pieces of infrastructure (like SageMaker, S3, and ECR) and sets up a remote stack that you can instantly use with your pipelines. You don’t have to worry about what permissions to set or what rules to configure; ZenML does it all in the background for you!

Steps to get your deployed AWS stack

  • You would first need to have a deployed ZenML Server. If you don’t have one, follow the guide here.
  • Go to “Stacks”, click “New Stack”, and you’re greeted with the following screen that lets you choose between provisioning new components or reusing your existing resources. (Scan your Cloud is one of my favorite features, as it gives you an overview of what resources already exist in your account and can be used as ZenML stack components.)
ZenML helps you set up infrastructure to run your ML pipelines

If you choose the in-browser experience, you only have to provide a Location and a name for your stack. You can also review what resources are created and what service connector permissions are used before proceeding. You also get a cost estimate based on AWS pricing for the components being provisioned.

ZenML Stack Infrastructure spin up page that shows components and how much they would cost.
  • This would ultimately take you to a CloudFormation page on your AWS account where you can review the script and then deploy your resources.

Running Workloads on AWS

Now that you understand, at least in writing, how ZenML fast-tracks your MLOps adoption on AWS, it’s time we run some pipelines! I will use an LLM finetuning example, available on our zenml-projects repo, to demonstrate account switching and quick development with ease. Here’s what we will do:

  • We first run this project on an EC2 instance on our development AWS account. This uses the skypilot-aws-dev stack, where all components are configured with a development service connector providing access to the dev account.
Looking at a stack description in the ZenML Dashboard
  • We then run the same example on SageMaker in our production account, without having to make any changes to our existing code. All you need to do is switch your ZenML stack: zenml stack set sagemaker-prod.

📝 Note that you might have to change some orchestrator settings, like which region or accelerator to use, since those are service-specific. However, this doesn’t require a change in code; just the configuration changes.

Looking at a stack description in the ZenML Dashboard

What is a pipeline?

A pipeline in ZenML consists of a series of steps, organized in any order that makes sense for your use case. You write your pipeline and step code as plain Python functions with decorators, which doesn’t require learning any additional syntax.

from zenml import pipeline

# This is how a simple pipeline might look; step_1 and step_2 are
# regular Python functions decorated with ZenML's @step decorator.
@pipeline
def my_pipeline():
    output_step_one = step_1()
    step_2(input_one="hello", input_two=output_step_one)

Executing the Pipeline is as easy as calling the function that you decorated with the @pipeline decorator.

if __name__ == "__main__":
    my_pipeline()
    

Running our LLM finetuning pipeline

Follow these steps to run the llm-lora-finetuning project:

  • Clone the zenml-projects repo and go into the llm-lora-finetuning directory.
  • Connect to your ZenML server. You need a remote deployment of ZenML to be able to use remote stack components. If you need one quickly, you can spin up a ZenML Pro tenant for free: https://www.zenml.io/pro
  • Set the stack to your preferred stack. In our case, we choose the skypilot stack on our dev AWS account.
zenml stack set skypilot-aws-dev

We also edit the config file to include some SkyPilot-specific settings. You can create a new one based on the ones that exist, or you can modify the existing orchestrator_finetune.yaml file. Include the following for SkyPilot, details of which you can find in our SkyPilot orchestrator docs.

settings:
  docker:
    ...
  orchestrator.vm_aws:
    cpus: "2"
    memory: "16"
    accelerators: "V100:2"
    use_spot: true
    region: eu-central-1
    zone: eu-central-1a
...

You are now all set. Run the following command to execute your pipeline. The run.py file takes your command line arguments and then calls the pipeline function with the right config YAML file.

python run.py --config orchestrator_finetune.yaml

You will now see logs that indicate that your pipeline has started. ZenML does the job of building a Docker image for your code, including all dependencies and then submits a job to the right SkyPilot cluster based on your settings.

Observing the run on the ZenML Dashboard

Once a run has started, you can head to your ZenML Dashboard to track it visually.

  • There’s a directed acyclic graph (DAG) of the pipeline that ZenML constructs based on the order of execution of steps.
The pipeline run view on the ZenML Dashboard showing the DAG and relevant details.

You can click on any step to learn more about it. The step code, logs, metadata all show up in a side panel.

Step description on the ZenML Dashboard showing details like the start and end times.

You can also click any output of a step to see information like the artifact store path where it is stored, along with other metadata about it.

The artifact view on the pipeline run page that shows you what step produced it and when, among other things.

Model Control Plane

This is a really powerful feature that lets you bring together the different pipelines, artifacts, and models that pertain to a single project. You can find all the model versions that your pipeline executions produce and promote or demote them to production or staging based on the information you see.

The Model page on the Model Control Plane showing how easy it is to promote model versions.

You can also log additional metadata essential to your application, like metrics, to a model version for easy comparisons.

You can log metadata like evaluation metrics for each model version, and it shows up on the Model Control Plane.

Conclusion

In conclusion, while AWS offers a robust suite of services for MLOps, the complexity and integration challenges can be daunting. ZenML simplifies this process by providing a seamless, infrastructure-agnostic framework that integrates effortlessly with AWS services like SageMaker, ECR, S3, EC2, and EKS.

By decoupling infrastructure from code, ZenML reduces the overhead of managing credentials, configuring services, and writing adaptable pipelines. This not only accelerates the development and deployment of ML models but also ensures reproducibility and scalability. For teams looking to leverage AWS for their MLOps needs, ZenML offers a streamlined, efficient, and flexible solution, making it the ideal choice for optimizing ML workflows on AWS.

Looking to Get Ahead in MLOps & LLMOps?

Subscribe to the ZenML newsletter and receive regular product updates, tutorials, examples, and more articles like this one.