Software Engineering

Supercharge Open Source ML Workflows with ZenML and SkyPilot

Hamza Tahir
Aug 30, 2024
5 mins

Whether you're an ML engineer focusing on infrastructure or a data scientist diving into model development, the combination of ZenML and SkyPilot offers a robust solution for managing ML workflows. This integration bridges the gap between rapid experimentation and scalable cloud execution.

Best part? Both tools are free and open source!

Why ZenML + SkyPilot?

SkyPilot brings its own set of powerful capabilities to the world of MLOps/LLMOps. As an open-source orchestration framework, SkyPilot excels in cloud-agnostic workloads, allowing users to run AI jobs on any cloud with minimal code changes. It offers intelligent cloud selection based on cost and availability, automatic spot instance handling for cost savings, and efficient management of cloud storage. SkyPilot's ability to easily launch, scale, and manage cloud resources makes it an ideal complement to ZenML's MLOps functionalities.
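
For context, launching a training job with plain SkyPilot looks roughly like this (a minimal sketch using SkyPilot's Python API; the script name and resource shape are illustrative):

import sky

# Define the job: what to install and what to run on the provisioned VM
task = sky.Task(
    setup="pip install torch transformers datasets",
    run="python train.py",  # hypothetical training script
)

# Ask for a GPU and allow spot instances for cost savings; SkyPilot picks
# the cheapest available cloud/region that satisfies the request
task.set_resources(sky.Resources(accelerators="V100:1", use_spot=True))

sky.launch(task, cluster_name="bert-finetune")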

ZenML is an open-source MLOps framework that also abstracts away infrastructure complexity for cloud-agnostic ML workloads, but it focuses less on orchestration itself. Instead, it emphasizes observability, reproducibility, and the production stage of ML development.

Therefore, both products have clear synergies. Here are the advantages of using both together:

  1. Python-Centric Workflows: Define pipelines in Python, even within notebooks, instead of YAML.
  2. Abstracted Orchestration: Hide infrastructure details, focusing on ML logic.
  3. Flexible Execution: Switch between local and cloud runs with minimal changes.
  4. Comprehensive Tracking: Automatically version code, metadata, and data.
  5. Automated Containerization: Simplify dependency management and reproducibility.

Implementation Example

A good way to see the difference from good-old plain SkyPilot is to take the quickstart training example and see how it works with ZenML.

First, install the required packages:

pip install "zenml[server]" torch transformers datasets
zenml integration install skypilot_aws huggingface -y
# connect to your deployed ZenML server
zenml login
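
You'll also need a ZenML stack with a SkyPilot orchestrator registered and set as active. Assuming AWS, that looks something like the following (the component and stack names are illustrative, and the stack also needs a cloud artifact store):

zenml orchestrator register skypilot_orchestrator --flavor vm_aws
zenml stack register skypilot_stack -o skypilot_orchestrator -a <your-cloud-artifact-store>
zenml stack set skypilot_stack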

Now, start writing your workflows. Here's a multi-step pipeline for fine-tuning a BERT model on the GLUE MRPC dataset:

from zenml import pipeline, step
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments, BertForSequenceClassification
from typing import Tuple
import datasets

# Caching disabled so the dataset is freshly loaded on every run
@step(enable_cache=False)
def load_data() -> datasets.DatasetDict:
    dataset = datasets.load_dataset("glue", "mrpc")
    return dataset

# Tokenize both sentences of each MRPC pair with the BERT tokenizer
@step
def preprocess_data(dataset: datasets.DatasetDict) -> Tuple[datasets.Dataset, datasets.Dataset]:
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    
    def tokenize_function(examples):
        return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length")
    
    tokenized_datasets = dataset.map(tokenize_function, batched=True)
    return tokenized_datasets["train"], tokenized_datasets["validation"]

# Fine-tune BERT for binary classification (paraphrase / not paraphrase)
@step
def train_model(train_dataset: datasets.Dataset, eval_dataset: datasets.Dataset) -> BertForSequenceClassification:
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
    
    training_args = TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir='./logs',
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset
    )
    
    trainer.train()
    return model

# Evaluate the fine-tuned model and return the metrics dictionary
@step
def evaluate_model(model: BertForSequenceClassification, eval_dataset: datasets.Dataset) -> dict:
    trainer = Trainer(model=model)
    results = trainer.evaluate(eval_dataset)
    return results

# Wire the steps together; ZenML infers the DAG from the data flow
@pipeline
def glue_fine_tuning_pipeline():
    dataset = load_data()
    train_dataset, eval_dataset = preprocess_data(dataset)
    model = train_model(train_dataset, eval_dataset)
    results = evaluate_model(model, eval_dataset)

if __name__ == "__main__":
    glue_fine_tuning_pipeline()

Running the Pipeline

1. Local Execution:

glue_fine_tuning_pipeline()

2. SkyPilot Execution:

from zenml.config import DockerSettings
from zenml.integrations.skypilot_aws.flavors import SkypilotAWSOrchestratorSettings

docker_settings = DockerSettings(
    required_integrations=["huggingface"],
    requirements=["torch", "transformers", "datasets"],
)
skypilot_settings = SkypilotAWSOrchestratorSettings(
    instance_type="p3.2xlarge",
    use_spot=True,
    region="us-west-2"
)

glue_fine_tuning_pipeline.with_options(
    config_path="config.yaml",
    settings={
        "docker": docker_settings,
        "orchestrator.vm_aws": skypilot_settings
    }
)()
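
SkyPilot settings can also be scoped to a single step, so only the GPU-hungry training step requests (and pays for) a GPU while lighter steps run on cheaper machines. Here's a sketch, assuming per-step settings are supported under the same vm_aws settings key:

from zenml import step
from zenml.integrations.skypilot_aws.flavors import SkypilotAWSOrchestratorSettings

# Request a GPU spot instance for this step only; other steps fall back
# to the orchestrator's default (cheaper) machine
gpu_settings = SkypilotAWSOrchestratorSettings(
    instance_type="p3.2xlarge",
    use_spot=True,
)

@step(settings={"orchestrator.vm_aws": gpu_settings})
def train_model_on_gpu() -> None:
    ...  # same training logic as train_model above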

This demonstrates the ease of transitioning from local to cloud execution without altering the core pipeline logic. In both cases, this is how it will show up on the dashboard:

[Dashboard screenshots: the pipeline DAG with the train_model step's code, the preprocess_data output details, and the run overview.]

Notice how much more shared, collaborative, and observable this run is compared to running it ad hoc. This is the power of a shared MLOps framework.
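
Because every run is tracked on the ZenML server, anyone on the team can also fetch the results programmatically; a minimal sketch using ZenML's Python client:

from zenml.client import Client

# Load the most recent run of the pipeline and its evaluation metrics
run = Client().get_pipeline("glue_fine_tuning_pipeline").last_run
metrics = run.steps["evaluate_model"].output.load()
print(metrics)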

MLOps Platform Perspective: Enhancing Team Productivity at Scale

Integrating ZenML with SkyPilot offers significant advantages for scaling ML operations across larger data science organizations:

  1. Resource Optimization: Centralized tracking of cloud resource usage across all ML projects enables better allocation and cost management.
  2. Standardization: Enforce consistent workflows and best practices across diverse teams and projects.
  3. Collaboration: Improved visibility into model development processes and results fosters knowledge sharing and reduces redundant work.
  4. Unified Interface: A single platform for managing ML experiments, models, and deployments streamlines operations.
  5. Scalability: Seamlessly transition from experimentation to production-scale workflows without changing tools.

Instead of fragmented tooling and ad-hoc scripts, ZenML provides a centralized interface that tracks experiments, models, and metrics. This comprehensive view enables data science leaders to make informed decisions about resource allocation and project priorities, while the underlying SkyPilot integration ensures efficient use of cloud resources.

Key Advantages

  1. Clear Separation of Concerns: Isolated steps improve maintainability and reusability.
  2. Flexible Resource Configuration: Adjust cloud resources via simple ZenML settings.
  3. Version Control: Automatic tracking of data, code, and model versions.
  4. Cost Optimization: Leverage SkyPilot's spot instance and multi-region pricing features.
  5. Reproducibility: Containerized environments ensure consistent execution across different environments.

Conclusion

The ZenML + SkyPilot integration offers a powerful solution for ML teams, from individual contributors to large-scale data science operations. It combines the simplicity of ZenML's pipeline abstraction with the efficiency of SkyPilot's cloud orchestration. This approach maintains agility throughout the ML lifecycle while providing the structure necessary for scaling ML operations.

By abstracting infrastructure complexities, this integration allows data scientists and ML engineers to focus on model development and experimentation. Simultaneously, it gives MLOps teams the tools to standardize practices, optimize resources, and foster collaboration across the organization. The seamless transition between local and cloud environments, coupled with comprehensive versioning and tracking, makes this an invaluable asset for modern ML workflows in organizations of any size.

Try out ZenML with SkyPilot today with the starter guide, and let us know on Slack how it went!
