Supercharge Open Source ML Workflows with ZenML and SkyPilot

Whether you're an ML engineer focusing on infrastructure or a data scientist diving into model development, the combination of ZenML and SkyPilot offers a robust solution for managing ML workflows. This integration bridges the gap between rapid experimentation and scalable cloud execution.

Why ZenML + SkyPilot?

SkyPilot brings its own set of powerful capabilities to the world of MLOps/LLMOps. As an open-source orchestration framework, SkyPilot excels in cloud-agnostic workloads, allowing users to run AI jobs on any cloud with minimal code changes. It offers intelligent cloud selection based on cost and availability, automatic spot instance handling for cost savings, and efficient management of cloud storage. SkyPilot's ability to easily launch, scale, and manage cloud resources makes it an ideal complement to ZenML's MLOps functionalities.
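
To make the idea of cost- and availability-based selection concrete, here is a toy sketch in plain Python. It is not SkyPilot's optimizer or API: the `Offer` type, the `pick_cheapest` helper, and every provider name and price are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Offer:
    """A hypothetical instance offering; all values used below are invented."""
    provider: str
    hourly_price: float
    is_spot: bool
    available: bool


def pick_cheapest(offers: Sequence[Offer], allow_spot: bool = True) -> Optional[Offer]:
    """Return the cheapest available offer, or None if nothing qualifies."""
    candidates = [
        o for o in offers
        if o.available and (allow_spot or not o.is_spot)
    ]
    return min(candidates, key=lambda o: o.hourly_price, default=None)


offers = [
    Offer("cloud-a", 3.00, is_spot=False, available=True),
    Offer("cloud-a", 0.90, is_spot=True, available=False),  # cheapest, but no capacity
    Offer("cloud-b", 1.10, is_spot=True, available=True),
    Offer("cloud-c", 2.50, is_spot=False, available=True),
]

best = pick_cheapest(offers)                           # spot allowed
best_on_demand = pick_cheapest(offers, allow_spot=False)  # on-demand only
```

The cheapest offer is skipped because it has no capacity, which is the kind of trade-off between price and availability described above.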

ZenML is an open-source MLOps framework that also abstracts away infrastructure complexity for cloud-agnostic ML workloads, but focuses less on the orchestration itself. Instead, it emphasizes observability, reproducibility, and the production stage of ML development.

The two products therefore complement each other: SkyPilot launches and manages the cloud resources, while ZenML adds pipeline structure, observability, and reproducibility on top.

Implementation Example

A good way to see the difference from plain SkyPilot is to take the quickstart training example and see how it works with ZenML. Start by installing the dependencies and logging in to a ZenML server:

pip install "zenml[server]" "zenml[skypilot]" torch transformers datasets
zenml integration install huggingface -y
# run zenml login on a deployed server
zenml login

Now, start writing your workflows. Here's a multi-step pipeline for fine-tuning a BERT model on the GLUE MRPC dataset:

from zenml import pipeline, step
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments, BertForSequenceClassification
from typing import Tuple
import datasets

@step(enable_cache=False)
def load_data() -> datasets.DatasetDict:
    dataset = datasets.load_dataset("glue", "mrpc")
    return dataset

@step
def preprocess_data(dataset: datasets.DatasetDict) -> Tuple[datasets.Dataset, datasets.Dataset]:
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    
    def tokenize_function(examples):
        return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length")
    
    tokenized_datasets = dataset.map(tokenize_function, batched=True)
    return tokenized_datasets["train"], tokenized_datasets["validation"]

@step
def train_model(train_dataset: datasets.Dataset, eval_dataset: datasets.Dataset) -> BertForSequenceClassification:
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
    
    training_args = TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir='./logs',
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset
    )
    
    trainer.train()
    return model

@step
def evaluate_model(model: BertForSequenceClassification, eval_dataset: datasets.Dataset) -> dict:
    trainer = Trainer(model=model)
    results = trainer.evaluate(eval_dataset)
    return results

@pipeline
def glue_fine_tuning_pipeline():
    dataset = load_data()
    train_dataset, eval_dataset = preprocess_data(dataset)
    model = train_model(train_dataset, eval_dataset)
    results = evaluate_model(model, eval_dataset)

if __name__ == "__main__":
    glue_fine_tuning_pipeline()
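
One thing to be aware of: a `Trainer` created without a `compute_metrics` callback typically reports the evaluation loss and runtime statistics, but no task metric such as accuracy. A minimal sketch of such a callback follows. The name `compute_accuracy` is ours, and it assumes the predictions are per-class logits for a single-label classification task such as MRPC.

```python
from typing import Dict, Sequence, Tuple


def compute_accuracy(
    eval_pred: Tuple[Sequence[Sequence[float]], Sequence[int]]
) -> Dict[str, float]:
    """Turn a (logits, labels) pair into an accuracy score.

    Works on nested lists as well as NumPy arrays, so it can be
    unit-tested without loading a model.
    """
    logits, labels = eval_pred
    # Argmax over the class dimension for every example.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Tiny hand-made batch: three examples, two classes.
metrics = compute_accuracy(([[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]], [1, 0, 0]))
```

In `evaluate_model`, it could be passed as `Trainer(model=model, compute_metrics=compute_accuracy)`. The `Trainer` hands the callback an `EvalPrediction`, which unpacks into predictions and label ids, and prefixes the returned keys with `eval_`.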

Running the Pipeline

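To run the pipeline locally, call it directly:

glue_fine_tuning_pipeline()

To run the same pipeline on AWS through SkyPilot, pass Docker and orchestrator settings: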

from zenml.config import DockerSettings
from zenml.integrations.skypilot.flavors import SkypilotOrchestratorSettings

docker_settings = DockerSettings(
    required_integrations=["huggingface"],
    requirements=["torch", "transformers", "datasets"],
)
skypilot_settings = SkypilotOrchestratorSettings(
    instance_type="p3.2xlarge",
    use_spot=True,
    region="us-west-2"
)

glue_fine_tuning_pipeline.with_options(
    config_path="config.yaml",
    settings={
        "docker": docker_settings,
        "orchestrator.vm_aws": skypilot_settings
    }
)()
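
One way to keep both modes behind a single entry point is a small helper that returns the settings for a given target. The sketch below rests on several assumptions: `build_settings` and the `target` values are hypothetical names, the settings are written as plain dictionaries rather than the settings classes, and the values are copied from the example above.

```python
from typing import Any, Dict


def build_settings(target: str) -> Dict[str, Dict[str, Any]]:
    """Return pipeline settings for a run target ("local" or "cloud")."""
    if target == "local":
        # No extra settings: the pipeline runs with its defaults.
        return {}
    if target == "cloud":
        return {
            "docker": {
                "required_integrations": ["huggingface"],
                "requirements": ["torch", "transformers", "datasets"],
            },
            "orchestrator.vm_aws": {
                "instance_type": "p3.2xlarge",
                "use_spot": True,
                "region": "us-west-2",
            },
        }
    raise ValueError(f"Unknown target: {target!r}")


local_settings = build_settings("local")
cloud_settings = build_settings("cloud")
```

The call site could then become `glue_fine_tuning_pipeline.with_options(settings=build_settings(target))()`, assuming the active stack matches the chosen target. The pipeline definition itself stays untouched.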

This demonstrates how easily you can move from local to cloud execution without altering the core pipeline logic. In both cases, the run shows up on the ZenML dashboard.

Notice how much more shared, collaborative, and observable this run is compared with running it ad hoc. This is the power of a shared MLOps framework.

MLOps Platform Perspective: Enhancing Team Productivity at Scale

Integrating ZenML with SkyPilot offers significant advantages for scaling ML operations across larger data science organizations:

Instead of fragmented tooling and ad-hoc scripts, ZenML provides a centralized interface that tracks experiments, models, and metrics. This comprehensive view enables data science leaders to make informed decisions about resource allocation and project priorities, while the underlying SkyPilot integration ensures efficient use of cloud resources.

Key Advantages

Conclusion

The ZenML + SkyPilot integration offers a powerful solution for ML teams, from individual contributors to large-scale data science operations. It combines the simplicity of ZenML's pipeline abstraction with the efficiency of SkyPilot's cloud orchestration. This approach maintains agility throughout the ML lifecycle while providing the structure necessary for scaling ML operations.

By abstracting infrastructure complexities, this integration allows data scientists and ML engineers to focus on model development and experimentation. Simultaneously, it gives MLOps teams the tools to standardize practices, optimize resources, and foster collaboration across the organization. The seamless transition between local and cloud environments, coupled with comprehensive versioning and tracking, makes this an invaluable asset for modern ML workflows in organizations of any size.

Try out ZenML with SkyPilot today with the starter guide, and let us know on Slack how it went!