Harness the Power of Databricks for Scalable ML Pipelines with ZenML

Seamlessly integrate ZenML with Databricks to leverage its distributed computing capabilities for efficient and scalable machine learning workflows. This integration enables data scientists and engineers to run their ZenML pipelines on Databricks, taking advantage of its optimized environment for big data processing and ML workloads.

Features with ZenML

Effortlessly orchestrate ZenML pipelines on Databricks infrastructure
Leverage Databricks' distributed computing power for large-scale ML tasks
Seamlessly integrate with other Databricks services and tools
Monitor and manage pipeline runs through the Databricks UI
Schedule pipelines using Databricks' native scheduling capabilities

Main Features of Databricks

Optimized for big data processing and machine learning workloads
Collaborative environment for data scientists, engineers, and analysts
Scalable and high-performance distributed computing
Integrated with popular data and ML frameworks (e.g., Spark, TensorFlow, PyTorch)
Comprehensive security and governance features

How to use ZenML with Databricks


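from zenml import pipeline
from zenml.integrations.databricks.flavors.databricks_orchestrator_flavor import DatabricksOrchestratorSettings

# Placeholder: replace with the ID of your own Databricks cluster policy.
POLICY_ID = "<your-policy-id>"

# Cluster configuration used when the pipeline runs on Databricks.
databricks_settings = DatabricksOrchestratorSettings(
    spark_version="15.3.x-scala2.12",
    num_workers="3",
    node_type_id="Standard_D4s_v5",
    policy_id=POLICY_ID,
    autoscale=(2, 3),
)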

@pipeline(
    settings={
        "orchestrator.databricks": databricks_settings,
    }
)
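def my_pipeline():
    # load_data, preprocess_data, train_model, and evaluate_model are assumed
    # to be ZenML steps (functions decorated with @step) defined elsewhere.
    load_data()
    preprocess_data()
    train_model()
    evaluate_model()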

# Calling the decorated pipeline function runs it on the active stack.
my_pipeline()

This example configures the Databricks orchestrator in ZenML. The DatabricksOrchestratorSettings object specifies the Spark version, number of workers, node type, cluster policy, and autoscaling range for the cluster. The settings are passed to the @pipeline decorator through its settings parameter, under the "orchestrator.databricks" key. The pipeline function then calls its steps in order, and the last line runs the pipeline.
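The settings values above are usually supplied per environment, so it can help to collect and check them before building the settings object. The following is a minimal sketch using only the standard library. The helper name build_databricks_settings_kwargs and the environment variable DATABRICKS_POLICY_ID are hypothetical, and it assumes the autoscale tuple is ordered (min_workers, max_workers).

```python
import os
from typing import Any, Dict, Optional, Tuple


def build_databricks_settings_kwargs(
    spark_version: str,
    node_type_id: str,
    autoscale: Tuple[int, int],
    policy_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Validate the values and return keyword arguments for the settings object.

    Hypothetical helper, not part of ZenML.
    """
    min_workers, max_workers = autoscale
    # Assumption: autoscale is (min_workers, max_workers).
    if min_workers < 0 or max_workers < min_workers:
        raise ValueError(
            f"autoscale must be (min, max) with 0 <= min <= max, got {autoscale}"
        )

    kwargs: Dict[str, Any] = {
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "autoscale": (min_workers, max_workers),
    }

    # Fall back to a hypothetical environment variable so the policy ID
    # does not have to be hard-coded in the pipeline file.
    resolved_policy_id = policy_id or os.environ.get("DATABRICKS_POLICY_ID")
    if resolved_policy_id:
        kwargs["policy_id"] = resolved_policy_id
    return kwargs


settings_kwargs = build_databricks_settings_kwargs(
    spark_version="15.3.x-scala2.12",
    node_type_id="Standard_D4s_v5",
    autoscale=(2, 3),
    policy_id="example-policy-id",  # placeholder, not a real policy ID
)
```

With the Databricks integration installed, the resulting dictionary could be unpacked into the settings class from the example above, as in DatabricksOrchestratorSettings(**settings_kwargs). Checking the values first means a reversed autoscale range fails locally instead of at cluster creation.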

Additional Resources

GitHub: ZenML Databricks Integration Example

ZenML Databricks Orchestrator Documentation