Ensure Data Quality and Consistency in Your ML Pipelines with Great Expectations and ZenML
Integrate Great Expectations with ZenML to seamlessly incorporate data profiling, testing, and documentation into your ML workflows. This powerful combination allows you to maintain high data quality standards, improve communication, and enhance observability throughout your ML pipeline.
Features with ZenML
- Seamless integration of Great Expectations data validation within ZenML pipelines
- Automated storage and versioning of Expectation Suites and Validation Results using ZenML's Artifact Store
- Easy visualization of Great Expectations artifacts directly in the ZenML dashboard or Jupyter notebooks
- Flexible deployment options for stores to leverage existing Great Expectations configurations or let ZenML manage the setup
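To use these features, the Great Expectations data validator must be part of your active ZenML stack. The following sketch shows how that registration typically looks with the ZenML CLI; exact flag names (notably `--context_root_dir`) may vary across ZenML versions, so treat this as an outline rather than a definitive recipe:

```shell
# Install the integration into your ZenML environment.
zenml integration install great_expectations -y

# Option 1: let ZenML manage the Great Expectations setup end to end.
zenml data-validator register ge_validator --flavor=great_expectations

# Option 2: reuse an existing Great Expectations project instead
# (point the data validator at its root directory; flag name may
# differ between ZenML versions):
# zenml data-validator register ge_validator \
#     --flavor=great_expectations \
#     --context_root_dir=/path/to/great_expectations

# Register a stack that includes the data validator and activate it.
zenml stack register ge_stack -o default -a default -dv ge_validator --set
```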
Main Features
- Automated data profiling to generate validation rules (Expectations) based on dataset properties
- Comprehensive data quality checks using predefined or inferred Expectations
- Human-readable documentation of validation rules, quality checks, and results
- Support for various data formats and sources, with ZenML currently supporting pandas DataFrames
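To illustrate what automated profiling produces, here is a minimal plain-Python sketch (the function name and sample values are hypothetical, not the ZenML or Great Expectations API) of the kind of rule a profiler infers from a dataset's observed properties:

```python
def infer_between_expectation(column_name, values):
    """Sketch of automated profiling: derive a validation rule
    (an Expectation) from a column's observed value range."""
    return {
        "expectation_name": "expect_column_values_to_be_between",
        "expectation_args": {
            "column": column_name,
            "min_value": min(values),
            "max_value": max(values),
        },
    }

# Profiling a sample of a numeric column yields a concrete, reusable rule.
rule = infer_between_expectation("X_Minimum", [4, 17, 250, 1988])
print(rule["expectation_args"])  # → {'column': 'X_Minimum', 'min_value': 4, 'max_value': 1988}
```

In Great Expectations itself, this inference is performed by its profilers, and the resulting Expectation Suite is what later validation runs are checked against.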
How to use ZenML with Great Expectations
from zenml import pipeline
from zenml.integrations.great_expectations.steps.ge_validator import (
    great_expectations_validator_step,
)
# Import path for the expectation config may vary across ZenML versions.
from zenml.integrations.great_expectations.utils import (
    GreatExpectationExpectationConfig,
)

# Configure the validator step with the Expectations to check and the
# name under which the validated data asset is registered.
ge_validator_step = great_expectations_validator_step.with_options(
    parameters={
        "expectations_list": [
            GreatExpectationExpectationConfig(
                expectation_name="expect_column_values_to_be_between",
                expectation_args={
                    "column": "X_Minimum",
                    "min_value": 0,
                    "max_value": 2000,
                },
            ),
        ],
        "data_asset_name": "steel_plates_train_df",
    }
)

# `docker_settings`, `importer`, and `splitter` are assumed to be
# defined elsewhere in your project.
@pipeline(enable_cache=False, settings={"docker": docker_settings})
def validation_pipeline():
    imported_data = importer()
    train, test = splitter(imported_data)
    ge_validator_step(train)

validation_pipeline()
The code example demonstrates a simple ZenML pipeline that integrates Great Expectations for data validation. It imports the great_expectations_validator_step and configures it alongside custom importer and splitter steps. The list of Expectations is specified with the GreatExpectationExpectationConfig class, where each Expectation is defined by an expectation name and its arguments, such as the column name and value bounds. When you run the pipeline, the resulting artifacts are automatically stored and versioned in ZenML's Artifact Store. By default, the Great Expectations stores for validation results and checkpoints are also configured to use your active Artifact Store.
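To make the configured check concrete, here is a plain-Python sketch (hypothetical names and sample values, not the actual Great Expectations implementation) of what expect_column_values_to_be_between evaluates for the X_Minimum column, returning a small summary loosely modeled on a Validation Result:

```python
def expect_column_values_to_be_between(values, min_value, max_value):
    """Evaluate the between-check configured in the validator step.

    Returns a dict loosely modeled on a Great Expectations
    Validation Result: overall success plus the offending values.
    """
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return {
        "success": not unexpected,
        "unexpected_count": len(unexpected),
        "unexpected_values": unexpected,
    }

# X_Minimum must fall within [0, 2000], as in the pipeline example.
ok = expect_column_values_to_be_between([4, 17, 1988], 0, 2000)
bad = expect_column_values_to_be_between([4, -3, 2500], 0, 2000)
print(ok["success"])              # → True
print(bad["unexpected_values"])   # → [-3, 2500]
```

In the real integration, Great Expectations evaluates the check and ZenML persists the resulting Validation Result as a versioned artifact.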
Additional Resources
ZenML Great Expectations Integration Docs
Great Expectations Documentation
ZenML Great Expectations SDK Docs