{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/work-with-data/datadrift-tutorial/datadrift-quickdemo.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analyze data drift in Azure Machine Learning datasets \n",
"\n",
"In this tutorial, you will setup a data drift monitor on a weather dataset to:\n",
"\n",
"☑ Analyze historical data for drift\n",
"\n",
"☑ Setup a monitor to recieve email alerts if data drift is detected going forward\n",
"\n",
"If your workspace is Enterprise level, view and exlpore the results in the Azure Machine Learning studio. The video below shows the results from this tutorial. \n",
"\n",
"![gif](media/video.gif)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"If you are using an Azure Machine Learning Compute instance, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) if you haven't already established your connection to the AzureML Workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check core SDK version number\n",
"import azureml.core\n",
"\n",
"print('SDK version:', azureml.core.VERSION)"
]
},
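{
"cell_type": "markdown",
"metadata": {},
"source": [
"This tutorial imports `DataDriftDetector` from the `azureml.datadrift` module, which ships in the `azureml-datadrift` package. If your environment does not already include that package, the optional cell below installs it; skip it if the package is already present."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optionally, install the azureml-datadrift package if it is not already available\n",
"#%pip install azureml-datadrift"
]
},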
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Workspace\n",
"\n",
"Initialize a workspace object from persisted configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"ws"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup target and baseline datasets\n",
"\n",
"Setup the baseline and target datasets. The baseline will be used to compare each time slice of the target dataset, which is sampled by a given frequency. For further details, see [our documentation](http://aka.ms/datadrift). \n",
"\n",
"The next few cells will:\n",
" * get the default datastore\n",
" * upload the `weather-data` to the datastore\n",
" * create the Tabular dataset from the data\n",
" * add the timeseries trait by specifying the timestamp column `datetime`\n",
" * register the dataset\n",
" * create the baseline as a time slice of the target dataset\n",
" * optionally, register the baseline dataset\n",
" \n",
"The folder `weather-data` contains weather data from the [NOAA Integrated Surface Data](https://azure.microsoft.com/services/open-datasets/catalog/noaa-integrated-surface-data/) filtered down to to station names containing the string 'FLORIDA' to reduce the size of data. See `get_data.py` to see how this data is curated and modify as desired. This script may take a long time to run, hence the data is provided in the `weather-data` folder for this demo."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# use default datastore\n",
"dstore = ws.get_default_datastore()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# upload weather data\n",
"dstore.upload('weather-data', 'datadrift-data', overwrite=True, show_progress=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import Dataset class\n",
"from azureml.core import Dataset\n",
"\n",
"# create target dataset \n",
"target = Dataset.Tabular.from_parquet_files(dstore.path('datadrift-data/**/data.parquet'))\n",
"# set the timestamp column\n",
"target = target.with_timestamp_columns('datetime')\n",
"# register the target dataset\n",
"target = target.register(ws, 'target')\n",
"# retrieve the dataset from the workspace by name\n",
"target = Dataset.get_by_name(ws, 'target')"
]
},
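{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, preview the first few rows of the target dataset to confirm that the `datetime` timestamp column and the feature columns loaded as expected. This is only a quick sanity check and is not required for the rest of the tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# preview the first few rows of the target dataset\n",
"target.take(5).to_pandas_dataframe()"
]
},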
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import datetime \n",
"from datetime import datetime\n",
"\n",
"# set baseline dataset as January 2019 weather data\n",
"baseline = Dataset.Tabular.from_parquet_files(dstore.path('datadrift-data/2019/01/data.parquet'))"
]
},
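{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the target dataset has the timeseries trait, the baseline can also be taken as a time slice of the target dataset, as described above. The commented cell below sketches this alternative with `time_between`, using the `datetime` import from the previous cell; it should yield a baseline equivalent to loading the January 2019 file directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# alternatively, create the baseline as a time slice of the target dataset\n",
"#baseline = target.time_between(datetime(2019, 1, 1), datetime(2019, 1, 31))"
]
},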
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optionally, register the baseline dataset. if skipped, an unregistered dataset will be used\n",
"#baseline = baseline.register(ws, 'baseline')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create compute target\n",
"\n",
"> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.\n",
"\n",
"Create an Azure Machine Learning compute cluster to run the data drift monitor and associated runs. The below cell will create a compute cluster named `'cpu-cluster'`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import AmlCompute, ComputeTarget\n",
"\n",
"compute_name = 'cpu-cluster'\n",
"\n",
"if compute_name in ws.compute_targets:\n",
" compute_target = ws.compute_targets[compute_name]\n",
" if compute_target and type(compute_target) is AmlCompute:\n",
" print('found compute target. just use it. ' + compute_name)\n",
"else:\n",
" print('creating a new compute target...')\n",
" provisioning_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D3_V2', min_nodes=0, max_nodes=2)\n",
"\n",
" # create the cluster\n",
" compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)\n",
"\n",
" # can poll for a minimum number of nodes and for a specific timeout.\n",
" # if no min node count is provided it will use the scale settings for the cluster\n",
" compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
"\n",
" # For a more detailed view of current AmlCompute status, use get_status()\n",
" print(compute_target.get_status().serialize())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create data drift monitor\n",
"\n",
"See [our documentation](http://aka.ms/datadrift) for a complete description for all of the parameters. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"datadrift-remarks-sample"
]
},
"outputs": [],
"source": [
"from azureml.datadrift import DataDriftDetector, AlertConfiguration\n",
"\n",
"alert_config = AlertConfiguration(['user@contoso.com']) # replace with your email to recieve alerts from the scheduled pipeline after enabling\n",
"\n",
"monitor = DataDriftDetector.create_from_datasets(ws, 'weather-monitor', baseline, target, \n",
" compute_target='cpu-cluster', # compute target for scheduled pipeline and backfills \n",
" frequency='Week', # how often to analyze target data\n",
" feature_list=None, # list of features to detect drift on\n",
" drift_threshold=None, # threshold from 0 to 1 for email alerting\n",
" latency=0, # SLA in hours for target data to arrive in the dataset\n",
" alert_config=alert_config) # email addresses to send alert"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Update data drift monitor\n",
"\n",
"Many settings of the data drift monitor can be updated after creation. In this demo, we will update the `drift_threshold` and `feature_list`. See [our documentation](http://aka.ms/datadrift) for details on which settings can be changed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get monitor by name\n",
"monitor = DataDriftDetector.get_by_name(ws, 'weather-monitor')\n",
"\n",
"# create feature list - need to exclude columns that naturally drift or increment over time, such as year, day, index\n",
"columns = list(baseline.take(1).to_pandas_dataframe())\n",
"exclude = ['year', 'day', 'version', '__index_level_0__']\n",
"features = [col for col in columns if col not in exclude]\n",
"\n",
"# update the feature list\n",
"monitor = monitor.update(feature_list=features)"
]
},
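{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `drift_threshold` can be updated in the same way. The value of 0.3 below is only an illustrative choice; pick a threshold between 0 and 1 that suits your alerting needs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# update the drift threshold used for alerting (0.3 is an illustrative value)\n",
"monitor = monitor.update(drift_threshold=0.3)"
]
},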
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyze historical data and backfill\n",
"\n",
"You can use the `backfill` method to:\n",
" * analyze historical data\n",
" * backfill metrics after updating the settings (mainly the feature list)\n",
" * backfill metrics for failed runs\n",
" \n",
"The below cells will run two backfills that will produce data drift results for 2019 weather data, with January used as the baseline in the monitor. The output can be seen from the `show` method after the runs have completed, or viewed from the Azure Machine Learning studio for Enterprise workspaces.\n",
"\n",
"![Drift results](media/drift-results.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"jupyter": {
"source_hidden": true
}
},
"source": [
">**Tip!** When starting with the data drift capability, begin by backfilling a small section of data to get initial results. Update the feature list as needed, removing columns that drift but can be ignored, and backfill this section of data until you are satisfied with the results. Then backfill a larger slice of data and/or set the alert configuration and threshold, and enable the schedule to receive alerts for drift on your dataset. All of this can be done through the UI (Enterprise) or the Python SDK."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although it depends on many factors, the below backfill should typically take less than 20 minutes to run. Results will show as soon as they become available, not when the backfill is completed, so you may begin to see some metrics in a few minutes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# backfill for one month\n",
"backfill_start_date = datetime(2019, 9, 1)\n",
"backfill_end_date = datetime(2019, 10, 1)\n",
"backfill = monitor.backfill(backfill_start_date, backfill_end_date)\n",
"backfill"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Query metrics and show results in Python\n",
"\n",
"The below cell will plot some key data drift metrics, and can be used to query the results. Run `help(monitor.get_output)` for specifics on the object returned."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# make sure the backfill has completed\n",
"backfill.wait_for_completion(wait_post_processing=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get results from Python SDK (wait for backfills or monitor runs to finish)\n",
"results, metrics = monitor.get_output(start_time=datetime(year=2019, month=9, day=1))"
]
},
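{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also inspect the returned objects directly; the exact schema of `results` and `metrics` is described by `help(monitor.get_output)`. The cell below simply prints them for a quick look."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# inspect the raw drift results and metrics returned by get_output\n",
"print(results)\n",
"print(metrics)"
]
},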
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot the results from Python SDK \n",
"monitor.show(backfill_start_date, backfill_end_date)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Enable the monitor's pipeline schedule\n",
"\n",
"Turn on a scheduled pipeline which will anlayze the target dataset for drift every `frequency`. Use the latency parameter to adjust the start time of the pipeline. For instance, if it takes 24 hours for my data processing pipelines for data to arrive in the target dataset, set latency to 24. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# enable the pipeline schedule and recieve email alerts\n",
"monitor.enable_schedule()\n",
"\n",
"# disable the pipeline schedule \n",
"#monitor.disable_schedule()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Delete compute target\n",
"\n",
"Do not delete the compute target if you intend to keep using it for the data drift monitor scheduled runs or otherwise. If the minimum nodes are set to 0, it will scale down soon after jobs are completed, and scale up the next time the cluster is needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optionally delete the compute target\n",
"#compute_target.delete()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Delete the DataDriftDetector\n",
"\n",
"Invoking the `delete()` method on the object deletes the the drift monitor permanently and cannot be undone. You will no longer be able to find it in the UI and the `list()` or `get()` methods. The object on which delete() was called will have its state set to deleted and name suffixed with deleted. The baseline and target datasets and model data that was collected, if any, are not deleted. The compute is not deleted. The DataDrift schedule pipeline is disabled and archived."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"monitor.delete()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
" * See [our documentation](https://aka.ms/datadrift) or [Python SDK reference](https://docs.microsoft.com/python/api/overview/azure/ml/intro)\n",
" * [Send requests or feedback](mailto:driftfeedback@microsoft.com) on data drift directly to the team\n",
" * Please open issues with data drift here on GitHub or on StackOverflow if others are likely to run into the same issue"
]
}
],
"metadata": {
"authors": [
{
"name": "jamgan"
}
],
"category": "tutorial",
"compute": [
"Remote"
],
"datasets": [
"NOAA"
],
"deployment": [
"None"
],
"exclude_from_index": false,
"framework": [
"Azure ML"
],
"friendly_name": "Data drift quickdemo",
"index_order": 1,
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"star_tag": [
"featured"
],
"tags": [
"Dataset",
"Timeseries",
"Drift"
],
"task": "Filtering"
},
"nbformat": 4,
"nbformat_minor": 4
}