571 lines
16 KiB
Plaintext
571 lines
16 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
|
|
"\n",
|
|
"Licensed under the MIT License."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# AutoML 13: Prepare Data using `azureml.dataprep`\n",
|
|
"In this example we showcase how you can use `azureml.dataprep` SDK to load and prepare data for AutoML. `azureml.dataprep` can also be used standalone - full documentation can be found [here](https://github.com/Microsoft/PendletonDocs).\n",
|
|
"\n",
|
|
"Make sure you have executed the [setup](00.configuration.ipynb) before running this notebook.\n",
|
|
"\n",
|
|
"In this notebook you would see\n",
|
|
"1. Defining data loading and preparation steps in a `Dataflow` using `azureml.dataprep`\n",
|
|
"2. Passing the `Dataflow` to AutoML for local run\n",
|
|
"3. Passing the `Dataflow` to AutoML for remote run"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Install `azureml.dataprep` SDK"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Please restart your kernel after the below installs.\n",
|
|
"\n",
|
|
"Tornado must be downgraded to a pre-5 version due to a known Tornado x Jupyter event loop bug."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!pip install --upgrade --extra-index-url https://dataprepdownloads.azureedge.net/pypi/autoML-BD0E9CABED27C837/0.1.1809.11043 azureml-dataprep --no-cache-dir --force-reinstall\n",
|
|
"!pip install tornado==4.5.1"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Diagnostics\n",
|
|
"\n",
|
|
"Opt-in diagnostics for better experience, quality, and security of future releases"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.telemetry import set_diagnostics_collection\n",
|
|
"set_diagnostics_collection(send_diagnostics=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Create Experiment\n",
|
|
"\n",
|
|
"As part of the setup you have already created a <b>Workspace</b>. For AutoML you would need to create an <b>Experiment</b>. An <b>Experiment</b> is a named object in a <b>Workspace</b>, which is used to run experiments."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import logging\n",
|
|
"import os\n",
|
|
"\n",
|
|
"import pandas as pd\n",
|
|
"\n",
|
|
"import azureml.core\n",
|
|
"from azureml.core.compute import DsvmCompute\n",
|
|
"from azureml.core.experiment import Experiment\n",
|
|
"from azureml.core.runconfig import CondaDependencies\n",
|
|
"from azureml.core.runconfig import RunConfiguration\n",
|
|
"from azureml.core.workspace import Workspace\n",
|
|
"import azureml.dataprep as dprep\n",
|
|
"from azureml.train.automl import AutoMLConfig"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"ws = Workspace.from_config()\n",
|
|
" \n",
|
|
"# choose a name for experiment\n",
|
|
"experiment_name = 'automl-dataprep-classification'\n",
|
|
"# project folder\n",
|
|
"project_folder = './sample_projects/automl-dataprep-classification'\n",
|
|
" \n",
|
|
"experiment=Experiment(ws, experiment_name)\n",
|
|
" \n",
|
|
"output = {}\n",
|
|
"output['SDK version'] = azureml.core.VERSION\n",
|
|
"output['Subscription ID'] = ws.subscription_id\n",
|
|
"output['Workspace Name'] = ws.name\n",
|
|
"output['Resource Group'] = ws.resource_group\n",
|
|
"output['Location'] = ws.location\n",
|
|
"output['Project Directory'] = project_folder\n",
|
|
"output['Experiment Name'] = experiment.name\n",
|
|
"pd.set_option('display.max_colwidth', -1)\n",
|
|
"pd.DataFrame(data = output, index = ['']).T"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Loading Data using DataPrep"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# You can use `smart_read_file` which intelligently figures out delimiters and datatypes of a file\n",
|
|
"# data pulled from sklearn.datasets.load_digits()\n",
|
|
"simple_example_data_root = 'https://dprepdata.blob.core.windows.net/automl-notebook-data/'\n",
|
|
"X = dprep.smart_read_file(simple_example_data_root + 'X.csv').skip(1) # remove header\n",
|
|
"\n",
|
|
"# You can also use `read_csv` and `to_*` transformations to read (with overridable delimiter) \n",
|
|
"# and convert column types manually. \n",
|
|
"# Here we read a comma delimited file and convert all columns to integers.\n",
|
|
"y = dprep.read_csv(simple_example_data_root + 'y.csv').to_long(dprep.ColumnSelector(term='.*', use_regex=True))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Review the Data Preparation Result\n",
|
|
"\n",
|
|
"You can peek the result of a Dataflow at any range using `skip(i)` and `head(j)`. Doing so evaluates only `j` records for all the steps in the Dataflow, which makes it fast even against large dataset."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"X.skip(1).head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Instantiate AutoML Settings\n",
|
|
"\n",
|
|
"This creates a general Auto ML Settings applicable for both Local and Remote runs."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"automl_settings = {\n",
|
|
" \"max_time_sec\": 600,\n",
|
|
" \"iterations\": 2,\n",
|
|
" \"primary_metric\": 'AUC_weighted',\n",
|
|
" \"preprocess\": False,\n",
|
|
" \"verbosity\": logging.INFO,\n",
|
|
" \"n_cross_validations\" : 3\n",
|
|
"}"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Local Run"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Pass data with Dataflows\n",
|
|
"\n",
|
|
"The `Dataflow` objects captured above can be passed to `submit` method for local run. AutoML will retrieve the results from the `Dataflow` for model training."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"automl_config = AutoMLConfig(task = 'classification',\n",
|
|
" debug_log = 'automl_errors.log',\n",
|
|
" X = X,\n",
|
|
" y = y, \n",
|
|
" **automl_settings)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"local_run = experiment.submit(automl_config, show_output=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Remote Run"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Create or Attach a Remote Linux DSVM"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dsvm_name = 'mydsvm'\n",
|
|
"try:\n",
|
|
" dsvm_compute = DsvmCompute(ws, dsvm_name)\n",
|
|
" print('found existing dsvm.')\n",
|
|
"except:\n",
|
|
" print('creating new dsvm.')\n",
|
|
" dsvm_config = DsvmCompute.provisioning_configuration(vm_size = \"Standard_D2_v2\")\n",
|
|
" dsvm_compute = DsvmCompute.create(ws, name = dsvm_name, provisioning_configuration = dsvm_config)\n",
|
|
" dsvm_compute.wait_for_completion(show_output = True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Update Conda Dependency file to have AutoML and DataPrep SDK\n",
|
|
"\n",
|
|
"Currently AutoML and DataPrep SDK is not installed with Azure ML SDK by default. Due to this we update the conda dependency file to add such dependencies."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"cd = CondaDependencies()\n",
|
|
"cd.set_pip_index_url(index_url=\"--index-url https://azuremlsdktestpypi.azureedge.net/sdk-release/master/588E708E0DF342C4A80BD954289657CF\")\n",
|
|
"cd.set_pip_index_url(index_url=\"--extra-index-url https://dataprepdownloads.azureedge.net/pypi/autoML-BD0E9CABED27C837/0.1.1809.11043 --extra-index-url https://pypi.python.org/simple\")\n",
|
|
"cd.remove_pip_package(pip_package=\"azureml-defaults\")\n",
|
|
"cd.add_pip_package(pip_package='azureml-core')\n",
|
|
"cd.add_pip_package(pip_package='azureml-telemetry')\n",
|
|
"cd.add_pip_package(pip_package='azureml-train-automl')\n",
|
|
"cd.add_pip_package(pip_package='azureml-dataprep')\n",
|
|
"cd.add_pip_package(pip_package='tornado==4.5.1')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Create a RunConfiguration with DSVM name"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"run_config = RunConfiguration(conda_dependencies=cd)\n",
|
|
"run_config.target = dsvm_compute\n",
|
|
"run_config.auto_prepare_environment = True"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Pass data with Dataflows\n",
|
|
"\n",
|
|
"The `Dataflow` objects captured above can also be passed to `submit` method for remote run. AutoML will serialize the `Dataflow` and send to remote compute target. The `Dataflow` will not be evaluated locally."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"automl_config = AutoMLConfig(task = 'classification',\n",
|
|
" debug_log = 'automl_errors.log',\n",
|
|
" path=project_folder,\n",
|
|
" run_configuration = run_config,\n",
|
|
" X = X,\n",
|
|
" y = y,\n",
|
|
" **automl_settings)\n",
|
|
"remote_run = experiment.submit(automl_config, show_output=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Exploring the results"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Widget for monitoring runs\n",
|
|
"\n",
|
|
"The widget will sit on \"loading\" until the first iteration completed, then you will see an auto-updating graph and table show up. It refreshed once per minute, so you should see the graph update as child runs complete.\n",
|
|
"\n",
|
|
"NOTE: The widget displays a link at the bottom. This links to a web-ui to explore the individual run details."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.train.widgets import RunDetails\n",
|
|
"RunDetails(local_run).show() "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Retrieve All Child Runs\n",
|
|
"You can also use sdk methods to fetch all the child runs and see individual metrics that we log. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"children = list(local_run.get_children())\n",
|
|
"metricslist = {}\n",
|
|
"for run in children:\n",
|
|
" properties = run.get_properties()\n",
|
|
" metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)} \n",
|
|
" metricslist[int(properties['iteration'])] = metrics\n",
|
|
" \n",
|
|
"import pandas as pd\n",
|
|
"rundata = pd.DataFrame(metricslist).sort_index(1)\n",
|
|
"rundata"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Retrieve the Best Model\n",
|
|
"\n",
|
|
"Below we select the best pipeline from our iterations. The *get_output* method on automl_classifier returns the best run and the fitted model for the last *fit* invocation. There are overloads on *get_output* that allow you to retrieve the best run and fitted model for *any* logged metric or a particular *iteration*."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"best_run, fitted_model = local_run.get_output()\n",
|
|
"print(best_run)\n",
|
|
"print(fitted_model)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Best Model based on any other metric\n",
|
|
"Give me the run and the model that has the smallest `log_loss`:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"lookup_metric = \"log_loss\"\n",
|
|
"best_run, fitted_model = local_run.get_output(metric = lookup_metric)\n",
|
|
"print(best_run)\n",
|
|
"print(fitted_model)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Best Model based on any iteration\n",
|
|
"Give me the run and the model from the 1st iteration:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"iteration = 0\n",
|
|
"best_run, fitted_model = local_run.get_output(iteration = iteration)\n",
|
|
"print(best_run)\n",
|
|
"print(fitted_model)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Testing the Fitted Model \n",
|
|
"\n",
|
|
"#### Load Test Data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from sklearn import datasets\n",
|
|
"\n",
|
|
"digits = datasets.load_digits()\n",
|
|
"X_digits = digits.data[:10, :]\n",
|
|
"y_digits = digits.target[:10]\n",
|
|
"images = digits.images[:10]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Testing our best pipeline\n",
|
|
"We will try to predict 2 digits and see how our model works."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Randomly select digits and test\n",
|
|
"from matplotlib import pyplot as plt\n",
|
|
"from matplotlib.pyplot import imshow\n",
|
|
"import random\n",
|
|
"import numpy as np\n",
|
|
"\n",
|
|
"for index in np.random.choice(len(y_digits), 2):\n",
|
|
" print(index)\n",
|
|
" predicted = fitted_model.predict(X_digits[index:index + 1])[0]\n",
|
|
" label = y_digits[index]\n",
|
|
" title = \"Label value = %d Predicted value = %d \" % ( label,predicted)\n",
|
|
" fig = plt.figure(1, figsize=(3,3))\n",
|
|
" ax1 = fig.add_axes((0,0,.8,.8))\n",
|
|
" ax1.set_title(title)\n",
|
|
" plt.imshow(images[index], cmap=plt.cm.gray_r, interpolation='nearest')\n",
|
|
" plt.show()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Appendix"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Capture the Dataflows to use for AutoML later\n",
|
|
"\n",
|
|
"`Dataflow` objects are immutable. Each of them is composed of a list of data preparation steps. A `Dataflow` can be branched at any point for further usage."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# sklearn.digits.data + target\n",
|
|
"digits_complete = dprep.smart_read_file('https://dprepdata.blob.core.windows.net/automl-notebook-data/digits-complete.csv')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"`digits_complete` (sourced from `sklearn.datasets.load_digits()`)is forked into `dflow_X` to capture all the feature columns and `dflow_y` to capture the label column."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"digits_complete.to_pandas_dataframe().shape\n",
|
|
"labels_column = 'Column64'\n",
|
|
"dflow_X = digits_complete.drop_columns(columns=[labels_column])\n",
|
|
"dflow_y = digits_complete.keep_columns(columns=[labels_column])"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python [default]",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.6"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|