mirror of https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-20 01:27:06 -05:00
507 lines
14 KiB
Plaintext
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
        "\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Automated Machine Learning: Prepare Data using `azureml.dataprep` for Remote Execution (DSVM)\n",
        "In this example we showcase how you can use the `azureml.dataprep` SDK to load and prepare data for AutoML. `azureml.dataprep` can also be used standalone; full documentation can be found [here](https://github.com/Microsoft/PendletonDocs).\n",
        "\n",
        "Make sure you have executed the [configuration](../configuration.ipynb) before running this notebook.\n",
        "\n",
        "In this notebook you will learn how to:\n",
        "1. Define data loading and preparation steps in a `Dataflow` using `azureml.dataprep`.\n",
        "2. Pass the `Dataflow` to AutoML for a local run.\n",
        "3. Pass the `Dataflow` to AutoML for a remote run."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Compatibility\n",
        "\n",
        "Currently, Data Prep only supports __Ubuntu 16__ and __Red Hat Enterprise Linux 7__. We are working on supporting more Linux distributions."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Diagnostics\n",
        "\n",
        "Opt in to diagnostics for a better experience, and to improve the quality and security of future releases."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.telemetry import set_diagnostics_collection\n",
        "set_diagnostics_collection(send_diagnostics = True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Create an Experiment\n",
        "\n",
        "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import logging\n",
        "import os\n",
        "import time\n",
        "\n",
        "import pandas as pd\n",
        "\n",
        "import azureml.core\n",
        "from azureml.core.compute import DsvmCompute\n",
        "from azureml.core.experiment import Experiment\n",
        "from azureml.core.workspace import Workspace\n",
        "import azureml.dataprep as dprep\n",
        "from azureml.train.automl import AutoMLConfig"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "\n",
        "# choose a name for the experiment\n",
        "experiment_name = 'automl-dataprep-remote-dsvm'\n",
        "# project folder\n",
        "project_folder = './sample_projects/automl-dataprep-remote-dsvm'\n",
        "\n",
        "experiment = Experiment(ws, experiment_name)\n",
        "\n",
        "output = {}\n",
        "output['SDK version'] = azureml.core.VERSION\n",
        "output['Subscription ID'] = ws.subscription_id\n",
        "output['Workspace Name'] = ws.name\n",
        "output['Resource Group'] = ws.resource_group\n",
        "output['Location'] = ws.location\n",
        "output['Project Directory'] = project_folder\n",
        "output['Experiment Name'] = experiment.name\n",
        "pd.set_option('display.max_colwidth', -1)\n",
        "pd.DataFrame(data = output, index = ['']).T"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Loading Data using DataPrep"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# You can use `auto_read_file`, which intelligently figures out the delimiters and datatypes of a file.\n",
        "# The data referenced here was pulled from `sklearn.datasets.load_digits()`.\n",
        "simple_example_data_root = 'https://dprepdata.blob.core.windows.net/automl-notebook-data/'\n",
        "X = dprep.auto_read_file(simple_example_data_root + 'X.csv').skip(1)  # Remove the header row.\n",
        "\n",
        "# You can also use `read_csv` and `to_*` transformations to read (with an overridable delimiter)\n",
        "# and convert column types manually.\n",
        "# Here we read a comma-delimited file and convert all columns to integers.\n",
        "y = dprep.read_csv(simple_example_data_root + 'y.csv').to_long(dprep.ColumnSelector(term='.*', use_regex = True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Review the Data Preparation Result\n",
        "\n",
        "You can peek at the result of a Dataflow at any range using `skip(i)` and `head(j)`. Doing so evaluates only `j` records for all the steps in the Dataflow, which makes it fast even against large datasets."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "X.skip(1).head(5)"
      ]
    },
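    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The same kind of peek works for the label `Dataflow`. As a quick sanity check (a minimal sketch added here, using only the `head` call shown above), `head(5)` on `y` should show a single numeric column after the `to_long` conversion."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Peek at the label Dataflow to confirm the to_long conversion produced numeric values.\n",
        "y.head(5)"
      ]
    },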
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Configure AutoML\n",
        "\n",
        "This creates a general AutoML settings object that is applicable to both local and remote runs."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "automl_settings = {\n",
        "    \"iteration_timeout_minutes\" : 10,\n",
        "    \"iterations\" : 2,\n",
        "    \"primary_metric\" : 'AUC_weighted',\n",
        "    \"preprocess\" : False,\n",
        "    \"verbosity\" : logging.INFO,\n",
        "    \"n_cross_validations\": 3\n",
        "}"
      ]
    },
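    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Local Run\n",
        "\n",
        "The introduction above lists passing the `Dataflow` to AutoML for a local run. The following is a minimal sketch of one: it reuses the same `X`, `y`, and `automl_settings`, and assumes that omitting `run_configuration` makes AutoML execute on the local machine; `local_automl_config` and `local_run` are names introduced here for illustration."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# A sketch of a local run: same task, data, and settings, but no remote run configuration.\n",
        "local_automl_config = AutoMLConfig(task = 'classification',\n",
        "                                   debug_log = 'automl_errors.log',\n",
        "                                   path = project_folder,\n",
        "                                   X = X,\n",
        "                                   y = y,\n",
        "                                   **automl_settings)\n",
        "\n",
        "local_run = experiment.submit(local_automl_config, show_output = True)"
      ]
    },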
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Remote Run"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create or Attach a Remote Linux DSVM"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dsvm_name = 'mydsvmc'\n",
        "\n",
        "try:\n",
        "    # If the DSVM is still provisioning, wait until it is ready.\n",
        "    while ws.compute_targets[dsvm_name].provisioning_state == 'Creating':\n",
        "        time.sleep(1)\n",
        "\n",
        "    dsvm_compute = DsvmCompute(ws, dsvm_name)\n",
        "    print('Found existing DSVM.')\n",
        "except Exception:\n",
        "    print('Creating a new DSVM.')\n",
        "    dsvm_config = DsvmCompute.provisioning_configuration(vm_size = \"Standard_D2_v2\")\n",
        "    dsvm_compute = DsvmCompute.create(ws, name = dsvm_name, provisioning_configuration = dsvm_config)\n",
        "    dsvm_compute.wait_for_completion(show_output = True)\n",
        "    print(\"Waiting one minute for ssh to be accessible\")\n",
        "    time.sleep(60) # Wait for ssh to be accessible\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.runconfig import RunConfiguration\n",
        "from azureml.core.conda_dependencies import CondaDependencies\n",
        "\n",
        "# Create a run configuration targeting the DSVM, with the packages AutoML needs installed.\n",
        "conda_run_config = RunConfiguration(framework=\"python\")\n",
        "\n",
        "conda_run_config.target = dsvm_compute\n",
        "\n",
        "cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy'])\n",
        "conda_run_config.environment.python.conda_dependencies = cd"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Pass Data with `Dataflow` Objects\n",
        "\n",
        "The `Dataflow` objects captured above can also be passed to the `submit` method for a remote run. AutoML will serialize the `Dataflow` object and send it to the remote compute target. The `Dataflow` will not be evaluated locally."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "automl_config = AutoMLConfig(task = 'classification',\n",
        "                             debug_log = 'automl_errors.log',\n",
        "                             path = project_folder,\n",
        "                             run_configuration = conda_run_config,\n",
        "                             X = X,\n",
        "                             y = y,\n",
        "                             **automl_settings)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "remote_run = experiment.submit(automl_config, show_output = True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Explore the Results"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Widget for Monitoring Runs\n",
        "\n",
        "The widget will first report a \"loading\" status while running the first iteration. After the first iteration completes, an auto-updating graph and table will be shown. The widget refreshes once per minute, so you should see the graph update as child runs complete.\n",
        "\n",
        "**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.widgets import RunDetails\n",
        "RunDetails(remote_run).show()"
      ]
    },
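    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you prefer to block the notebook until all iterations finish rather than watch the widget, you can wait on the run directly. This is a sketch added here; it assumes the standard `Run.wait_for_completion` method, which the original sample does not call."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Block until the remote run (and all of its child iterations) completes.\n",
        "remote_run.wait_for_completion(show_output = True)"
      ]
    },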
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Retrieve All Child Runs\n",
        "You can also use SDK methods to fetch all the child runs and see the individual metrics that we log."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "children = list(remote_run.get_children())\n",
        "metricslist = {}\n",
        "for run in children:\n",
        "    properties = run.get_properties()\n",
        "    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
        "    metricslist[int(properties['iteration'])] = metrics\n",
        "\n",
        "import pandas as pd\n",
        "rundata = pd.DataFrame(metricslist).sort_index(axis = 1)\n",
        "rundata"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Retrieve the Best Model\n",
        "\n",
        "Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "best_run, fitted_model = remote_run.get_output()\n",
        "print(best_run)\n",
        "print(fitted_model)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Best Model Based on Any Other Metric\n",
        "Show the run and the model that has the smallest `log_loss` value:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "lookup_metric = \"log_loss\"\n",
        "best_run, fitted_model = remote_run.get_output(metric = lookup_metric)\n",
        "print(best_run)\n",
        "print(fitted_model)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Model from a Specific Iteration\n",
        "Show the run and the model from the first iteration:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "iteration = 0\n",
        "best_run, fitted_model = remote_run.get_output(iteration = iteration)\n",
        "print(best_run)\n",
        "print(fitted_model)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Test the Best Fitted Model\n",
        "\n",
        "#### Load Test Data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from sklearn import datasets\n",
        "\n",
        "digits = datasets.load_digits()\n",
        "X_test = digits.data[:10, :]\n",
        "y_test = digits.target[:10]\n",
        "images = digits.images[:10]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Testing Our Best Fitted Model\n",
        "We will try to predict 2 digits and see how our model works."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Randomly select digits and test.\n",
        "from matplotlib import pyplot as plt\n",
        "import numpy as np\n",
        "\n",
        "for index in np.random.choice(len(y_test), 2, replace = False):\n",
        "    print(index)\n",
        "    predicted = fitted_model.predict(X_test[index:index + 1])[0]\n",
        "    label = y_test[index]\n",
        "    title = \"Label value = %d  Predicted value = %d\" % (label, predicted)\n",
        "    fig = plt.figure(1, figsize=(3,3))\n",
        "    ax1 = fig.add_axes((0,0,.8,.8))\n",
        "    ax1.set_title(title)\n",
        "    plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n",
        "    plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Appendix"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Capture the `Dataflow` Objects for Later Use in AutoML\n",
        "\n",
        "`Dataflow` objects are immutable and are composed of a list of data preparation steps. A `Dataflow` object can be branched at any point for further usage."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# sklearn.digits.data + target\n",
        "digits_complete = dprep.auto_read_file('https://dprepdata.blob.core.windows.net/automl-notebook-data/digits-complete.csv')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`digits_complete` (sourced from `sklearn.datasets.load_digits()`) is forked into `dflow_X` to capture all the feature columns and `dflow_y` to capture the label column."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Print the shape explicitly; a bare expression mid-cell would be silently discarded.\n",
        "print(digits_complete.to_pandas_dataframe().shape)\n",
        "labels_column = 'Column64'\n",
        "dflow_X = digits_complete.drop_columns(columns = [labels_column])\n",
        "dflow_y = digits_complete.keep_columns(columns = [labels_column])"
      ]
    },
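    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The captured `Dataflow` objects can then be handed to AutoML exactly like `X` and `y` earlier in this notebook. The cell below is a sketch of that reuse: `appendix_automl_config` is a name introduced here for illustration, and the configuration simply mirrors the one used for the remote run above."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Reuse the branched Dataflows as AutoML inputs, mirroring the earlier remote configuration.\n",
        "appendix_automl_config = AutoMLConfig(task = 'classification',\n",
        "                                      debug_log = 'automl_errors.log',\n",
        "                                      path = project_folder,\n",
        "                                      run_configuration = conda_run_config,\n",
        "                                      X = dflow_X,\n",
        "                                      y = dflow_y,\n",
        "                                      **automl_settings)"
      ]
    }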
  ],
  "metadata": {
    "authors": [
      {
        "name": "savitam"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}