mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-21 10:05:09 -05:00
496 lines
16 KiB
Plaintext
496 lines
16 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
|
|
"\n",
|
|
"Licensed under the MIT License."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Auto ML : Remote Execution with Text data from Blobstorage\n",
|
|
"\n",
|
|
"In this example we use the [Burning Man 2016 dataset](https://innovate.burningman.org/datasets-page/) to showcase how you can use AutoML to handle text data from a Azure blobstorage.\n",
|
|
"\n",
|
|
"Make sure you have executed the [00.configuration](00.configuration.ipynb) before running this notebook.\n",
|
|
"\n",
|
|
"In this notebook you would see\n",
|
|
"1. Creating an Experiment using an existing Workspace\n",
|
|
"2. Attaching an existing DSVM to a workspace\n",
|
|
"3. Instantiating AutoMLConfig \n",
|
|
"4. Training the Model using the DSVM\n",
|
|
"5. Exploring the results\n",
|
|
"6. Testing the fitted model\n",
|
|
"\n",
|
|
"In addition this notebook showcases the following features\n",
|
|
"- **Parallel** Executions for iterations\n",
|
|
"- Asyncronous tracking of progress\n",
|
|
"- **Cancelling** individual iterations or the entire run\n",
|
|
"- Retrieving models for any iteration or logged metric\n",
|
|
"- specify automl settings as **kwargs**\n",
|
|
"- handling **text** data with **preprocess** flag\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Create Experiment\n",
|
|
"\n",
|
|
"As part of the setup you have already created a <b>Workspace</b>. For AutoML you would need to create an <b>Experiment</b>. An <b>Experiment</b> is a named object in a <b>Workspace</b>, which is used to run experiments."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import logging\n",
|
|
"import os\n",
|
|
"import random\n",
|
|
"\n",
|
|
"from matplotlib import pyplot as plt\n",
|
|
"from matplotlib.pyplot import imshow\n",
|
|
"import numpy as np\n",
|
|
"import pandas as pd\n",
|
|
"from sklearn import datasets\n",
|
|
"\n",
|
|
"import azureml.core\n",
|
|
"from azureml.core.experiment import Experiment\n",
|
|
"from azureml.core.workspace import Workspace\n",
|
|
"from azureml.train.automl import AutoMLConfig\n",
|
|
"from azureml.train.automl.run import AutoMLRun"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"ws = Workspace.from_config()\n",
|
|
"\n",
|
|
"# choose a name for the run history container in the workspace\n",
|
|
"experiment_name = 'automl-remote-dsvm-blobstore'\n",
|
|
"# project folder\n",
|
|
"project_folder = './sample_projects/automl-remote-dsvm-blobstore'\n",
|
|
"\n",
|
|
"experiment = Experiment(ws, experiment_name)\n",
|
|
"\n",
|
|
"output = {}\n",
|
|
"output['SDK version'] = azureml.core.VERSION\n",
|
|
"output['Subscription ID'] = ws.subscription_id\n",
|
|
"output['Workspace'] = ws.name\n",
|
|
"output['Resource Group'] = ws.resource_group\n",
|
|
"output['Location'] = ws.location\n",
|
|
"output['Project Directory'] = project_folder\n",
|
|
"output['Experiment Name'] = experiment.name\n",
|
|
"pd.set_option('display.max_colwidth', -1)\n",
|
|
"pd.DataFrame(data=output, index=['']).T"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Diagnostics\n",
|
|
"\n",
|
|
"Opt-in diagnostics for better experience, quality, and security of future releases"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.telemetry import set_diagnostics_collection\n",
|
|
"set_diagnostics_collection(send_diagnostics=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Attach a Remote Linux DSVM\n",
|
|
"To use remote docker commpute target:\n",
|
|
"1. Create a Linux DSVM in Azure. Here is some [quick instructions](https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/how-to-create-dsvm-hdi). Make sure you use the Ubuntu flavor, NOT CentOS. Make sure that disk space is available under /tmp because AutoML creates files under /tmp/azureml_runs. The DSVM should have more cores than the number of parallel runs that you plan to enable. It should also have at least 4Gb per core.\n",
|
|
"2. Enter the IP address, username and password below\n",
|
|
"\n",
|
|
"**Note**: By default SSH runs on port 22 and you don't need to specify it. But if for security reasons you can switch to a different port (such as 5022), you can append the port number to the address. [Read more](https://render.githubusercontent.com/documentation/sdk/ssh-issue.md) on this."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.compute import RemoteCompute\n",
|
|
"\n",
|
|
"# Add your VM information below\n",
|
|
"dsvm_name = 'mydsvm1'\n",
|
|
"dsvm_ip_addr = '<<ip_addr>>'\n",
|
|
"dsvm_username = '<<username>>'\n",
|
|
"dsvm_password = '<<password>>'\n",
|
|
"\n",
|
|
"dsvm_compute = RemoteCompute.attach(workspace=ws, name=dsvm_name, address=dsvm_ip_addr, username=dsvm_username, password=dsvm_password, ssh_port=22)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Create Get Data File\n",
|
|
"For remote executions you should author a get_data.py file containing a get_data() function. This file should be in the root directory of the project. You can encapsulate code to read data either from a blob storage or local disk in this file.\n",
|
|
"\n",
|
|
"The *get_data()* function returns a [dictionary](README.md#getdata)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"if not os.path.exists(project_folder):\n",
|
|
" os.makedirs(project_folder)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%writefile $project_folder/get_data.py\n",
|
|
"\n",
|
|
"import pandas as pd\n",
|
|
"from sklearn.model_selection import train_test_split\n",
|
|
"from sklearn.preprocessing import LabelEncoder\n",
|
|
"\n",
|
|
"def get_data():\n",
|
|
" # Burning man 2016 data\n",
|
|
" df = pd.read_csv(\"https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv\",\n",
|
|
" delimiter=\"\\t\", quotechar='\"')\n",
|
|
" # get integer labels\n",
|
|
" le = LabelEncoder()\n",
|
|
" le.fit(df[\"Label\"].values)\n",
|
|
" y = le.transform(df[\"Label\"].values)\n",
|
|
" df = df.drop([\"Label\"], axis=1)\n",
|
|
"\n",
|
|
" df_train, _, y_train, _ = train_test_split(df, y, test_size=0.1, random_state=42)\n",
|
|
"\n",
|
|
" return { \"X\" : df, \"y\" : y }"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### View data\n",
|
|
"\n",
|
|
"You can execute the *get_data()* function locally to view the *train* data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%run $project_folder/get_data.py\n",
|
|
"data_dict = get_data()\n",
|
|
"df = data_dict[\"X\"]\n",
|
|
"y = data_dict[\"y\"]\n",
|
|
"pd.set_option('display.max_colwidth', 15)\n",
|
|
"df['Label'] = pd.Series(y, index=df.index)\n",
|
|
"df.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Instantiate AutoML <a class=\"anchor\" id=\"Instatiate-AutoML-Remote-DSVM\"></a>\n",
|
|
"\n",
|
|
"You can specify automl_settings as **kwargs** as well. Also note that you can use the get_data() symantic for local excutions too. \n",
|
|
"\n",
|
|
"<i>Note: For Remote DSVM and Batch AI you cannot pass Numpy arrays directly to the fit method.</i>\n",
|
|
"\n",
|
|
"|Property|Description|\n",
|
|
"|-|-|\n",
|
|
"|**primary_metric**|This is the metric that you want to optimize.<br> Classification supports the following primary metrics <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>balanced_accuracy</i><br><i>average_precision_score_weighted</i><br><i>precision_score_weighted</i>|\n",
|
|
"|**max_time_sec**|Time limit in seconds for each iteration|\n",
|
|
"|**iterations**|Number of iterations. In each iteration Auto ML trains a specific pipeline with the data|\n",
|
|
"|**n_cross_validations**|Number of cross validation splits|\n",
|
|
"|**concurrent_iterations**|Max number of iterations that would be executed in parallel. This should be less than the number of cores on the DSVM\n",
|
|
"|**preprocess**| *True/False* <br>Setting this to *True* enables AutoML to perform preprocessing <br>on the input to handle *missing data*, and perform some common *feature extraction*|\n",
|
|
"|**max_cores_per_iteration**| Indicates how many cores on the compute target would be used to train a single pipeline.<br> Default is *1*, you can set it to *-1* to use all cores|"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"automl_settings = {\n",
|
|
" \"max_time_sec\": 3600,\n",
|
|
" \"iterations\": 10,\n",
|
|
" \"n_cross_validations\": 5,\n",
|
|
" \"primary_metric\": 'AUC_weighted',\n",
|
|
" \"preprocess\": True,\n",
|
|
" \"max_cores_per_iteration\": 2\n",
|
|
"}\n",
|
|
"\n",
|
|
"automl_config = AutoMLConfig(task = 'classification',\n",
|
|
" path=project_folder,\n",
|
|
" compute_target = dsvm_compute,\n",
|
|
" data_script = project_folder + \"/get_data.py\",\n",
|
|
" **automl_settings\n",
|
|
" )\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Training the Model <a class=\"anchor\" id=\"Training-the-model-Remote-DSVM\"></a>\n",
|
|
"\n",
|
|
"For remote runs the execution is asynchronous, so you will see the iterations get populated as they complete. You can interact with the widgets/models even when the experiment is running to retreive the best model up to that point. Once you are satisfied with the model you can cancel a particular iteration or the whole run."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"remote_run = experiment.submit(automl_config)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Exploring the Results <a class=\"anchor\" id=\"Exploring-the-Results-Remote-DSVM\"></a>\n",
|
|
"#### Widget for monitoring runs\n",
|
|
"\n",
|
|
"The widget will sit on \"loading\" until the first iteration completed, then you will see an auto-updating graph and table show up. It refreshed once per minute, so you should see the graph update as child runs complete.\n",
|
|
"\n",
|
|
"You can click on a pipeline to see run properties and output logs. Logs are also available on the DSVM under /tmp/azureml_run/{iterationid}/azureml-logs\n",
|
|
"\n",
|
|
"NOTE: The widget displays a link at the bottom. This links to a web-ui to explore the individual run details."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.train.widgets import RunDetails\n",
|
|
"RunDetails(remote_run).show() "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\n",
|
|
"#### Retrieve All Child Runs\n",
|
|
"You can also use sdk methods to fetch all the child runs and see individual metrics that we log. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"children = list(remote_run.get_children())\n",
|
|
"metricslist = {}\n",
|
|
"for run in children:\n",
|
|
" properties = run.get_properties()\n",
|
|
" metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)} \n",
|
|
" metricslist[int(properties['iteration'])] = metrics\n",
|
|
"\n",
|
|
"rundata = pd.DataFrame(metricslist).sort_index(1)\n",
|
|
"rundata"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Canceling runs\n",
|
|
"You can cancel ongoing remote runs using the *cancel()* and *cancel_iteration()* functions"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Cancel the ongoing experiment and stop scheduling new iterations\n",
|
|
"remote_run.cancel()\n",
|
|
"\n",
|
|
"# Cancel iteration 1 and move onto iteration 2\n",
|
|
"# remote_run.cancel_iteration(1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Retrieve the Best Model\n",
|
|
"\n",
|
|
"Below we select the best pipeline from our iterations. The *get_output* method on automl_classifier returns the best run and the fitted model for the last *fit* invocation. There are overloads on *get_output* that allow you to retrieve the best run and fitted model for *any* logged metric or a particular *iteration*."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"best_run, fitted_model = remote_run.get_output()\n",
|
|
"print(best_run)\n",
|
|
"print(fitted_model)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Best Model based on any other metric"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# lookup_metric = \"accuracy\"\n",
|
|
"# best_run, fitted_model = remote_run.get_output(metric=lookup_metric)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Model from a specific iteration"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"iteration = 0\n",
|
|
"zero_run, zero_model = remote_run.get_output(iteration=iteration)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Register fitted model for deployment"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"description = 'AutoML Model'\n",
|
|
"tags = None\n",
|
|
"remote_run.register_model(description=description, tags=tags)\n",
|
|
"remote_run.model_id # Use this id to deploy the model as a web service in Azure"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Testing the Fitted Model <a class=\"anchor\" id=\"Testing-the-Fitted-Model-Remote-DSVM\"></a>\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import sklearn\n",
|
|
"from sklearn.model_selection import train_test_split\n",
|
|
"from sklearn.preprocessing import LabelEncoder\n",
|
|
"from pandas_ml import ConfusionMatrix\n",
|
|
"\n",
|
|
"df = pd.read_csv(\"https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv\",\n",
|
|
" delimiter=\"\\t\", quotechar='\"')\n",
|
|
"\n",
|
|
"# get integer labels\n",
|
|
"le = LabelEncoder()\n",
|
|
"le.fit(df[\"Label\"].values)\n",
|
|
"y = le.transform(df[\"Label\"].values)\n",
|
|
"df = df.drop([\"Label\"], axis=1)\n",
|
|
"\n",
|
|
"_, df_test, _, y_test = train_test_split(df, y, test_size=0.1, random_state=42)\n",
|
|
"\n",
|
|
"\n",
|
|
"ypred = fitted_model.predict(df_test.values)\n",
|
|
"\n",
|
|
"\n",
|
|
"ypred_strings = le.inverse_transform(ypred)\n",
|
|
"ytest_strings = le.inverse_transform(y_test)\n",
|
|
"\n",
|
|
"cm = ConfusionMatrix(ytest_strings, ypred_strings)\n",
|
|
"\n",
|
|
"print(cm)\n",
|
|
"\n",
|
|
"cm.plot()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.6"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|