mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-20 09:37:04 -05:00
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
|
|
"\n",
|
|
"Licensed under the MIT License."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Automated Machine Learning\n",
|
|
"_**Classification with Local Compute**_\n",
|
|
"\n",
|
|
"## Contents\n",
|
|
"1. [Introduction](#Introduction)\n",
|
|
"1. [Setup](#Setup)\n",
|
|
"1. [Data](#Data)\n",
|
|
"1. [Train](#Train)\n",
|
|
"1. [Results](#Results)\n",
|
|
"1. [Test](#Test)\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Introduction\n",
|
|
"\n",
|
|
"In this example we use scikit-learn's [digits dataset](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) to show how you can use AutoML for a simple classification problem.\n",
|
|
"\n",
|
|
"Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n",
|
|
"\n",
|
|
"In this notebook you will learn how to:\n",
|
|
"1. Create an `Experiment` in an existing `Workspace`.\n",
|
|
"2. Configure AutoML using `AutoMLConfig`.\n",
|
|
"3. Train the model using local compute.\n",
|
|
"4. Explore the results.\n",
|
|
"5. Test the best fitted model."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Setup\n",
|
|
"\n",
|
|
"As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import logging\n",
|
|
"\n",
|
|
"from matplotlib import pyplot as plt\n",
|
|
"import numpy as np\n",
|
|
"import pandas as pd\n",
|
|
"from sklearn import datasets\n",
|
|
"\n",
|
|
"import azureml.core\n",
|
|
"from azureml.core.experiment import Experiment\n",
|
|
"from azureml.core.workspace import Workspace\n",
|
|
"from azureml.train.automl import AutoMLConfig"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Accessing the Azure ML workspace requires authentication with Azure.\n",
|
|
"\n",
|
|
"The default authentication is interactive authentication using the default tenant. Executing the `ws = Workspace.from_config()` line in the cell below will prompt for authentication the first time that it is run.\n",
|
|
"\n",
|
|
"If you have multiple Azure tenants, you can specify the tenant by replacing the `ws = Workspace.from_config()` line in the cell below with the following:\n",
|
|
"\n",
|
|
"```\n",
|
|
"from azureml.core.authentication import InteractiveLoginAuthentication\n",
|
|
"auth = InteractiveLoginAuthentication(tenant_id = 'mytenantid')\n",
|
|
"ws = Workspace.from_config(auth = auth)\n",
|
|
"```\n",
|
|
"\n",
|
|
"If you need to run in an environment where interactive login is not possible, you can use Service Principal authentication by replacing the `ws = Workspace.from_config()` line in the cell below with the following:\n",
|
|
"\n",
|
|
"```\n",
|
|
"from azureml.core.authentication import ServicePrincipalAuthentication\n",
|
|
"auth = ServicePrincipalAuthentication('mytenantid', 'myappid', 'mypassword')\n",
|
|
"ws = Workspace.from_config(auth = auth)\n",
|
|
"```\n",
|
|
"For more details, see [aka.ms/aml-notebook-auth](http://aka.ms/aml-notebook-auth)"
|
|
]
|
|
},
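{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Service Principal snippet above hard-codes its credentials. As a minimal sketch, and not something this notebook prescribes, the three values could instead be read from environment variables. The variable names `AML_TENANT_ID`, `AML_APP_ID` and `AML_APP_SECRET` and the helper `load_service_principal_settings` are hypothetical.\n",
"\n",
"```python\n",
"import os\n",
"\n",
"# Hypothetical variable names, chosen only for this illustration.\n",
"SETTING_NAMES = ('AML_TENANT_ID', 'AML_APP_ID', 'AML_APP_SECRET')\n",
"\n",
"def load_service_principal_settings(env=None):\n",
"    # Return (tenant_id, app_id, password) read from a mapping of environment variables.\n",
"    env = os.environ if env is None else env\n",
"    missing = [name for name in SETTING_NAMES if not env.get(name)]\n",
"    if missing:\n",
"        raise KeyError('Missing environment variables: ' + ', '.join(missing))\n",
"    return tuple(env[name] for name in SETTING_NAMES)\n",
"\n",
"# Demonstration with an in-memory mapping instead of the real environment.\n",
"example_env = {'AML_TENANT_ID': 'mytenantid', 'AML_APP_ID': 'myappid', 'AML_APP_SECRET': 'mypassword'}\n",
"settings = load_service_principal_settings(example_env)\n",
"print(settings[0], settings[1])\n",
"```\n",
"\n",
"The returned tuple is ordered to match the positional arguments of the `ServicePrincipalAuthentication` call shown above."
]
},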
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"ws = Workspace.from_config()\n",
|
|
"\n",
|
|
"# Choose a name for the experiment.\n",
|
|
"experiment_name = 'automl-classification'\n",
|
|
"\n",
|
|
"experiment = Experiment(ws, experiment_name)\n",
|
|
"\n",
|
|
"output = {}\n",
|
|
"output['SDK version'] = azureml.core.VERSION\n",
|
|
"output['Subscription ID'] = ws.subscription_id\n",
|
|
"output['Workspace Name'] = ws.name\n",
|
|
"output['Resource Group'] = ws.resource_group\n",
|
|
"output['Location'] = ws.location\n",
|
|
"output['Experiment Name'] = experiment.name\n",
|
|
"pd.set_option('display.max_colwidth', None)  # None disables truncation (pandas 1.0+, where -1 is deprecated)\n",
|
|
"outputDf = pd.DataFrame(data = output, index = [''])\n",
|
|
"outputDf.T"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Data\n",
|
|
"\n",
|
|
"This uses scikit-learn's [load_digits](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) method."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"digits = datasets.load_digits()\n",
|
|
"\n",
|
|
"# Exclude the first 100 rows from training so that they can be used for test.\n",
|
|
"X_train = digits.data[100:,:]\n",
|
|
"y_train = digits.target[100:]"
|
|
]
|
|
},
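{
"cell_type": "markdown",
"metadata": {},
"source": [
"The slicing above holds out the first 100 rows so that test data never appears in training. A minimal stand-alone sketch of the same split, assuming a plain list of row indices as a stand-in for `digits.data` (the digits dataset has 1,797 samples); the names `rows`, `holdout_size`, `test_rows` and `train_rows` are illustrative only:\n",
"\n",
"```python\n",
"# Stand-in for the 1,797 digit samples; real code would slice digits.data instead.\n",
"rows = list(range(1797))\n",
"holdout_size = 100\n",
"\n",
"test_rows = rows[:holdout_size]    # first 100 rows, reserved for testing\n",
"train_rows = rows[holdout_size:]   # remaining rows, used for training\n",
"\n",
"# The two sets must not share any row.\n",
"overlap = set(train_rows) & set(test_rows)\n",
"print(len(train_rows), len(test_rows), len(overlap))\n",
"```\n",
"\n",
"The Test section later draws its samples from the first 10 rows, which fall inside this held-out range."
]
},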
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Train\n",
|
|
"\n",
|
|
"Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.\n",
|
|
"\n",
|
|
"|Property|Description|\n",
|
|
"|-|-|\n",
|
|
"|**task**|classification or regression|\n",
|
|
"|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|\n",
|
|
"|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n",
|
|
"|**y**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|\n",
|
|
"|**n_cross_validations**|Number of cross validation splits.|\n",
|
|
"\n",
|
|
"Automated machine learning trains multiple machine learning pipelines. Each pipeline's training run is known as an iteration.\n",
|
|
"* You can specify a maximum number of iterations using the `iterations` parameter.\n",
|
|
"* You can specify a maximum time for the run using the `experiment_timeout_minutes` parameter.\n",
|
|
"* If you specify neither the `iterations` nor the `experiment_timeout_minutes`, automated ML keeps running iterations while it continues to see improvements in the scores.\n",
|
|
"\n",
|
|
"The following example doesn't specify `iterations` or `experiment_timeout_minutes` and so runs until the scores stop improving.\n"
|
|
]
|
|
},
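{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three stopping conditions described above can be pictured with a small stand-alone loop. This is only a sketch under stated assumptions: the function `run_iterations`, its `patience` parameter and the sample scores are hypothetical, and the rule AutoML actually uses to decide that scores have stopped improving is not documented in this notebook.\n",
"\n",
"```python\n",
"import time\n",
"\n",
"def run_iterations(scores, iterations=None, timeout_seconds=None, patience=3):\n",
"    # Consume scores until an iteration cap, a timeout, or a run of non-improving scores.\n",
"    best = float('-inf')\n",
"    stale = 0\n",
"    completed = 0\n",
"    started = time.monotonic()\n",
"    for score in scores:\n",
"        if iterations is not None and completed >= iterations:\n",
"            break\n",
"        if timeout_seconds is not None and time.monotonic() - started >= timeout_seconds:\n",
"            break\n",
"        completed += 1\n",
"        if score > best:\n",
"            best, stale = score, 0\n",
"        else:\n",
"            stale += 1\n",
"        # The improvement-based stop applies only when no explicit limit is set.\n",
"        if iterations is None and timeout_seconds is None and stale >= patience:\n",
"            break\n",
"    return completed, best\n",
"\n",
"# Illustrative scores, not results from a real run.\n",
"sample_scores = [0.90, 0.93, 0.95, 0.95, 0.94, 0.95, 0.99]\n",
"print(run_iterations(sample_scores))\n",
"print(run_iterations(sample_scores, iterations=2))\n",
"```\n",
"\n",
"With no limits set, the loop stops after three consecutive iterations fail to beat the best score, so it never reaches the final value in the list."
]
},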
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"automl_config = AutoMLConfig(task = 'classification',\n",
"                             primary_metric = 'AUC_weighted',\n",
"                             X = X_train,\n",
"                             y = y_train,\n",
"                             n_cross_validations = 3)"
|
|
]
|
|
},
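{
"cell_type": "markdown",
"metadata": {},
"source": [
"Setting `n_cross_validations = 3` asks for three cross validation splits. As a rough illustration of what a three-way split means, the sketch below partitions row indices into contiguous folds and uses each fold once for validation. The helper `k_fold_indices` is hypothetical, and AutoML may build its splits differently (for example with shuffling or stratification).\n",
"\n",
"```python\n",
"def k_fold_indices(n_samples, n_splits):\n",
"    # Yield (train, validation) index lists, one pair per fold.\n",
"    indices = list(range(n_samples))\n",
"    base, extra = divmod(n_samples, n_splits)\n",
"    start = 0\n",
"    for fold in range(n_splits):\n",
"        # The first `extra` folds take one additional sample each.\n",
"        size = base + (1 if fold < extra else 0)\n",
"        validation = indices[start:start + size]\n",
"        train = indices[:start] + indices[start + size:]\n",
"        yield train, validation\n",
"        start += size\n",
"\n",
"# 1,697 is the number of training rows left after holding out 100 of 1,797.\n",
"folds = list(k_fold_indices(1697, 3))\n",
"for train, validation in folds:\n",
"    print(len(train), len(validation))\n",
"```\n",
"\n",
"Every row lands in exactly one validation fold, so each pipeline is scored on data it was not trained on in that split."
]
},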
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Call the `submit` method on the experiment object and pass the `AutoMLConfig` object. Local runs execute synchronously, so depending on the data and the number of iterations this call can run for a while.\n",
|
|
"In this example, we specify `show_output = True` to print currently running iterations to the console."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"local_run = experiment.submit(automl_config, show_output = True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"local_run"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Optionally, you can continue an interrupted local run by calling `continue_experiment` without the `iterations` parameter, or run more iterations for a completed run by specifying the `iterations` parameter:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"local_run = local_run.continue_experiment(X = X_train,\n",
"                                          y = y_train,\n",
"                                          show_output = True,\n",
"                                          iterations = 5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Results"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Widget for Monitoring Runs\n",
|
|
"\n",
|
|
"The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n",
|
|
"\n",
|
|
"**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": [
|
|
"widget-rundetails-sample"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.widgets import RunDetails\n",
|
|
"RunDetails(local_run).show() "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\n",
|
|
"#### Retrieve All Child Runs\n",
|
|
"You can also use SDK methods to fetch all the child runs and see individual metrics that we log."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"children = list(local_run.get_children())\n",
|
|
"metricslist = {}\n",
|
|
"for run in children:\n",
"    properties = run.get_properties()\n",
"    # Keep only scalar metrics so that every table cell holds a single number.\n",
"    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
"    metricslist[int(properties['iteration'])] = metrics\n",
"\n",
"# One column per iteration, ordered by iteration number.\n",
"rundata = pd.DataFrame(metricslist).sort_index(axis = 1)\n",
|
|
"rundata"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Retrieve the Best Model\n",
|
|
"\n",
|
|
"Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The fitted model includes the pipeline and any pre-processing. Optional arguments to `get_output` let you retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"best_run, fitted_model = local_run.get_output()\n",
|
|
"print(best_run)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Print the properties of the model\n",
|
|
"`fitted_model` is a Python object, and you can read its properties.\n",
|
|
"The following shows printing hyperparameters for each step in the pipeline."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from pprint import pprint\n",
|
|
"\n",
|
|
"def print_model(model, prefix=\"\"):\n",
"    for step in model.steps:\n",
"        print(prefix + step[0])\n",
"        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):\n",
"            pprint({'estimators': list(e[0] for e in step[1].estimators), 'weights': step[1].weights})\n",
"            print()\n",
"            for estimator in step[1].estimators:\n",
"                print_model(estimator[1], estimator[0]+ ' - ')\n",
"        elif hasattr(step[1], '_base_learners') and hasattr(step[1], '_meta_learner'):\n",
"            print(\"\\nMeta Learner\")\n",
"            pprint(step[1]._meta_learner)\n",
"            print()\n",
"            for estimator in step[1]._base_learners:\n",
"                print_model(estimator[1], estimator[0]+ ' - ')\n",
"        else:\n",
"            pprint(step[1].get_params())\n",
"            print()\n",
"\n",
"print_model(fitted_model)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Best Model Based on Any Other Metric\n",
|
|
"Show the run and the model that has the smallest `log_loss` value:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"lookup_metric = \"log_loss\"\n",
|
|
"best_run, fitted_model = local_run.get_output(metric = lookup_metric)\n",
|
|
"print(best_run)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print_model(fitted_model)"
|
|
]
|
|
},
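{
"cell_type": "markdown",
"metadata": {},
"source": [
"Choosing the best run for a metric depends on the metric's direction: `log_loss` is better when smaller, while a score such as `AUC_weighted` is better when larger. The stand-alone sketch below shows that selection over a table of per-iteration metrics. The values are made up for illustration, and the names `metrics_by_iteration`, `LOWER_IS_BETTER` and `best_iteration` are hypothetical; this is not how `get_output` is implemented.\n",
"\n",
"```python\n",
"# Made-up metric values for illustration only; these are not results from this notebook.\n",
"metrics_by_iteration = {\n",
"    0: {'AUC_weighted': 0.970, 'log_loss': 0.41},\n",
"    1: {'AUC_weighted': 0.991, 'log_loss': 0.22},\n",
"    2: {'AUC_weighted': 0.985, 'log_loss': 0.18},\n",
"}\n",
"\n",
"# Metrics where a smaller value is better; everything else is maximized.\n",
"LOWER_IS_BETTER = {'log_loss'}\n",
"\n",
"def best_iteration(metrics, metric):\n",
"    # Return the iteration number with the best value for the given metric.\n",
"    choose = min if metric in LOWER_IS_BETTER else max\n",
"    return choose(metrics, key=lambda iteration: metrics[iteration][metric])\n",
"\n",
"best_by_auc = best_iteration(metrics_by_iteration, 'AUC_weighted')\n",
"best_by_log_loss = best_iteration(metrics_by_iteration, 'log_loss')\n",
"print(best_by_auc, best_by_log_loss)\n",
"```\n",
"\n",
"In this made-up table the two metrics disagree, which is why the best run for `log_loss` can differ from the best run for the primary metric."
]
},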
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Model from a Specific Iteration\n",
|
|
"Show the run and the model from the third iteration:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"iteration = 3\n",
|
|
"third_run, third_model = local_run.get_output(iteration = iteration)\n",
|
|
"print(third_run)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print_model(third_model)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Test \n",
|
|
"\n",
|
|
"#### Load Test Data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"digits = datasets.load_digits()\n",
|
|
"X_test = digits.data[:10, :]\n",
|
|
"y_test = digits.target[:10]\n",
|
|
"images = digits.images[:10]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Testing Our Best Fitted Model\n",
|
|
"We will try to predict 2 digits and see how our model works."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Randomly select digits and test.\n",
|
|
"for index in np.random.choice(len(y_test), 2, replace = False):\n",
"    print(index)\n",
"    predicted = fitted_model.predict(X_test[index:index + 1])[0]\n",
"    label = y_test[index]\n",
"    title = \"Label value = %d Predicted value = %d \" % (label, predicted)\n",
"    fig = plt.figure(1, figsize = (3,3))\n",
"    ax1 = fig.add_axes((0,0,.8,.8))\n",
"    ax1.set_title(title)\n",
"    plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n",
"    plt.show()"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "savitam"
|
|
}
|
|
],
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.6"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}