From fa7685f6fa4382e979dc26736ea2ed2e21a9ac6e Mon Sep 17 00:00:00 2001 From: Jeff Shepherd Date: Fri, 15 Mar 2019 13:18:17 -0700 Subject: [PATCH] Added example of printing model hyperparameters --- .../auto-ml-classification.ipynb | 835 +++++++++--------- 1 file changed, 441 insertions(+), 394 deletions(-) diff --git a/how-to-use-azureml/automated-machine-learning/classification/auto-ml-classification.ipynb b/how-to-use-azureml/automated-machine-learning/classification/auto-ml-classification.ipynb index 5455c02c..3594af18 100644 --- a/how-to-use-azureml/automated-machine-learning/classification/auto-ml-classification.ipynb +++ b/how-to-use-azureml/automated-machine-learning/classification/auto-ml-classification.ipynb @@ -1,396 +1,443 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Copyright (c) Microsoft Corporation. All rights reserved.\n", - "\n", - "Licensed under the MIT License." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Automated Machine Learning\n", - "_**Classification with Local Compute**_\n", - "\n", - "## Contents\n", - "1. [Introduction](#Introduction)\n", - "1. [Setup](#Setup)\n", - "1. [Data](#Data)\n", - "1. [Train](#Train)\n", - "1. [Results](#Results)\n", - "1. [Test](#Test)\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Introduction\n", - "\n", - "In this example we use the scikit-learn's [digit dataset](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) to showcase how you can use AutoML for a simple classification problem.\n", - "\n", - "Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n", - "\n", - "In this notebook you will learn how to:\n", - "1. Create an `Experiment` in an existing `Workspace`.\n", - "2. Configure AutoML using `AutoMLConfig`.\n", - "3. Train the model using local compute.\n", - "4. Explore the results.\n", - "5. Test the best fitted model." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setup\n", - "\n", - "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import logging\n", - "\n", - "from matplotlib import pyplot as plt\n", - "import numpy as np\n", - "import pandas as pd\n", - "from sklearn import datasets\n", - "\n", - "import azureml.core\n", - "from azureml.core.experiment import Experiment\n", - "from azureml.core.workspace import Workspace\n", - "from azureml.train.automl import AutoMLConfig" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ws = Workspace.from_config()\n", - "\n", - "# Choose a name for the experiment and specify the project folder.\n", - "experiment_name = 'automl-classification'\n", - "project_folder = './sample_projects/automl-classification'\n", - "\n", - "experiment = Experiment(ws, experiment_name)\n", - "\n", - "output = {}\n", - "output['SDK version'] = azureml.core.VERSION\n", - "output['Subscription ID'] = ws.subscription_id\n", - "output['Workspace Name'] = ws.name\n", - "output['Resource Group'] = ws.resource_group\n", - "output['Location'] = ws.location\n", - "output['Project Directory'] = project_folder\n", - "output['Experiment Name'] = experiment.name\n", - "pd.set_option('display.max_colwidth', -1)\n", - "outputDf = pd.DataFrame(data = output, index = [''])\n", - "outputDf.T" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Data\n", - "\n", - "This uses scikit-learn's [load_digits](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) method." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "digits = datasets.load_digits()\n", - "\n", - "# Exclude the first 100 rows from training so that they can be used for test.\n", - "X_train = digits.data[100:,:]\n", - "y_train = digits.target[100:]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Train\n", - "\n", - "Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.\n", - "\n", - "|Property|Description|\n", - "|-|-|\n", - "|**task**|classification or regression|\n", - "|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics:
accuracy
AUC_weighted
average_precision_score_weighted
norm_macro_recall
precision_score_weighted|\n", - "|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|\n", - "|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n", - "|**n_cross_validations**|Number of cross validation splits.|\n", - "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n", - "|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]
Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers.|\n", - "|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "automl_config = AutoMLConfig(task = 'classification',\n", - " debug_log = 'automl_errors.log',\n", - " primary_metric = 'AUC_weighted',\n", - " iteration_timeout_minutes = 60,\n", - " iterations = 25,\n", - " n_cross_validations = 3,\n", - " verbosity = logging.INFO,\n", - " X = X_train, \n", - " y = y_train,\n", - " path = project_folder)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.\n", - "In this example, we specify `show_output = True` to print currently running iterations to the console." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "local_run = experiment.submit(automl_config, show_output = True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "local_run" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Optionally, you can continue an interrupted local run by calling `continue_experiment` without the `iterations` parameter, or run more iterations for a completed run by specifying the `iterations` parameter:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "local_run = local_run.continue_experiment(X = X_train, \n", - " y = y_train, \n", - " show_output = True,\n", - " iterations = 5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Results" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Widget for Monitoring Runs\n", - "\n", - "The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n", - "\n", - "**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azureml.widgets import RunDetails\n", - "RunDetails(local_run).show() " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "#### Retrieve All Child Runs\n", - "You can also use SDK methods to fetch all the child runs and see individual metrics that we log." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "children = list(local_run.get_children())\n", - "metricslist = {}\n", - "for run in children:\n", - " properties = run.get_properties()\n", - " metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n", - " metricslist[int(properties['iteration'])] = metrics\n", - "\n", - "rundata = pd.DataFrame(metricslist).sort_index(1)\n", - "rundata" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Retrieve the Best Model\n", - "\n", - "Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The Model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "best_run, fitted_model = local_run.get_output()\n", - "print(best_run)\n", - "print(fitted_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Best Model Based on Any Other Metric\n", - "Show the run and the model that has the smallest `log_loss` value:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "lookup_metric = \"log_loss\"\n", - "best_run, fitted_model = local_run.get_output(metric = lookup_metric)\n", - "print(best_run)\n", - "print(fitted_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Model from a Specific Iteration\n", - "Show the run and the model from the third iteration:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "iteration = 3\n", - "third_run, third_model = local_run.get_output(iteration = iteration)\n", - "print(third_run)\n", - "print(third_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Test \n", - "\n", - "#### Load Test Data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "digits = datasets.load_digits()\n", - "X_test = digits.data[:10, :]\n", - "y_test = digits.target[:10]\n", - "images = digits.images[:10]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Testing Our Best Fitted Model\n", - "We will try to predict 2 digits and see how our model works." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Randomly select digits and test.\n", - "for index in np.random.choice(len(y_test), 2, replace = False):\n", - " print(index)\n", - " predicted = fitted_model.predict(X_test[index:index + 1])[0]\n", - " label = y_test[index]\n", - " title = \"Label value = %d Predicted value = %d \" % (label, predicted)\n", - " fig = plt.figure(1, figsize = (3,3))\n", - " ax1 = fig.add_axes((0,0,.8,.8))\n", - " ax1.set_title(title)\n", - " plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n", - " plt.show()" - ] - } - ], - "metadata": { - "authors": [ - { - "name": "savitam" - } - ], - "kernelspec": { - "display_name": "Python 3.6", - "language": "python", - "name": "python36" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - } + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copyright (c) Microsoft Corporation. All rights reserved.\n", + "\n", + "Licensed under the MIT License." + ] }, - "nbformat": 4, - "nbformat_minor": 2 -} \ No newline at end of file + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Automated Machine Learning\n", + "_**Classification with Local Compute**_\n", + "\n", + "## Contents\n", + "1. [Introduction](#Introduction)\n", + "1. [Setup](#Setup)\n", + "1. [Data](#Data)\n", + "1. [Train](#Train)\n", + "1. [Results](#Results)\n", + "1. [Test](#Test)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "In this example we use the scikit-learn's [digit dataset](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) to showcase how you can use AutoML for a simple classification problem.\n", + "\n", + "Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n", + "\n", + "In this notebook you will learn how to:\n", + "1. Create an `Experiment` in an existing `Workspace`.\n", + "2. Configure AutoML using `AutoMLConfig`.\n", + "3. Train the model using local compute.\n", + "4. Explore the results.\n", + "5. Test the best fitted model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import logging\n", + "\n", + "from matplotlib import pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import datasets\n", + "\n", + "import azureml.core\n", + "from azureml.core.experiment import Experiment\n", + "from azureml.core.workspace import Workspace\n", + "from azureml.train.automl import AutoMLConfig" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ws = Workspace.from_config()\n", + "\n", + "# Choose a name for the experiment and specify the project folder.\n", + "experiment_name = 'automl-classification'\n", + "project_folder = './sample_projects/automl-classification'\n", + "\n", + "experiment = Experiment(ws, experiment_name)\n", + "\n", + "output = {}\n", + "output['SDK version'] = azureml.core.VERSION\n", + "output['Subscription ID'] = ws.subscription_id\n", + "output['Workspace Name'] = ws.name\n", + "output['Resource Group'] = ws.resource_group\n", + "output['Location'] = ws.location\n", + "output['Project Directory'] = project_folder\n", + "output['Experiment Name'] = experiment.name\n", + "pd.set_option('display.max_colwidth', -1)\n", + "outputDf = pd.DataFrame(data = output, index = [''])\n", + "outputDf.T" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data\n", + "\n", + "This uses scikit-learn's [load_digits](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "digits = datasets.load_digits()\n", + "\n", + "# Exclude the first 100 rows from training so that they can be used for test.\n", + "X_train = digits.data[100:,:]\n", + "y_train = digits.target[100:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train\n", + "\n", + "Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.\n", + "\n", + "|Property|Description|\n", + "|-|-|\n", + "|**task**|classification or regression|\n", + "|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics:
accuracy
AUC_weighted
average_precision_score_weighted
norm_macro_recall
precision_score_weighted|\n", + "|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|\n", + "|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n", + "|**n_cross_validations**|Number of cross validation splits.|\n", + "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n", + "|**y**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|\n", + "|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "automl_config = AutoMLConfig(task = 'classification',\n", + " debug_log = 'automl_errors.log',\n", + " primary_metric = 'AUC_weighted',\n", + " iteration_timeout_minutes = 60,\n", + " iterations = 25,\n", + " n_cross_validations = 3,\n", + " verbosity = logging.INFO,\n", + " X = X_train, \n", + " y = y_train,\n", + " path = project_folder)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.\n", + "In this example, we specify `show_output = True` to print currently running iterations to the console." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_run = experiment.submit(automl_config, show_output = True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_run" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Optionally, you can continue an interrupted local run by calling `continue_experiment` without the `iterations` parameter, or run more iterations for a completed run by specifying the `iterations` parameter:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_run = local_run.continue_experiment(X = X_train, \n", + " y = y_train, \n", + " show_output = True,\n", + " iterations = 5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Widget for Monitoring Runs\n", + "\n", + "The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n", + "\n", + "**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azureml.widgets import RunDetails\n", + "RunDetails(local_run).show() " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "#### Retrieve All Child Runs\n", + "You can also use SDK methods to fetch all the child runs and see individual metrics that we log." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "children = list(local_run.get_children())\n", + "metricslist = {}\n", + "for run in children:\n", + " properties = run.get_properties()\n", + " metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n", + " metricslist[int(properties['iteration'])] = metrics\n", + "\n", + "rundata = pd.DataFrame(metricslist).sort_index(1)\n", + "rundata" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Retrieve the Best Model\n", + "\n", + "Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The Model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "best_run, fitted_model = local_run.get_output()\n", + "print(best_run)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Print the properties of the model\n", + "The fitted_model is a python object and you can read the different properties of the object.\n", + "The following shows printing hyperparameters for each step in the pipeline." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pprint import pprint\n", + "\n", + "def print_model(model, prefix=\"\"):\n", + " for step in model.steps:\n", + " print(prefix + step[0])\n", + " if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):\n", + " pprint({'estimators': list(e[0] for e in step[1].estimators), 'weights': step[1].weights})\n", + " print()\n", + " for estimator in step[1].estimators:\n", + " print_model(estimator[1], estimator[0]+ ' - ')\n", + " else:\n", + " pprint(step[1].get_params())\n", + " print()\n", + " \n", + "print_model(fitted_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Best Model Based on Any Other Metric\n", + "Show the run and the model that has the smallest `log_loss` value:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "lookup_metric = \"log_loss\"\n", + "best_run, fitted_model = local_run.get_output(metric = lookup_metric)\n", + "print(best_run)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print_model(fitted_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Model from a Specific Iteration\n", + "Show the run and the model from the third iteration:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "iteration = 3\n", + "third_run, third_model = local_run.get_output(iteration = iteration)\n", + "print(third_run)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print_model(third_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Test \n", + "\n", + "#### Load Test Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "digits = datasets.load_digits()\n", + "X_test = digits.data[:10, :]\n", + "y_test = digits.target[:10]\n", + "images = digits.images[:10]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Testing Our Best Fitted Model\n", + "We will try to predict 2 digits and see how our model works." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Randomly select digits and test.\n", + "for index in np.random.choice(len(y_test), 2, replace = False):\n", + " print(index)\n", + " predicted = fitted_model.predict(X_test[index:index + 1])[0]\n", + " label = y_test[index]\n", + " title = \"Label value = %d Predicted value = %d \" % (label, predicted)\n", + " fig = plt.figure(1, figsize = (3,3))\n", + " ax1 = fig.add_axes((0,0,.8,.8))\n", + " ax1.set_title(title)\n", + " plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n", + " plt.show()" + ] + } + ], + "metadata": { + "authors": [ + { + "name": "savitam" + } + ], + "kernelspec": { + "display_name": "Python 3.6", + "language": "python", + "name": "python36" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.6" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}