{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright (c) Microsoft Corporation. All rights reserved.\n", "\n", "Licensed under the MIT License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/classification/auto-ml-classification.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Automated Machine Learning\n", "_**Classification with Local Compute**_\n", "\n", "## Contents\n", "1. [Introduction](#Introduction)\n", "1. [Setup](#Setup)\n", "1. [Data](#Data)\n", "1. [Train](#Train)\n", "1. [Results](#Results)\n", "1. [Test](#Test)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "In this example we use the scikit-learn's [digit dataset](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) to showcase how you can use AutoML for a simple classification problem.\n", "\n", "Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n", "\n", "In this notebook you will learn how to:\n", "1. Create an `Experiment` in an existing `Workspace`.\n", "2. Configure AutoML using `AutoMLConfig`.\n", "3. Train the model using local compute.\n", "4. Explore the results.\n", "5. Test the best fitted model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "from matplotlib import pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn import datasets\n", "\n", "import azureml.core\n", "from azureml.core.experiment import Experiment\n", "from azureml.core.workspace import Workspace\n", "from azureml.train.automl import AutoMLConfig" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accessing the Azure ML workspace requires authentication with Azure.\n", "\n", "The default authentication is interactive authentication using the default tenant. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Accessing the Azure ML workspace requires authentication with Azure.\n", "\n", "The default authentication is interactive authentication using the default tenant. Executing the `ws = Workspace.from_config()` line in the cell below will prompt for authentication the first time that it is run.\n", "\n", "If you have multiple Azure tenants, you can specify the tenant by replacing the `ws = Workspace.from_config()` line in the cell below with the following:\n", "\n", "```\n", "from azureml.core.authentication import InteractiveLoginAuthentication\n", "auth = InteractiveLoginAuthentication(tenant_id = 'mytenantid')\n", "ws = Workspace.from_config(auth = auth)\n", "```\n", "\n", "If you need to run in an environment where interactive login is not possible, you can use Service Principal authentication by replacing the `ws = Workspace.from_config()` line in the cell below with the following:\n", "\n", "```\n", "from azureml.core.authentication import ServicePrincipalAuthentication\n", "auth = ServicePrincipalAuthentication('mytenantid', 'myappid', 'mypassword')\n", "ws = Workspace.from_config(auth = auth)\n", "```\n", "\n", "For more details, see [aka.ms/aml-notebook-auth](http://aka.ms/aml-notebook-auth)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ws = Workspace.from_config()\n", "\n", "# Choose a name for the experiment and specify the project folder.\n", "experiment_name = 'automl-classification'\n", "project_folder = './sample_projects/automl-classification'\n", "\n", "experiment = Experiment(ws, experiment_name)\n", "\n", "output = {}\n", "output['SDK version'] = azureml.core.VERSION\n", "output['Subscription ID'] = ws.subscription_id\n", "output['Workspace Name'] = ws.name\n", "output['Resource Group'] = ws.resource_group\n", "output['Location'] = ws.location\n", "output['Project Directory'] = project_folder\n", "output['Experiment Name'] = experiment.name\n", "pd.set_option('display.max_colwidth', -1)\n", "outputDf = pd.DataFrame(data = output, index = [''])\n", "outputDf.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "This example uses scikit-learn's [load_digits](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "digits = datasets.load_digits()\n", "\n", "# Exclude the first 100 rows from training so that they can be used for testing.\n", "X_train = digits.data[100:, :]\n", "y_train = digits.target[100:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train\n", "\n", "Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.\n", "\n", "|Property|Description|\n", "|-|-|\n", "|**task**|classification or regression|\n", "|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|\n", "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n", "|**y**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|\n", "|**n_cross_validations**|Number of cross-validation splits.|\n", "\n", "Automated machine learning trains multiple machine learning pipelines. Each pipeline's training is known as an iteration.\n", "* You can specify a maximum number of iterations using the `iterations` parameter.\n", "* You can specify a maximum time for the run using the `experiment_timeout_minutes` parameter.\n", "* If you specify neither `iterations` nor `experiment_timeout_minutes`, automated ML keeps running iterations as long as it continues to see improvements in the scores.\n", "\n", "The following example doesn't specify `iterations` or `experiment_timeout_minutes`, so it runs until the scores stop improving; a bounded alternative is sketched right after it.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl_config = AutoMLConfig(task = 'classification',\n", "                             primary_metric = 'AUC_weighted',\n", "                             X = X_train,\n", "                             y = y_train,\n", "                             n_cross_validations = 3)" ] },
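{ "cell_type": "markdown", "metadata": {}, "source": [ "If you would rather bound the run, set either or both of the parameters described above. The following cell is a minimal sketch assuming the same training data; the specific values are illustrative, not recommendations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A bounded alternative (illustrative values): stop after 10 iterations\n", "# or 15 minutes, whichever comes first.\n", "bounded_automl_config = AutoMLConfig(task = 'classification',\n", "                                     primary_metric = 'AUC_weighted',\n", "                                     X = X_train,\n", "                                     y = y_train,\n", "                                     n_cross_validations = 3,\n", "                                     iterations = 10,\n", "                                     experiment_timeout_minutes = 15)" ] },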
{ "cell_type": "markdown", "metadata": {}, "source": [ "Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations, this can run for a while.\n", "In this example, we specify `show_output = True` to print currently running iterations to the console." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_run = experiment.submit(automl_config, show_output = True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_run" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, you can continue an interrupted local run by calling `continue_experiment` without the `iterations` parameter, or run more iterations for a completed run by specifying the `iterations` parameter:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_run = local_run.continue_experiment(X = X_train,\n", "                                          y = y_train,\n", "                                          show_output = True,\n", "                                          iterations = 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Widget for Monitoring Runs\n", "\n", "The widget will first report a \"loading\" status while running the first iteration. After the first iteration completes, an auto-updating graph and table will be shown. The widget refreshes once per minute, so you should see the graph update as child runs complete.\n", "\n", "**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "widget-rundetails-sample" ] }, "outputs": [], "source": [ "from azureml.widgets import RunDetails\n", "RunDetails(local_run).show()" ] },
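{ "cell_type": "markdown", "metadata": {}, "source": [ "If you only need the web link mentioned in the note above, without rendering the widget, the run object can produce it directly. This is a minimal sketch, assuming your SDK version provides the `Run.get_portal_url` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print the Azure portal URL for this run (the same destination as the widget link).\n", "print(local_run.get_portal_url())" ] },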
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### Retrieve All Child Runs\n", "You can also use SDK methods to fetch all the child runs and see the individual metrics that we log." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "children = list(local_run.get_children())\n", "metricslist = {}\n", "for run in children:\n", "    properties = run.get_properties()\n", "    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n", "    metricslist[int(properties['iteration'])] = metrics\n", "\n", "rundata = pd.DataFrame(metricslist).sort_index(axis = 1)\n", "rundata" ] },
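{ "cell_type": "markdown", "metadata": {}, "source": [ "In `rundata`, rows are metric names and columns are iteration numbers, so you can, for example, look up the iteration that scored best on a given metric. A small sketch, assuming `AUC_weighted` is among the logged metrics:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Find the iteration with the highest AUC_weighted (rows are metrics, columns are iterations).\n", "best_iteration = rundata.loc['AUC_weighted'].idxmax()\n", "print('Best AUC_weighted at iteration', best_iteration)" ] },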
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Retrieve the Best Model\n", "\n", "Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_run, fitted_model = local_run.get_output()\n", "print(best_run)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Print the properties of the model\n", "`fitted_model` is a Python object, and you can inspect its properties.\n", "The following prints the hyperparameters for each step in the pipeline." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pprint import pprint\n", "\n", "def print_model(model, prefix=\"\"):\n", "    for step in model.steps:\n", "        print(prefix + step[0])\n", "        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):\n", "            pprint({'estimators': list(e[0] for e in step[1].estimators), 'weights': step[1].weights})\n", "            print()\n", "            for estimator in step[1].estimators:\n", "                print_model(estimator[1], estimator[0] + ' - ')\n", "        elif hasattr(step[1], '_base_learners') and hasattr(step[1], '_meta_learner'):\n", "            print(\"\\nMeta Learner\")\n", "            pprint(step[1]._meta_learner)\n", "            print()\n", "            for estimator in step[1]._base_learners:\n", "                print_model(estimator[1], estimator[0] + ' - ')\n", "        else:\n", "            pprint(step[1].get_params())\n", "            print()\n", "\n", "print_model(fitted_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Best Model Based on Any Other Metric\n", "Show the run and the model that has the smallest `log_loss` value:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lookup_metric = \"log_loss\"\n", "best_run, fitted_model = local_run.get_output(metric = lookup_metric)\n", "print(best_run)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print_model(fitted_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Model from a Specific Iteration\n", "Show the run and the model from the third iteration:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iteration = 3\n", "third_run, third_model = local_run.get_output(iteration = iteration)\n", "print(third_run)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print_model(third_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test\n", "\n", "#### Load Test Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "digits = datasets.load_digits()\n", "X_test = digits.data[:10, :]\n", "y_test = digits.target[:10]\n", "images = digits.images[:10]" ] },
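{ "cell_type": "markdown", "metadata": {}, "source": [ "Before plotting individual digits, it can be useful to score the model on the whole held-out slice. This is a minimal sketch using scikit-learn's `accuracy_score`; it is not part of the original flow." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "# Predict all 10 held-out rows and compare against the true labels.\n", "y_pred = fitted_model.predict(X_test)\n", "print('Accuracy on the held-out rows: %.2f' % accuracy_score(y_test, y_pred))" ] },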
"markdown", "metadata": {}, "source": [ "#### Testing Our Best Fitted Model\n", "We will try to predict 2 digits and see how our model works." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Randomly select digits and test.\n", "for index in np.random.choice(len(y_test), 2, replace = False):\n", " print(index)\n", " predicted = fitted_model.predict(X_test[index:index + 1])[0]\n", " label = y_test[index]\n", " title = \"Label value = %d Predicted value = %d \" % (label, predicted)\n", " fig = plt.figure(1, figsize = (3,3))\n", " ax1 = fig.add_axes((0,0,.8,.8))\n", " ax1.set_title(title)\n", " plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n", " plt.show()" ] } ], "metadata": { "authors": [ { "name": "savitam" } ], "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }