{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright (c) Microsoft Corporation. All rights reserved.\n", "\n", "Licensed under the MIT License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# AutoML 05: Blacklisting Models, Early Termination, and Handling Missing Data\n", "\n", "In this example we use the scikit-learn's [digit dataset](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) to showcase how you can use AutoML for handling missing values in data. We also provide a stopping metric indicating a target for the primary metrics so that AutoML can terminate the run without necessarly going through all the iterations. Finally, if you want to avoid a certain pipeline, we allow you to specify a blacklist of algorithms that AutoML will ignore for this run.\n", "\n", "Make sure you have executed the [00.configuration](00.configuration.ipynb) before running this notebook.\n", "\n", "In this notebook you will learn how to:\n", "1. Create an `Experiment` in an existing `Workspace`.\n", "2. Configure AutoML using `AutoMLConfig`.\n", "4. Train the model.\n", "5. Explore the results.\n", "6. Test the best fitted model.\n", "\n", "In addition this notebook showcases the following features\n", "- **Blacklisting** certain pipelines\n", "- Specifying **target metrics** to indicate stopping criteria\n", "- Handling **missing data** in the input\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create an Experiment\n", "\n", "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import logging\n", "import os\n", "import random\n", "\n", "from matplotlib import pyplot as plt\n", "from matplotlib.pyplot import imshow\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn import datasets\n", "\n", "import azureml.core\n", "from azureml.core.experiment import Experiment\n", "from azureml.core.workspace import Workspace\n", "from azureml.train.automl import AutoMLConfig\n", "from azureml.train.automl.run import AutoMLRun" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ws = Workspace.from_config()\n", "\n", "# Choose a name for the experiment.\n", "experiment_name = 'automl-local-missing-data'\n", "project_folder = './sample_projects/automl-local-missing-data'\n", "\n", "experiment = Experiment(ws, experiment_name)\n", "\n", "output = {}\n", "output['SDK version'] = azureml.core.VERSION\n", "output['Subscription ID'] = ws.subscription_id\n", "output['Workspace'] = ws.name\n", "output['Resource Group'] = ws.resource_group\n", "output['Location'] = ws.location\n", "output['Project Directory'] = project_folder\n", "output['Experiment Name'] = experiment.name\n", "pd.set_option('display.max_colwidth', -1)\n", "pd.DataFrame(data=output, index=['']).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Diagnostics\n", "\n", "Opt-in diagnostics for better experience, quality, and security of future releases." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.telemetry import set_diagnostics_collection\n", "set_diagnostics_collection(send_diagnostics = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Missing Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy import sparse\n", "\n", "digits = datasets.load_digits()\n", "X_train = digits.data[10:,:]\n", "y_train = digits.target[10:]\n", "\n", "# Add missing values to 75% of the rows.\n", "missing_rate = 0.75\n", "n_missing_samples = int(np.floor(X_train.shape[0] * missing_rate))\n", "missing_samples = np.hstack((np.zeros(X_train.shape[0] - n_missing_samples, dtype=bool), np.ones(n_missing_samples, dtype=bool)))\n", "rng = np.random.RandomState(0)\n", "rng.shuffle(missing_samples)\n", "missing_features = rng.randint(0, X_train.shape[1], n_missing_samples)\n", "X_train[np.where(missing_samples)[0], missing_features] = np.nan" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(data = X_train)\n", "df['Label'] = pd.Series(y_train, index=df.index)\n", "df.head()" ] }
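, { "cell_type": "markdown", "metadata": {}, "source": [ "As an optional sanity check, the cell below counts how many rows of `X_train` now contain at least one missing value; with `missing_rate = 0.75` this should be roughly 75% of the rows. This is a minimal sketch that only uses the array created above and is not required for the rest of the notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Count the rows that contain at least one NaN and report the fraction.\n", "rows_with_nan = np.isnan(X_train).any(axis=1).sum()\n", "print('Rows with missing values: %d of %d (%.1f%%)' % (rows_with_nan, X_train.shape[0], 100.0 * rows_with_nan / X_train.shape[0]))" ] }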
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure AutoML\n", "\n", "Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment. This includes setting `exit_score`, which should cause the run to complete before the `iterations` count is reached.\n", "\n", "|Property|Description|\n", "|-|-|\n", "|**task**|classification or regression|\n", "|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br>accuracy<br>AUC_weighted<br>balanced_accuracy<br>average_precision_score_weighted<br>precision_score_weighted|\n", "|**max_time_sec**|Time limit in seconds for each iteration.|\n", "|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n", "|**n_cross_validations**|Number of cross validation splits.|\n", "|**preprocess**|Setting this to *True* enables AutoML to perform preprocessing on the input to handle *missing data*, and to perform some common *feature extraction*.|\n", "|**exit_score**|*double* value indicating the target for *primary_metric*. Once the target is surpassed the run terminates.|\n", "|**blacklist_algos**|*List* of *strings* indicating machine learning algorithms for AutoML to avoid in this run.<br><br>Allowed values for **Classification**<br>LogisticRegression<br>SGDClassifierWrapper<br>NBWrapper<br>BernoulliNB<br>SVCWrapper<br>LinearSVMWrapper<br>KNeighborsClassifier<br>DecisionTreeClassifier<br>RandomForestClassifier<br>ExtraTreesClassifier<br>LightGBMClassifier<br><br>Allowed values for **Regression**<br>ElasticNet<br>GradientBoostingRegressor<br>DecisionTreeRegressor<br>KNeighborsRegressor<br>LassoLars<br>SGDRegressor<br>RandomForestRegressor<br>ExtraTreesRegressor|\n", "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n", "|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]<br>Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers.|\n", "|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl_config = AutoMLConfig(task = 'classification',\n", "                             debug_log = 'automl_errors.log',\n", "                             primary_metric = 'AUC_weighted',\n", "                             max_time_sec = 3600,\n", "                             iterations = 20,\n", "                             n_cross_validations = 5,\n", "                             preprocess = True,\n", "                             exit_score = 0.9984,\n", "                             blacklist_algos = ['KNeighborsClassifier','LinearSVMWrapper'],\n", "                             verbosity = logging.INFO,\n", "                             X = X_train,\n", "                             y = y_train,\n", "                             path = project_folder)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train the Models\n", "\n", "Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations, this can run for a while.\n", "In this example, we specify `show_output = True` to print currently running iterations to the console." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_run = experiment.submit(automl_config, show_output = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explore the Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Widget for Monitoring Runs\n", "\n", "The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n", "\n", "**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.train.widgets import RunDetails\n", "RunDetails(local_run).show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### Retrieve All Child Runs\n", "You can also use SDK methods to fetch all the child runs and see the individual metrics that we log." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "children = list(local_run.get_children())\n", "metricslist = {}\n", "for run in children:\n", "    properties = run.get_properties()\n", "    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n", "    metricslist[int(properties['iteration'])] = metrics\n", "\n", "rundata = pd.DataFrame(metricslist).sort_index(axis=1)\n", "rundata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retrieve the Best Model\n", "\n", "Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The fitted model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_run, fitted_model = local_run.get_output()" ] }
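, { "cell_type": "markdown", "metadata": {}, "source": [ "Because `preprocess = True` was set, the fitted model returned above typically bundles the preprocessing (for example, imputation of the missing values injected earlier) together with the final estimator. The optional sketch below assumes `fitted_model` behaves like a scikit-learn `Pipeline` and lists its named steps; if the object does not expose a `steps` attribute, the cell simply prints its type." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: inspect the fitted model to see which preprocessing steps AutoML added.\n", "if hasattr(fitted_model, 'steps'):\n", "    for step_name, step in fitted_model.steps:\n", "        print(step_name, type(step).__name__)\n", "else:\n", "    print(type(fitted_model).__name__)" ] }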
, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Best Model Based on Any Other Metric\n", "Show the run and the model that has the best `accuracy` value:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# lookup_metric = \"accuracy\"\n", "# best_run, fitted_model = local_run.get_output(metric = lookup_metric)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Model from a Specific Iteration\n", "Show the run and the model from iteration 3:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# iteration = 3\n", "# best_run, fitted_model = local_run.get_output(iteration = iteration)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test the Best Fitted Model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "digits = datasets.load_digits()\n", "X_test = digits.data[:10, :]\n", "y_test = digits.target[:10]\n", "images = digits.images[:10]\n", "\n", "# Randomly select digits and test.\n", "for index in np.random.choice(len(y_test), 2, replace = False):\n", "    print(index)\n", "    predicted = fitted_model.predict(X_test[index:index + 1])[0]\n", "    label = y_test[index]\n", "    title = \"Label value = %d Predicted value = %d \" % (label, predicted)\n", "    fig = plt.figure(1, figsize=(3,3))\n", "    ax1 = fig.add_axes((0,0,.8,.8))\n", "    ax1.set_title(title)\n", "    plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n", "    plt.show()\n" ] } ], "metadata": { "authors": [ { "name": "savitam" } ], "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }