From dac7c5d5ca0c11ca38440eff38972ff8103b0b1b Mon Sep 17 00:00:00 2001
From: Roope Astala
Date: Fri, 21 Sep 2018 17:30:03 -0400
Subject: [PATCH] Delete 14b.auto-ml-regression-ensemble.ipynb

---
 automl/14b.auto-ml-regression-ensemble.ipynb | 442 -------------------
 1 file changed, 442 deletions(-)
 delete mode 100644 automl/14b.auto-ml-regression-ensemble.ipynb

diff --git a/automl/14b.auto-ml-regression-ensemble.ipynb b/automl/14b.auto-ml-regression-ensemble.ipynb
deleted file mode 100644
index d3855a37..00000000
--- a/automl/14b.auto-ml-regression-ensemble.ipynb
+++ /dev/null
@@ -1,442 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
-    "\n",
-    "Licensed under the MIT License."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# AutoML 14b: Regression with ensembling on local compute\n",
-    "\n",
-    "In this example we use scikit-learn's [diabetes dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html) to showcase how you can use AutoML for a simple regression problem.\n",
-    "\n",
-    "Make sure you have executed the [00.configuration](00.configuration.ipynb) notebook before running this notebook.\n",
-    "\n",
-    "In this notebook you will see\n",
-    "1. Creating an Experiment using an existing Workspace\n",
-    "2. Instantiating AutoMLConfig which enables an extra ensembling iteration\n",
-    "3. Training the Model using local compute\n",
-    "4. Exploring the results\n",
-    "5. Testing the fitted model\n",
-    "\n",
-    "\n",
-    "**Disclaimers / Limitations**\n",
-    "- currently only local compute is supported for the ensembling iteration; support for remote compute will be coming soon\n",
-    "- currently only Train/Validation split is supported; support for cross-validation will be coming soon"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Create Experiment\n",
-    "\n",
-    "As part of the setup you have already created a Workspace. For AutoML you need to create an Experiment. An Experiment is a named object in a Workspace, which is used to run experiments."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import logging\n",
-    "import os\n",
-    "import random\n",
-    "\n",
-    "from matplotlib import pyplot as plt\n",
-    "from matplotlib.pyplot import imshow\n",
-    "import numpy as np\n",
-    "import pandas as pd\n",
-    "from sklearn import datasets\n",
-    "\n",
-    "import azureml.core\n",
-    "from azureml.core.experiment import Experiment\n",
-    "from azureml.core.workspace import Workspace\n",
-    "from azureml.train.automl import AutoMLConfig\n",
-    "from azureml.train.automl.run import AutoMLRun"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "ws = Workspace.from_config()\n",
-    "\n",
-    "# choose a name for the experiment\n",
-    "experiment_name = 'automl-local-regression'\n",
-    "# project folder\n",
-    "project_folder = './sample_projects/automl-local-regression'\n",
-    "\n",
-    "experiment = Experiment(ws, experiment_name)\n",
-    "\n",
-    "output = {}\n",
-    "output['SDK version'] = azureml.core.VERSION\n",
-    "output['Subscription ID'] = ws.subscription_id\n",
-    "output['Workspace Name'] = ws.name\n",
-    "output['Resource Group'] = ws.resource_group\n",
-    "output['Location'] = ws.location\n",
-    "output['Project Directory'] = project_folder\n",
-    "output['Experiment Name'] = experiment.name\n",
-    "pd.set_option('display.max_colwidth', -1)\n",
-    "pd.DataFrame(data = output, index = ['']).T"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Diagnostics\n",
-    "\n",
-    "Opt-in diagnostics for better experience, quality, and security of future releases."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from azureml.telemetry import set_diagnostics_collection\n",
-    "set_diagnostics_collection(send_diagnostics=True)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Read Data"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# load diabetes dataset, a well-known built-in small dataset that comes with scikit-learn\n",
-    "from sklearn.datasets import load_diabetes\n",
-    "from sklearn.linear_model import Ridge\n",
-    "from sklearn.metrics import mean_squared_error\n",
-    "from sklearn.model_selection import train_test_split\n",
-    "\n",
-    "X, y = load_diabetes(return_X_y = True)\n",
-    "\n",
-    "columns = ['age', 'gender', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']\n",
-    "\n",
-    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Instantiate Auto ML Config\n",
-    "\n",
-    "Instantiate an AutoMLConfig object. This defines the settings and data used to run the experiment.\n",
-    "\n",
-    "|Property|Description|\n",
-    "|-|-|\n",
-    "|**task**|classification or regression|\n",
-    "|**primary_metric**|This is the metric that you want to optimize.<br>Regression supports the following primary metrics<br>spearman_correlation<br>normalized_root_mean_squared_error<br>r2_score<br>normalized_mean_absolute_error<br>normalized_root_mean_squared_log_error|\n",
-    "|**max_time_sec**|Time limit in seconds for each iteration|\n",
-    "|**iterations**|Number of iterations. In each iteration Auto ML trains a specific pipeline with the data|\n",
-    "|**n_cross_validations**|Number of cross validation splits|\n",
-    "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n",
-    "|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]<br>Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers. |\n",
-    "|**enable_ensembling**|Flag to enable an ensembling iteration after all the other iterations complete|\n",
-    "|**ensemble_iterations**|Number of iterations during which we choose a fitted model to be part of the final ensemble|"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "automl_config = AutoMLConfig(task='regression',\n",
-    "                             max_time_sec = 600,\n",
-    "                             iterations = 10,\n",
-    "                             primary_metric = 'spearman_correlation', \n",
-    "                             debug_log = 'automl.log',\n",
-    "                             verbosity = logging.INFO,\n",
-    "                             X = X_train, \n",
-    "                             y = y_train,\n",
-    "                             X_valid = X_test,\n",
-    "                             y_valid = y_test,\n",
-    "                             enable_ensembling = True,\n",
-    "                             ensemble_iterations = 5)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Training the Model\n",
-    "\n",
-    "You can call the submit method on the experiment object and pass the run configuration. For Local runs the execution is synchronous. Depending on the data and number of iterations this can run for a while.\n",
-    "You will see the currently running iterations printing to the console."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "local_run = experiment.submit(automl_config, show_output=True)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "local_run"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Exploring the results"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Widget for monitoring runs\n",
-    "\n",
-    "The widget will sit on \"loading\" until the first iteration completes, then you will see an auto-updating graph and table show up. It refreshes once per minute, so you should see the graph update as child runs complete.\n",
-    "\n",
-    "NOTE: The widget displays a link at the bottom. This links to a web-ui to explore the individual run details."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from azureml.train.widgets import RunDetails\n",
-    "RunDetails(local_run).show() "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "\n",
-    "#### Retrieve All Child Runs\n",
-    "You can also use SDK methods to fetch all the child runs and see individual metrics that we log. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "children = list(local_run.get_children())\n",
-    "metricslist = {}\n",
-    "for run in children:\n",
-    "    properties = run.get_properties()\n",
-    "    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)} \n",
-    "    metricslist[int(properties['iteration'])] = metrics\n",
-    "\n",
-    "rundata = pd.DataFrame(metricslist).sort_index(1)\n",
-    "rundata"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Retrieve the Best Model\n",
-    "\n",
-    "Below we select the best pipeline from our iterations. The *get_output* method on local_run returns the best run and the fitted model for the last *fit* invocation. There are overloads on *get_output* that allow you to retrieve the best run and fitted model for *any* logged metric or a particular *iteration*."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "best_run, fitted_model = local_run.get_output()\n",
-    "print(best_run)\n",
-    "print(fitted_model)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Best Model based on any other metric\n",
-    "Show the run and model that have the smallest `root_mean_squared_error` (which turned out to be the same as the one with largest `spearman_correlation` value):"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "lookup_metric = \"root_mean_squared_error\"\n",
-    "best_run, fitted_model = local_run.get_output(metric=lookup_metric)\n",
-    "print(best_run)\n",
-    "print(fitted_model)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Model from a specific iteration\n",
-    "Simply show the run and model from the 3rd iteration:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "iteration = 3\n",
-    "third_run, third_model = local_run.get_output(iteration = iteration)\n",
-    "print(third_run)\n",
-    "print(third_model)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Register fitted model for deployment"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "description = 'AutoML Model'\n",
-    "tags = None\n",
-    "local_run.register_model(description = description, tags = tags)\n",
-    "print(local_run.model_id) # Use this id to deploy the model as a web service in Azure"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Testing the Fitted Model"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Predict on training and test set, and calculate residual values."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "y_pred_train = fitted_model.predict(X_train)\n",
-    "y_residual_train = y_train - y_pred_train\n",
-    "\n",
-    "y_pred_test = fitted_model.predict(X_test)\n",
-    "y_residual_test = y_test - y_pred_test"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "%matplotlib inline\n",
-    "import matplotlib.pyplot as plt\n",
-    "import numpy as np\n",
-    "from sklearn import datasets\n",
-    "from sklearn.metrics import mean_squared_error, r2_score\n",
-    "\n",
-    "# set up a multi-plot chart\n",
-    "f, (a0, a1) = plt.subplots(1, 2, gridspec_kw = {'width_ratios':[1, 1], 'wspace':0, 'hspace': 0})\n",
-    "f.suptitle('Regression Residual Values', fontsize = 18)\n",
-    "f.set_figheight(6)\n",
-    "f.set_figwidth(16)\n",
-    "\n",
-    "# plot residual values of training set\n",
-    "a0.axis([0, 360, -200, 200])\n",
-    "a0.plot(y_residual_train, 'bo', alpha = 0.5)\n",
-    "a0.plot([-10,360],[0,0], 'r-', lw = 3)\n",
-    "a0.text(16,170,'RMSE = {0:.2f}'.format(np.sqrt(mean_squared_error(y_train, y_pred_train))), fontsize = 12)\n",
-    "a0.text(16,140,'R2 score = {0:.2f}'.format(r2_score(y_train, y_pred_train)), fontsize = 12)\n",
-    "a0.set_xlabel('Training samples', fontsize = 12)\n",
-    "a0.set_ylabel('Residual Values', fontsize = 12)\n",
-    "# plot histogram\n",
-    "a0.hist(y_residual_train, orientation = 'horizontal', color = 'b', bins = 10, histtype = 'step');\n",
-    "a0.hist(y_residual_train, orientation = 'horizontal', color = 'b', alpha = 0.2, bins = 10);\n",
-    "\n",
-    "# plot residual values of test set\n",
-    "a1.axis([0, 90, -200, 200])\n",
-    "a1.plot(y_residual_test, 'bo', alpha = 0.5)\n",
-    "a1.plot([-10,360],[0,0], 'r-', lw = 3)\n",
-    "a1.text(5,170,'RMSE = {0:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred_test))), fontsize = 12)\n",
-    "a1.text(5,140,'R2 score = {0:.2f}'.format(r2_score(y_test, y_pred_test)), fontsize = 12)\n",
-    "a1.set_xlabel('Test samples', fontsize = 12)\n",
-    "a1.set_yticklabels([])\n",
-    "# plot histogram\n",
-    "a1.hist(y_residual_test, orientation = 'horizontal', color = 'b', bins = 10, histtype = 'step');\n",
-    "a1.hist(y_residual_test, orientation = 'horizontal', color = 'b', alpha = 0.2, bins = 10);\n",
-    "\n",
-    "plt.show()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python [default]",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.6.6"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}