{
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3.6",
      "name": "python36",
      "language": "python"
    },
    "authors": [
      {
        "name": "savitam"
      }
    ],
    "language_info": {
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "name": "python",
      "file_extension": ".py",
      "nbconvert_exporter": "python",
      "version": "3.6.6"
    }
  },
  "nbformat": 4,
  "cells": [
    {
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
        "\n",
        "Licensed under the MIT License."
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "# Automated Machine Learning\n",
        "_**Regression with Local Compute**_\n",
        "\n",
        "## Contents\n",
        "1. [Introduction](#Introduction)\n",
        "1. [Setup](#Setup)\n",
        "1. [Data](#Data)\n",
        "1. [Train](#Train)\n",
        "1. [Results](#Results)\n",
        "1. [Test](#Test)\n"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "## Introduction\n",
        "In this example we use scikit-learn's [diabetes dataset](http://scikit-learn.org/stable/datasets/index.html#diabetes-dataset) to showcase how you can use AutoML for a simple regression problem.\n",
        "\n",
        "Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n",
        "\n",
        "In this notebook you will learn how to:\n",
        "1. Create an `Experiment` in an existing `Workspace`.\n",
        "2. Configure AutoML using `AutoMLConfig`.\n",
        "3. Train the model using local compute.\n",
        "4. Explore the results.\n",
        "5. Test the best fitted model."
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "## Setup\n",
        "\n",
        "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "outputs": [],
      "execution_count": null,
      "source": [
        "import logging\n",
        "\n",
        "from matplotlib import pyplot as plt\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "\n",
        "import azureml.core\n",
        "from azureml.core.experiment import Experiment\n",
        "from azureml.core.workspace import Workspace\n",
        "from azureml.train.automl import AutoMLConfig"
      ],
      "cell_type": "code"
    },
    {
      "metadata": {},
      "outputs": [],
      "execution_count": null,
      "source": [
        "ws = Workspace.from_config()\n",
        "\n",
        "# Choose a name for the experiment and specify the project folder.\n",
        "experiment_name = 'automl-local-regression'\n",
        "project_folder = './sample_projects/automl-local-regression'\n",
        "\n",
        "experiment = Experiment(ws, experiment_name)\n",
        "\n",
        "output = {}\n",
        "output['SDK version'] = azureml.core.VERSION\n",
        "output['Subscription ID'] = ws.subscription_id\n",
        "output['Workspace Name'] = ws.name\n",
        "output['Resource Group'] = ws.resource_group\n",
        "output['Location'] = ws.location\n",
        "output['Project Directory'] = project_folder\n",
        "output['Experiment Name'] = experiment.name\n",
        "pd.set_option('display.max_colwidth', -1)\n",
        "outputDf = pd.DataFrame(data = output, index = [''])\n",
        "outputDf.T"
      ],
      "cell_type": "code"
    },
    {
      "metadata": {},
      "source": [
        "## Data\n",
        "This example loads the data with scikit-learn's [load_diabetes](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html) function."
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "outputs": [],
      "execution_count": null,
      "source": [
        "# Load the diabetes dataset, a well-known built-in small dataset that comes with scikit-learn.\n",
        "from sklearn.datasets import load_diabetes\n",
        "from sklearn.model_selection import train_test_split\n",
        "\n",
        "X, y = load_diabetes(return_X_y = True)\n",
        "\n",
        "columns = ['age', 'gender', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']\n",
        "\n",
        "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)"
      ],
      "cell_type": "code"
    },
    {
      "metadata": {},
      "source": [
        "## Train\n",
        "\n",
        "Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.\n",
        "\n",
        "|Property|Description|\n",
        "|-|-|\n",
        "|**task**|classification or regression|\n",
        "|**primary_metric**|This is the metric that you want to optimize. Regression supports the following primary metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|\n",
        "|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|\n",
        "|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n",
        "|**n_cross_validations**|Number of cross validation splits.|\n",
        "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n",
        "|**y**|(sparse) array-like, shape = [n_samples, ], target values.|\n",
        "|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "outputs": [],
      "execution_count": null,
      "source": [
        "automl_config = AutoMLConfig(task = 'regression',\n",
        "                             iteration_timeout_minutes = 10,\n",
        "                             iterations = 10,\n",
        "                             primary_metric = 'spearman_correlation',\n",
        "                             n_cross_validations = 5,\n",
        "                             debug_log = 'automl.log',\n",
        "                             verbosity = logging.INFO,\n",
        "                             X = X_train,\n",
        "                             y = y_train,\n",
        "                             path = project_folder)"
      ],
      "cell_type": "code"
    },
    {
      "metadata": {},
      "source": [
        "Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.\n",
        "In this example, we specify `show_output = True` to print currently running iterations to the console."
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "outputs": [],
      "execution_count": null,
      "source": [
        "local_run = experiment.submit(automl_config, show_output = True)"
      ],
      "cell_type": "code"
    },
    {
      "metadata": {},
      "outputs": [],
      "execution_count": null,
      "source": [
        "local_run"
      ],
      "cell_type": "code"
    },
    {
      "metadata": {},
      "source": [
        "## Results"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "### Widget for Monitoring Runs\n",
        "\n",
        "The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n",
        "\n",
        "**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details."
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "outputs": [],
      "execution_count": null,
      "source": [
        "from azureml.widgets import RunDetails\n",
        "RunDetails(local_run).show()"
      ],
      "cell_type": "code"
    },
    {
      "metadata": {},
      "source": [
        "### Retrieve All Child Runs\n",
        "You can also use SDK methods to fetch all the child runs and see individual metrics that we log."
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "outputs": [],
      "execution_count": null,
      "source": [
        "children = list(local_run.get_children())\n",
        "metricslist = {}\n",
        "for run in children:\n",
        "    properties = run.get_properties()\n",
        "    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
        "    metricslist[int(properties['iteration'])] = metrics\n",
        "\n",
        "rundata = pd.DataFrame(metricslist).sort_index(axis = 1)\n",
        "rundata"
      ],
      "cell_type": "code"
    },
    {
      "metadata": {},
      "source": [
        "### Retrieve the Best Model\n",
        "\n",
        "Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*."
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "outputs": [],
      "execution_count": null,
      "source": [
        "best_run, fitted_model = local_run.get_output()\n",
        "print(best_run)\n",
        "print(fitted_model)"
      ],
      "cell_type": "code"
    },
    {
      "metadata": {},
      "source": [
        "#### Best Model Based on Any Other Metric\n",
        "Show the run and the model with the smallest `root_mean_squared_error` value (which turned out to be the same as the one with the largest `spearman_correlation` value):"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "outputs": [],
      "execution_count": null,
      "source": [
        "lookup_metric = \"root_mean_squared_error\"\n",
        "best_run, fitted_model = local_run.get_output(metric = lookup_metric)\n",
        "print(best_run)\n",
        "print(fitted_model)"
      ],
      "cell_type": "code"
    },
    {
      "metadata": {},
      "source": [
        "#### Model from a Specific Iteration\n",
        "Show the run and the model from the third iteration:"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "outputs": [],
      "execution_count": null,
      "source": [
        "iteration = 3\n",
        "third_run, third_model = local_run.get_output(iteration = iteration)\n",
        "print(third_run)\n",
        "print(third_model)"
      ],
      "cell_type": "code"
    },
    {
      "metadata": {},
      "source": [
        "## Test"
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "source": [
        "Predict on the training and test sets, and calculate the residual values."
      ],
      "cell_type": "markdown"
    },
    {
      "metadata": {},
      "outputs": [],
      "execution_count": null,
      "source": [
        "y_pred_train = fitted_model.predict(X_train)\n",
        "y_residual_train = y_train - y_pred_train\n",
        "\n",
        "y_pred_test = fitted_model.predict(X_test)\n",
        "y_residual_test = y_test - y_pred_test"
      ],
      "cell_type": "code"
    },
    {
      "metadata": {},
      "outputs": [],
      "execution_count": null,
      "source": [
        "%matplotlib inline\n",
        "from sklearn.metrics import mean_squared_error, r2_score\n",
        "\n",
        "# Set up a multi-plot chart.\n",
        "f, (a0, a1) = plt.subplots(1, 2, gridspec_kw = {'width_ratios':[1, 1], 'wspace':0, 'hspace': 0})\n",
        "f.suptitle('Regression Residual Values', fontsize = 18)\n",
        "f.set_figheight(6)\n",
        "f.set_figwidth(16)\n",
        "\n",
        "# Plot residual values of training set.\n",
        "a0.axis([0, 360, -200, 200])\n",
        "a0.plot(y_residual_train, 'bo', alpha = 0.5)\n",
        "a0.plot([-10,360],[0,0], 'r-', lw = 3)\n",
        "a0.text(16,170,'RMSE = {0:.2f}'.format(np.sqrt(mean_squared_error(y_train, y_pred_train))), fontsize = 12)\n",
        "a0.text(16,140,'R2 score = {0:.2f}'.format(r2_score(y_train, y_pred_train)), fontsize = 12)\n",
        "a0.set_xlabel('Training samples', fontsize = 12)\n",
        "a0.set_ylabel('Residual Values', fontsize = 12)\n",
        "\n",
        "# Plot a histogram.\n",
        "a0.hist(y_residual_train, orientation = 'horizontal', color = 'b', bins = 10, histtype = 'step')\n",
        "a0.hist(y_residual_train, orientation = 'horizontal', color = 'b', alpha = 0.2, bins = 10)\n",
        "\n",
        "# Plot residual values of test set.\n",
        "a1.axis([0, 90, -200, 200])\n",
        "a1.plot(y_residual_test, 'bo', alpha = 0.5)\n",
        "a1.plot([-10,360],[0,0], 'r-', lw = 3)\n",
        "a1.text(5,170,'RMSE = {0:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred_test))), fontsize = 12)\n",
        "a1.text(5,140,'R2 score = {0:.2f}'.format(r2_score(y_test, y_pred_test)), fontsize = 12)\n",
        "a1.set_xlabel('Test samples', fontsize = 12)\n",
        "a1.set_yticklabels([])\n",
        "\n",
        "# Plot a histogram.\n",
        "a1.hist(y_residual_test, orientation = 'horizontal', color = 'b', bins = 10, histtype = 'step')\n",
        "a1.hist(y_residual_test, orientation = 'horizontal', color = 'b', alpha = 0.2, bins = 10)\n",
        "\n",
        "plt.show()"
      ],
      "cell_type": "code"
    }
  ],
  "nbformat_minor": 2
}