{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial (part 2): Use automated machine learning to build your regression model \n",
"\n",
"This tutorial is **part two of a two-part tutorial series**. In the previous tutorial, you [prepared the NYC taxi data for regression modeling](regression-part1-data-prep.ipynb).\n",
"\n",
"Now, you're ready to start building your model with Azure Machine Learning service. In this part of the tutorial, you use the prepared data and automatically generate a regression model to predict taxi fare prices. With the automated ML capabilities of the service, you define your machine learning goals and constraints, launch the automated machine learning process, and then let the algorithm selection and hyperparameter tuning happen for you. The automated ML technique iterates over many combinations of algorithms and hyperparameters until it finds the best model based on your criterion.\n",
"\n",
"In this tutorial, you learn how to:\n",
"\n",
"> * Set up a Python environment and import the SDK packages\n",
"> * Configure an Azure Machine Learning service workspace\n",
"> * Auto-train a regression model \n",
"> * Run the model locally with custom parameters\n",
"> * Explore the results\n",
"> * Register the best model\n",
"\n",
"If you don't have an Azure subscription, create a [free account](https://aka.ms/AMLfree) before you begin. \n",
"\n",
"> Code in this article was tested with Azure Machine Learning SDK version 1.0.0\n",
"\n",
"\n",
"## Prerequisites\n",
"\n",
"> * [Run the data preparation tutorial](regression-part1-data-prep.ipynb)\n",
"\n",
"> * An environment configured for automated machine learning, for example Azure Notebooks, a local Python environment, or a Data Science Virtual Machine. [Set up](https://docs.microsoft.com/azure/machine-learning/service/samples-notebooks) automated machine learning."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import packages\n",
"Import the Python packages you need in this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.core\n",
"import pandas as pd\n",
"from azureml.core.workspace import Workspace\n",
"import logging"
]
},
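{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, you can confirm that the installed SDK version matches the one this notebook was tested with (1.0.0). This is a minimal sketch that only uses the `azureml.core.VERSION` attribute from the package imported above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: print the installed Azure ML SDK version (this notebook was tested with 1.0.0)\n",
"print(\"Azure ML SDK version:\", azureml.core.VERSION)"
]
},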
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure workspace\n",
"\n",
"Create a workspace object from the existing workspace. A `Workspace` is a class that accepts your Azure subscription and resource information, and creates a cloud resource to monitor and track your model runs. `Workspace.from_config()` reads the file **aml_config/config.json** and loads the details into an object named `ws`. `ws` is used throughout the rest of the code in this tutorial.\n",
"\n",
"Once you have a workspace object, specify a name for the experiment and define a local project directory. The history of all runs is recorded under the specified experiment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"# choose a name for the run history container in the workspace\n",
"experiment_name = 'automated-ml-regression'\n",
"# project folder\n",
"project_folder = './automated-ml-regression'\n",
"\n",
"import os\n",
"\n",
"output = {}\n",
"output['SDK version'] = azureml.core.VERSION\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Project Directory'] = project_folder\n",
"pd.set_option('display.max_colwidth', -1)\n",
"outputDf = pd.DataFrame(data = output, index = [''])\n",
"outputDf.T"
]
},
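{
"cell_type": "markdown",
"metadata": {},
"source": [
"If `Workspace.from_config()` fails because **aml_config/config.json** is missing, you can recreate the file from an existing workspace. The following is a minimal sketch and is left commented out; the subscription ID, resource group, and workspace name are placeholders you need to replace with your own values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional fallback: rebuild aml_config/config.json from an existing workspace.\n",
"# Replace the placeholder values with your own subscription, resource group, and workspace name.\n",
"# ws = Workspace.get(name='<workspace-name>',\n",
"#                    subscription_id='<subscription-id>',\n",
"#                    resource_group='<resource-group>')\n",
"# ws.write_config()"
]
},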
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore data\n",
"\n",
"Use the data flow object created in the previous tutorial. Open and run the data flow, and review the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep\n",
"\n",
"file_path = os.path.join(os.getcwd(), \"dflows.dprep\")\n",
"\n",
"package_saved = dprep.Package.open(file_path)\n",
"dflow_prepared = package_saved.dataflows[0]\n",
"dflow_prepared.get_profile()"
]
},
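{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to the column profile, it can help to eyeball a few rows of the prepared data. The following sketch assumes your version of `azureml.dataprep` exposes a `head` method on the data flow; if it doesn't, you can instead call `to_pandas_dataframe()` (used later in this tutorial) and inspect the result."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preview a handful of rows from the prepared data flow\n",
"dflow_prepared.head(5)"
]
},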
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You prepare the data for the experiment by adding columns to `dflow_X` to be the features for model creation. You define `dflow_y` to be the prediction value, cost.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_X = dflow_prepared.keep_columns(['pickup_weekday','pickup_hour', 'distance','passengers', 'vendor'])\n",
"dflow_y = dflow_prepared.keep_columns('cost')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Split data into train and test sets\n",
"\n",
"Now you split the data into training and test sets by using the `train_test_split` function in the `sklearn` library. This function segregates the data into the x (features) data set for model training and the y (values to predict) data set for testing. The `test_size` parameter determines the percentage of data to allocate to testing. The `random_state` parameter sets a seed for the random generator, so that your train-test splits are always deterministic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"\n",
"x_df = dflow_X.to_pandas_dataframe()\n",
"y_df = dflow_y.to_pandas_dataframe()\n",
"\n",
"x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=223)\n",
"# flatten y_train to 1d array\n",
"y_train.values.flatten()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You now have the necessary packages and data ready for automatically training your model. \n",
"\n",
"## Automatically train a model\n",
"\n",
"To automatically train a model:\n",
"1. Define settings for the experiment run\n",
"1. Submit the experiment for model tuning\n",
"\n",
"\n",
"### Define settings for autogeneration and tuning\n",
"\n",
"Define the experiment parameters and model settings for autogeneration and tuning. View the full list of [settings](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train).\n",
"\n",
"\n",
"|Property| Value in this tutorial |Description|\n",
"|----|----|---|\n",
"|**iteration_timeout_minutes**|10|Time limit in minutes for each iteration|\n",
"|**iterations**|30|Number of iterations. In each iteration, the model trains on the data with a specific pipeline|\n",
"|**primary_metric**|spearman_correlation | Metric that you want to optimize.|\n",
"|**preprocess**| True | True enables the experiment to perform preprocessing on the input.|\n",
"|**verbosity**| logging.INFO | Controls the level of logging.|\n",
"|**n_cross_validations**|5|Number of cross-validation splits|\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_settings = {\n",
"    \"iteration_timeout_minutes\" : 10,\n",
"    \"iterations\" : 30,\n",
"    \"primary_metric\" : 'spearman_correlation',\n",
"    \"preprocess\" : True,\n",
"    \"verbosity\" : logging.INFO,\n",
"    \"n_cross_validations\": 5\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"configure automl"
]
},
"outputs": [],
"source": [
"from azureml.train.automl import AutoMLConfig\n",
"\n",
"# local compute \n",
"automated_ml_config = AutoMLConfig(task = 'regression',\n",
"                                   debug_log = 'automated_ml_errors.log',\n",
"                                   path = project_folder,\n",
"                                   X = x_train.values,\n",
"                                   y = y_train.values.flatten(),\n",
"                                   **automl_settings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train the automatic regression model\n",
"\n",
"Start the experiment to run locally. Pass the defined `automated_ml_config` object to the experiment, and set `show_output` to `True` to view progress during the run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"local submitted run",
"automl"
]
},
"outputs": [],
"source": [
"from azureml.core.experiment import Experiment\n",
"experiment = Experiment(ws, experiment_name)\n",
"local_run = experiment.submit(automated_ml_config, show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore the results\n",
"\n",
"Explore the results of automatic training with a Jupyter widget or by examining the experiment history.\n",
"\n",
"### Option 1: Add a Jupyter widget to see results\n",
"\n",
"Use the Jupyter notebook widget to see a graph and a table of all results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"use notebook widget"
]
},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(local_run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option 2: Get and examine all run iterations in Python\n",
"\n",
"Alternatively, you can retrieve the history of each experiment and explore the individual metrics for each iteration run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"get metrics",
"query history"
]
},
"outputs": [],
"source": [
"children = list(local_run.get_children())\n",
"metricslist = {}\n",
"for run in children:\n",
"    properties = run.get_properties()\n",
"    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
"    metricslist[int(properties['iteration'])] = metrics\n",
"\n",
"rundata = pd.DataFrame(metricslist).sort_index(axis=1)\n",
"rundata"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieve the best model\n",
"\n",
"Select the best pipeline from your iterations. The `get_output` method on `local_run` returns the best run and the fitted model for the last fit invocation. There are overloads on `get_output` that allow you to retrieve the best run and fitted model for any logged metric or a particular iteration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run, fitted_model = local_run.get_output()\n",
"print(best_run)\n",
"print(fitted_model)"
]
},
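{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted above, `get_output` also has overloads that select a specific result rather than the overall best one. The sketch below assumes the `iteration` and `metric` parameters of `get_output`; the iteration number and the metric name (`r2_score`) are only examples, so the calls are left commented out."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve the run and fitted model from a specific iteration (3 is an arbitrary example)\n",
"# iter_run, iter_model = local_run.get_output(iteration=3)\n",
"\n",
"# Retrieve the run and fitted model that scored best on a particular logged metric\n",
"# metric_run, metric_model = local_run.get_output(metric='r2_score')"
]
},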
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Register the model\n",
"\n",
"Register the model in your Azure Machine Learning Workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"description = 'Automated Machine Learning Model'\n",
"tags = None\n",
"local_run.register_model(description=description, tags=tags)\n",
"print(local_run.model_id) # Use this id to deploy the model as a web service in Azure"
]
},
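{
"cell_type": "markdown",
"metadata": {},
"source": [
"To confirm the registration, you can optionally list the models registered in the workspace. This quick check assumes the `Model` class from `azureml.core.model`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.model import Model\n",
"\n",
"# Optional: list registered models in the workspace with their versions\n",
"for m in Model.list(ws):\n",
"    print(m.name, m.version)"
]
},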
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test the best model accuracy\n",
"\n",
"Use the best model to run predictions on the test data set. The `predict` function uses the best model and predicts the values of y (trip cost) from the `x_test` data set. Print the first 10 predicted cost values from `y_predict`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_predict = fitted_model.predict(x_test.values)\n",
"print(y_predict[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a scatter plot to visualize the predicted cost values compared to the actual cost values. The following code uses the `distance` feature as the x-axis, and trip `cost` as the y-axis. The first 100 predicted and actual cost values are created as separate series, in order to compare the variance of predicted cost at each trip distance value. Examining the plot shows that the distance/cost relationship is nearly linear, and the predicted cost values are in most cases very close to the actual cost values for the same trip distance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"fig = plt.figure(figsize=(14, 10))\n",
"ax1 = fig.add_subplot(111)\n",
"\n",
"distance_vals = [x[4] for x in x_test.values]\n",
"y_actual = y_test.values.flatten().tolist()\n",
"\n",
"ax1.scatter(distance_vals[:100], y_predict[:100], s=18, c='b', marker=\"s\", label='Predicted')\n",
"ax1.scatter(distance_vals[:100], y_actual[:100], s=18, c='r', marker=\"o\", label='Actual')\n",
"\n",
"ax1.set_xlabel('distance (mi)')\n",
"ax1.set_title('Predicted and Actual Cost/Distance')\n",
"ax1.set_ylabel('Cost ($)')\n",
"\n",
"plt.legend(loc='upper left', prop={'size': 12})\n",
"plt.rcParams.update({'font.size': 14})\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate the root-mean-squared error (RMSE) of the results. Use the `y_test` dataframe, and convert it to a list to compare to the predicted values. The function `mean_squared_error` takes two arrays of values, and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable (cost), and indicates roughly how far your predictions are from the actual value. "
]
},
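{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the quantity computed in the next cell is\n",
"\n",
"$$\\mathrm{RMSE} = \\sqrt{\\frac{1}{n}\\sum_{i=1}^{n}\\left(y_i - \\hat{y}_i\\right)^2}$$\n",
"\n",
"where $y_i$ are the actual costs, $\\hat{y}_i$ are the predicted costs, and $n$ is the number of test samples."
]
},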
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"from math import sqrt\n",
"\n",
"rmse = sqrt(mean_squared_error(y_actual, y_predict))\n",
"rmse"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following code to calculate MAPE (mean absolute percent error) using the full `y_actual` and `y_predict` data sets. This metric calculates an absolute difference between each predicted and actual value, sums all the differences, and then expresses that sum as a percent of the total of the actual values."
]
},
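{
"cell_type": "markdown",
"metadata": {},
"source": [
"In other words, the next cell computes\n",
"\n",
"$$\\mathrm{MAPE} = \\frac{\\sum_{i=1}^{n}\\left|y_i - \\hat{y}_i\\right|}{\\sum_{i=1}^{n} y_i}$$\n",
"\n",
"and reports $1 - \\mathrm{MAPE}$ as the model accuracy."
]
},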
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sum_actuals = sum_errors = 0\n",
"\n",
"for actual_val, predict_val in zip(y_actual, y_predict):\n",
"    # accumulate the absolute error and the actual value for each test sample\n",
"    abs_error = actual_val - predict_val\n",
"    if abs_error < 0:\n",
"        abs_error = abs_error * -1\n",
"\n",
"    sum_errors = sum_errors + abs_error\n",
"    sum_actuals = sum_actuals + actual_val\n",
"\n",
"mean_abs_percent_error = sum_errors / sum_actuals\n",
"print(\"Model MAPE:\")\n",
"print(mean_abs_percent_error)\n",
"print()\n",
"print(\"Model Accuracy:\")\n",
"print(1 - mean_abs_percent_error)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"In this automated machine learning tutorial, you:\n",
"\n",
"\n",
"> * Configured a workspace and prepared data for an experiment\n",
"> * Trained a regression model locally using automated machine learning with custom parameters\n",
"> * Explored and reviewed training results\n",
"> * Registered the best model\n",
"\n",
"[Deploy your model](02.deploy-models.ipynb) with Azure Machine Learning."
]
}
],
"metadata": {
"authors": [
{
"name": "jeffshep"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"msauthor": "sgilley"
},
"nbformat": 4,
"nbformat_minor": 2
}