{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial: Use automated machine learning to build your regression model\n",
"\n",
"This tutorial is **part two of a two-part tutorial series**. In the previous tutorial, you [prepared the NYC taxi data for regression modeling](regression-part1-data-prep.ipynb).\n",
"\n",
"Now you're ready to start building your model with the Azure Machine Learning service. In this part of the tutorial, you use the prepared data and automatically generate a regression model to predict taxi fare prices. With the automated machine learning capabilities of the service, you define your machine learning goals and constraints, launch the automated machine learning process, and then let algorithm selection and hyperparameter tuning happen for you. The automated machine learning technique iterates over many combinations of algorithms and hyperparameters until it finds the best model based on your criteria.\n",
"\n",
"In this tutorial, you learn the following tasks:\n",
"\n",
"> * Set up a Python environment and import the SDK packages\n",
"> * Configure an Azure Machine Learning service workspace\n",
"> * Auto-train a regression model\n",
"> * Run the model locally with custom parameters\n",
"> * Explore the results\n",
"\n",
"If you do not have an Azure subscription, create a [free account](https://aka.ms/AMLfree) before you begin.\n",
"\n",
"> The code in this article was tested with Azure Machine Learning SDK version 1.0.0.\n",
"\n",
"\n",
"## Prerequisites\n",
"\n",
"To run the notebook, you will need:\n",
"\n",
"* To have completed the [data preparation tutorial](regression-part1-data-prep.ipynb).\n",
"* A Python 3.6 notebook server with the following installed:\n",
"  * The Azure Machine Learning SDK for Python with `automl` and `notebooks` extras\n",
"  * `matplotlib`\n",
"* The tutorial notebook\n",
"* A machine learning workspace\n",
"* The configuration file for the workspace in the same directory as the notebook\n",
"\n",
"Navigate back to the [tutorial page](https://docs.microsoft.com/azure/machine-learning/service/tutorial-auto-train-models) for specific environment setup instructions.\n",
"\n",
"## <a name=\"start\"></a>Set up your development environment\n",
"\n",
"All the setup for your development work can be accomplished in a Python notebook. Setup includes the following actions:\n",
"\n",
"* Install the SDK\n",
"* Import Python packages\n",
"* Configure your workspace"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install and import packages\n",
"\n",
"If you are following the tutorial in your own Python environment, use the following to install necessary packages.\n",
"\n",
"```shell\n",
"pip install azureml-sdk[automl,notebooks] matplotlib\n",
"```\n",
"\n",
"Import the Python packages you need in this tutorial:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.core\n",
"import pandas as pd\n",
"from azureml.core.workspace import Workspace\n",
"import logging\n",
"import os"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure workspace\n",
"\n",
"Create a workspace object from the existing workspace. A `Workspace` is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs.\n",
"\n",
"`Workspace.from_config()` reads the file **aml_config/config.json** and loads the details into an object named `ws`. `ws` is used throughout the rest of the code in this tutorial.\n",
"\n",
"After you have a workspace object, specify a name for the experiment. Create and register a local directory with the workspace. The history of all runs is recorded under the specified experiment and in the [Azure portal](https://portal.azure.com)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"# choose a name for the run history container in the workspace\n",
"experiment_name = 'automated-ml-regression'\n",
"# project folder\n",
"project_folder = './automated-ml-regression'\n",
"\n",
"output = {}\n",
"output['SDK version'] = azureml.core.VERSION\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Project Directory'] = project_folder\n",
"pd.set_option('display.max_colwidth', -1)\n",
"outputDf = pd.DataFrame(data = output, index = [''])\n",
"outputDf.T"
]
},
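{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can make sure the project directory exists on disk before it is referenced later in the tutorial. This small convenience sketch is not part of the original flow; it only uses the `os` module imported earlier.\n",
"\n",
"```python\n",
"# Optional: create the project folder if it does not exist yet.\n",
"os.makedirs(project_folder, exist_ok=True)\n",
"```"
]
},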
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore data\n",
"\n",
"Use the data flow object created in the previous tutorial. To summarize, part 1 of this tutorial cleaned the NYC Taxi data so it could be used in a machine learning model. Now, you use various features from the data set and allow an automated model to build relationships between the features and the price of a taxi trip. Open and run the data flow and review the results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep\n",
"\n",
"file_path = os.path.join(os.getcwd(), \"dflows.dprep\")\n",
"\n",
"dflow_prepared = dprep.Dataflow.open(file_path)\n",
"dflow_prepared.get_profile()"
]
},
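{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you also want to look at a few raw rows rather than the column profile, a quick optional sketch is shown below. It assumes the `head` method of the data prep `Dataflow` class is available in your installed version and returns a pandas DataFrame.\n",
"\n",
"```python\n",
"# Optional: preview the first five rows of the prepared data as a pandas DataFrame.\n",
"# Assumes Dataflow.head() is available in your azureml-dataprep version.\n",
"dflow_prepared.head(5)\n",
"```"
]
},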
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Prepare the data for the experiment by creating `dflow_X` with the columns to use as features for model creation. Define `dflow_y` as the value to predict, **cost**:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_X = dflow_prepared.keep_columns(['pickup_weekday','pickup_hour', 'distance','passengers', 'vendor'])\n",
"dflow_y = dflow_prepared.keep_columns('cost')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Split data into train and test sets\n",
"\n",
"Now split the data into training and test sets by using the `train_test_split` function from the `sklearn` library. This function splits the x (**features**) data and the y (**values to predict**) data into one set for model training and one set for testing. The `test_size` parameter determines the percentage of data to allocate to testing. The `random_state` parameter seeds the random generator, so that your train-test splits are always deterministic:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"\n",
"x_df = dflow_X.to_pandas_dataframe()\n",
"y_df = dflow_y.to_pandas_dataframe()\n",
"\n",
"x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=223)\n",
"# flatten y_train to 1d array\n",
"y_train.values.flatten()"
]
},
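{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to confirm that roughly 80 percent of the rows went to training and 20 percent to testing, an optional sanity check such as the following sketch can help.\n",
"\n",
"```python\n",
"# Optional sanity check (not part of the original tutorial): confirm the 80/20 split.\n",
"print('Training rows:', x_train.shape[0])\n",
"print('Test rows:', x_test.shape[0])\n",
"```"
]
},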
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The purpose of this step is to have data points to test the finished model that haven't been used to train the model, in order to measure true accuracy. In other words, a well-trained model should be able to accurately make predictions from data it hasn't already seen. You now have the necessary packages and data ready for autotraining your model.\n",
"\n",
"## Automatically train a model\n",
"\n",
"To automatically train a model, take the following steps:\n",
"1. Define settings for the experiment run. Attach your training data to the configuration, and modify settings that control the training process.\n",
"1. Submit the experiment for model tuning. After submitting the experiment, the process iterates through different machine learning algorithms and hyperparameter settings, adhering to your defined constraints. It chooses the best-fit model by optimizing an accuracy metric.\n",
"\n",
"\n",
"### Define settings for autogeneration and tuning\n",
"\n",
"Define the experiment parameters and model settings for autogeneration and tuning. View the full list of [settings](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train). Submitting the experiment with these default settings takes approximately 10-15 minutes; if you want a shorter run time, reduce either `iterations` or `iteration_timeout_minutes`.\n",
"\n",
"\n",
"|Property| Value in this tutorial |Description|\n",
"|----|----|---|\n",
"|**iteration_timeout_minutes**|10|Time limit in minutes for each iteration. Reduce this value to decrease total runtime.|\n",
"|**iterations**|30|Number of iterations. In each iteration, a new machine learning model is trained with your data. This is the primary value that affects total run time.|\n",
"|**primary_metric**|spearman_correlation | Metric that you want to optimize. The best-fit model will be chosen based on this metric.|\n",
"|**preprocess**| True | By using **True**, the experiment can preprocess the input data (handling missing data, converting text to numeric, etc.)|\n",
"|**verbosity**| logging.INFO | Controls the level of logging.|\n",
"|**n_cross_validations**|5|Number of cross-validation splits to perform when validation data is not specified.|\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_settings = {\n",
"    \"iteration_timeout_minutes\" : 10,\n",
"    \"iterations\" : 30,\n",
"    \"primary_metric\" : 'spearman_correlation',\n",
"    \"preprocess\" : True,\n",
"    \"verbosity\" : logging.INFO,\n",
"    \"n_cross_validations\": 5\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use your defined training settings as a parameter to an `AutoMLConfig` object. Additionally, specify your training data and the type of model, which is `regression` in this case."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"configure automl"
]
},
"outputs": [],
"source": [
"from azureml.train.automl import AutoMLConfig\n",
"\n",
"# local compute\n",
"automated_ml_config = AutoMLConfig(task = 'regression',\n",
"                                   debug_log = 'automated_ml_errors.log',\n",
"                                   path = project_folder,\n",
"                                   X = x_train.values,\n",
"                                   y = y_train.values.flatten(),\n",
"                                   **automl_settings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train the automatic regression model\n",
"\n",
"Start the experiment to run locally. Pass the defined `automated_ml_config` object to the experiment. Set the output to `True` to view progress during the experiment:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"local submitted run",
"automl"
]
},
"outputs": [],
"source": [
"from azureml.core.experiment import Experiment\n",
"experiment=Experiment(ws, experiment_name)\n",
"local_run = experiment.submit(automated_ml_config, show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output shown updates live as the experiment runs. For each iteration, you see the model type, the run duration, and the training accuracy. The field `BEST` tracks the best running training score based on your metric type."
]
},
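{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to submit without streaming output (`show_output=False`), you can block until the run finishes before exploring results. This optional sketch uses the standard `wait_for_completion` and `get_status` methods available on Azure Machine Learning run objects.\n",
"\n",
"```python\n",
"# Optional: wait for the automated ML run to finish and print its final status.\n",
"local_run.wait_for_completion(show_output=False)\n",
"print(local_run.get_status())\n",
"```"
]
},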
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore the results\n",
"\n",
"Explore the results of automatic training with a Jupyter widget or by examining the experiment history.\n",
"\n",
"### Option 1: Add a Jupyter widget to see results\n",
"\n",
"If you use a Jupyter notebook, use this Jupyter notebook widget to see a graph and a table of all results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"use notebook widget"
]
},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(local_run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option 2: Get and examine all run iterations in Python\n",
"\n",
"You can also retrieve the history of the experiment and explore the individual metrics for each iteration run. By examining RMSE (root_mean_squared_error) for each individual model run, you see that most iterations are predicting the taxi fare cost within a reasonable margin ($3-4).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"get metrics",
"query history"
]
},
"outputs": [],
"source": [
"children = list(local_run.get_children())\n",
"metricslist = {}\n",
"for run in children:\n",
"    properties = run.get_properties()\n",
"    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
"    metricslist[int(properties['iteration'])] = metrics\n",
"\n",
"rundata = pd.DataFrame(metricslist).sort_index(axis=1)\n",
"rundata"
]
},
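{
"cell_type": "markdown",
"metadata": {},
"source": [
"To focus on just the RMSE values mentioned above, you can select that single metric row from `rundata`. This optional sketch assumes `root_mean_squared_error` appears among the logged metric names.\n",
"\n",
"```python\n",
"# Optional: pull out one metric (RMSE) across all iterations.\n",
"# Assumes 'root_mean_squared_error' is among the logged metrics.\n",
"rundata.loc['root_mean_squared_error']\n",
"```"
]
},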
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieve the best model\n",
"\n",
"Select the best pipeline from your iterations. The `get_output` method on the run returns the best run and the fitted model for the last fit invocation. By using the overloads on `get_output`, you can retrieve the best run and fitted model for any logged metric or a particular iteration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run, fitted_model = local_run.get_output()\n",
"print(best_run)\n",
"print(fitted_model)"
]
},
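{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above uses the default overload. As an optional sketch of the other overloads mentioned earlier, you can ask for the best run against a specific logged metric, or for the model from a particular iteration. The `metric` and `iteration` parameter names follow the automated ML SDK; confirm them against your installed version.\n",
"\n",
"```python\n",
"# Optional sketch: other get_output overloads (verify parameter names for your SDK version).\n",
"metric_run, metric_model = local_run.get_output(metric='root_mean_squared_error')\n",
"iteration_run, iteration_model = local_run.get_output(iteration=3)  # for example, iteration 3\n",
"print(metric_run)\n",
"print(iteration_run)\n",
"```"
]
},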
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test the best model accuracy\n",
"\n",
"Use the best model to run predictions on the test dataset to predict taxi fares. The function `predict` uses the best model and predicts the values of y, **trip cost**, from the `x_test` dataset. Print the first 10 predicted cost values from `y_predict`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_predict = fitted_model.predict(x_test.values)\n",
"print(y_predict[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a scatter plot to visualize the predicted cost values compared to the actual cost values. The following code uses the `distance` feature as the x-axis and trip `cost` as the y-axis. To compare the variance of predicted cost at each trip distance value, the first 100 predicted and actual cost values are created as separate series. Examining the plot shows that the distance/cost relationship is nearly linear, and the predicted cost values are in most cases very close to the actual cost values for the same trip distance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"fig = plt.figure(figsize=(14, 10))\n",
"ax1 = fig.add_subplot(111)\n",
"\n",
"distance_vals = [x[4] for x in x_test.values]\n",
"y_actual = y_test.values.flatten().tolist()\n",
"\n",
"ax1.scatter(distance_vals[:100], y_predict[:100], s=18, c='b', marker=\"s\", label='Predicted')\n",
"ax1.scatter(distance_vals[:100], y_actual[:100], s=18, c='r', marker=\"o\", label='Actual')\n",
"\n",
"ax1.set_xlabel('distance (mi)')\n",
"ax1.set_title('Predicted and Actual Cost/Distance')\n",
"ax1.set_ylabel('Cost ($)')\n",
"\n",
"plt.legend(loc='upper left', prop={'size': 12})\n",
"plt.rcParams.update({'font.size': 14})\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate the root mean squared error (RMSE) of the results. Use the `y_test` dataframe, already converted to the list `y_actual` in the previous cell, to compare against the predicted values. The function `mean_squared_error` takes two arrays of values and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable, **cost**. It indicates roughly how far the taxi fare predictions are from the actual fares:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"from math import sqrt\n",
"\n",
"rmse = sqrt(mean_squared_error(y_actual, y_predict))\n",
"rmse"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following code to calculate mean absolute percent error (MAPE) by using the full `y_actual` and `y_predict` datasets. This metric calculates an absolute difference between each predicted and actual value and sums all the differences. Then it expresses that sum as a percent of the total of the actual values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sum_actuals = sum_errors = 0\n",
"\n",
"for actual_val, predict_val in zip(y_actual, y_predict):\n",
"    abs_error = actual_val - predict_val\n",
"    if abs_error < 0:\n",
"        abs_error = abs_error * -1\n",
"\n",
"    sum_errors = sum_errors + abs_error\n",
"    sum_actuals = sum_actuals + actual_val\n",
"\n",
"mean_abs_percent_error = sum_errors / sum_actuals\n",
"print(\"Model MAPE:\")\n",
"print(mean_abs_percent_error)\n",
"print()\n",
"print(\"Model Accuracy:\")\n",
"print(1 - mean_abs_percent_error)"
]
},
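{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional cross-check of the loop above, the same MAPE value can be computed in vectorized form with NumPy. This sketch is not part of the original tutorial; it assumes NumPy is available (it is installed as a dependency of pandas and scikit-learn).\n",
"\n",
"```python\n",
"# Optional cross-check: the same MAPE computed with NumPy instead of a Python loop.\n",
"import numpy as np\n",
"\n",
"y_actual_arr = np.array(y_actual)\n",
"mape_check = np.abs(y_actual_arr - y_predict).sum() / y_actual_arr.sum()\n",
"print('Vectorized MAPE:', mape_check)\n",
"```"
]
},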
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the final prediction accuracy metrics, you see that the model is fairly good at predicting taxi fares from the data set's features, typically within ±$3.00. The traditional machine learning model development process is highly resource-intensive and requires significant domain knowledge and time investment to run and compare the results of dozens of models. Using automated machine learning is a great way to rapidly test many different models for your scenario."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean up resources\n",
"\n",
"> The resources you created can be used as prerequisites to other Azure Machine Learning service tutorials and how-to articles.\n",
"\n",
"\n",
"If you do not plan to use the resources you created, delete them so you do not incur any charges:\n",
"\n",
"1. In the Azure portal, select **Resource groups** on the far left.\n",
"\n",
"1. From the list, select the resource group you created.\n",
"\n",
"1. Select **Delete resource group**.\n",
"\n",
"1. Enter the resource group name. Then select **Delete**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"In this automated machine learning tutorial, you did the following tasks:\n",
"\n",
"* Configured a workspace and prepared data for an experiment.\n",
"* Trained a regression model locally by using automated machine learning with custom parameters.\n",
"* Explored and reviewed training results.\n",
"\n",
"[Deploy your model](https://docs.microsoft.com/azure/machine-learning/service/tutorial-deploy-models-with-aml) with Azure Machine Learning."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
}
],
"metadata": {
"authors": [
{
"name": "jeffshep"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"msauthor": "sgilley"
},
"nbformat": 4,
"nbformat_minor": 2
}