{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
    "\n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Automated Machine Learning\n",
    "\n",
    "_**Forecasting with grouping using Pipelines**_\n",
    "\n",
    "## Contents\n",
    "\n",
    "1. [Introduction](#Introduction)\n",
    "2. [Setup](#Setup)\n",
    "3. [Data](#Data)\n",
    "4. [Compute](#Compute)\n",
    "5. [AutoMLConfig](#AutoMLConfig)\n",
    "6. [Pipeline](#Pipeline)\n",
    "7. [Train](#Train)\n",
    "8. [Test](#Test)\n",
    "\n",
    "\n",
    "## Introduction\n",
    "In this example we use Automated ML and Pipelines to train, select, and operationalize forecasting models for multiple time series.\n",
    "\n",
    "If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) first if you haven't already to establish your connection to the AzureML Workspace.\n",
    "\n",
    "In this notebook you will learn how to:\n",
    "\n",
    "* Create an Experiment in an existing Workspace.\n",
    "* Configure AutoML using AutoMLConfig.\n",
    "* Use our helper script to generate pipeline steps to split, train, and deploy the models.\n",
    "* Explore the results.\n",
    "* Test the models.\n",
    "\n",
    "Because the groups train concurrently, we advise making sure your cluster has at least one node per group.\n",
    "\n",
    "An Enterprise workspace is required for this notebook. To learn more about creating an Enterprise workspace or upgrading to an Enterprise workspace from the Azure portal, please visit our [Workspace page](https://docs.microsoft.com/azure/machine-learning/service/concept-workspace#upgrade).\n",
    "\n",
    "## Setup\n",
    "As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import logging\n",
    "import warnings\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import azureml.core\n",
    "\n",
    "from azureml.core.workspace import Workspace\n",
    "from azureml.core.experiment import Experiment\n",
    "from azureml.train.automl import AutoMLConfig"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accessing the Azure ML workspace requires authentication with Azure.\n",
    "\n",
    "The default authentication is interactive authentication using the default tenant. Executing the `ws = Workspace.from_config()` line in the cell below will prompt for authentication the first time that it is run.\n",
    "\n",
    "If you have multiple Azure tenants, you can specify the tenant by replacing the `ws = Workspace.from_config()` line in the cell below with the following:\n",
    "```\n",
    "from azureml.core.authentication import InteractiveLoginAuthentication\n",
    "auth = InteractiveLoginAuthentication(tenant_id = 'mytenantid')\n",
    "ws = Workspace.from_config(auth = auth)\n",
    "```\n",
    "If you need to run in an environment where interactive login is not possible, you can use Service Principal authentication by replacing the `ws = Workspace.from_config()` line in the cell below with the following:\n",
    "```\n",
    "from azureml.core.authentication import ServicePrincipalAuthentication\n",
    "auth = ServicePrincipalAuthentication('mytenantid', 'myappid', 'mypassword')\n",
    "ws = Workspace.from_config(auth = auth)\n",
    "```\n",
    "For more details, see [aka.ms/aml-notebook-auth](https://aka.ms/aml-notebook-auth)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ws = Workspace.from_config()\n",
    "ds = ws.get_default_datastore()\n",
    "\n",
    "# Choose a name for the run history container in the workspace.\n",
    "experiment_name = 'automl-grouping-oj'\n",
    "# Project folder\n",
    "project_folder = './sample_projects/{}'.format(experiment_name)\n",
    "\n",
    "experiment = Experiment(ws, experiment_name)\n",
    "\n",
    "output = {}\n",
    "output['SDK version'] = azureml.core.VERSION\n",
    "output['Subscription ID'] = ws.subscription_id\n",
    "output['Workspace'] = ws.name\n",
    "output['Resource Group'] = ws.resource_group\n",
    "output['Location'] = ws.location\n",
    "output['Project Directory'] = project_folder\n",
    "output['Run History Name'] = experiment_name\n",
    "pd.set_option('display.max_colwidth', None)  # -1 is deprecated in newer pandas\n",
    "outputDf = pd.DataFrame(data = output, index = [''])\n",
    "outputDf.T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "Upload the data to your default datastore and then load it as a `TabularDataset`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core.dataset import Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Upload training and test data to your default datastore.\n",
    "ds = ws.get_default_datastore()\n",
    "ds.upload(src_dir='./data', target_path='groupdata', overwrite=True, show_progress=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the data from your datastore.\n",
    "data = Dataset.Tabular.from_delimited_files(path=ds.path('groupdata/dominicks_OJ_2_5_8_train.csv'))\n",
    "data_test = Dataset.Tabular.from_delimited_files(path=ds.path('groupdata/dominicks_OJ_2_5_8_test.csv'))\n",
    "\n",
    "data.take(5).to_pandas_dataframe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compute\n",
    "\n",
    "#### Create or Attach existing AmlCompute\n",
    "\n",
    "You will need to create a compute target for your Automated ML run. In this tutorial, you create AmlCompute as your training compute resource.\n",
    "#### Creation of AmlCompute takes approximately 5 minutes.\n",
    "If an AmlCompute with that name is already in your workspace, this code will skip the creation process.\n",
    "As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core.compute import AmlCompute\n",
    "from azureml.core.compute import ComputeTarget\n",
    "\n",
    "# Choose a name for your cluster.\n",
    "amlcompute_cluster_name = \"cpu-cluster-11\"\n",
    "\n",
    "found = False\n",
    "# Check if this compute target already exists in the workspace.\n",
    "cts = ws.compute_targets\n",
    "if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':\n",
    "    found = True\n",
    "    print('Found existing compute target.')\n",
    "    compute_target = cts[amlcompute_cluster_name]\n",
    "\n",
    "if not found:\n",
    "    print('Creating a new compute target...')\n",
    "    provisioning_config = AmlCompute.provisioning_configuration(vm_size = \"STANDARD_D2_V2\",  # for GPU, use \"STANDARD_NC6\"\n",
    "                                                                # vm_priority = 'lowpriority', # optional\n",
    "                                                                max_nodes = 6)\n",
    "\n",
    "    # Create the cluster.\n",
    "    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)\n",
    "\n",
    "print('Checking cluster status...')\n",
    "# Can poll for a minimum number of nodes and for a specific timeout.\n",
    "# If no min_node_count is provided, it will use the scale settings for the cluster.\n",
    "compute_target.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)\n",
    "\n",
    "# For a more detailed view of current AmlCompute status, use get_status()."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## AutoMLConfig\n",
    "#### Create a base AutoMLConfig\n",
    "This configuration will be used for all the groups in the pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_column = 'Quantity'          # the quantity to forecast\n",
    "time_column_name = 'WeekStarting'   # the time axis of each series\n",
    "grain_column_names = ['Brand']      # columns defining individual series within a group\n",
    "group_column_names = ['Store']      # columns defining the groups; one model is trained per group\n",
    "max_horizon = 20                    # number of periods to forecast"
   ]
  },
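  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check (not part of the original walkthrough), you can count how many groups the pipeline will fan out over; this is the number to keep in mind when sizing the cluster with roughly one node per group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: inspect the grouping before building the pipeline.\n",
    "# Each distinct value combination in group_column_names becomes its own AutoML run.\n",
    "df_check = data.to_pandas_dataframe()\n",
    "print('Number of groups:', df_check.groupby(group_column_names).ngroups)\n",
    "print(df_check.groupby(group_column_names)[time_column_name].agg(['min', 'max', 'count']))"
   ]
  },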
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "automl_settings = {\n",
    "    \"iteration_timeout_minutes\": 5,\n",
    "    \"experiment_timeout_minutes\": 15,\n",
    "    \"primary_metric\": 'normalized_mean_absolute_error',\n",
    "    \"time_column_name\": time_column_name,\n",
    "    \"grain_column_names\": grain_column_names,\n",
    "    \"max_horizon\": max_horizon,\n",
    "    \"drop_column_names\": ['logQuantity'],\n",
    "    \"max_concurrent_iterations\": 2,\n",
    "    \"max_cores_per_iteration\": -1\n",
    "}\n",
    "base_configuration = AutoMLConfig(task = 'forecasting',\n",
    "                                  path = project_folder,\n",
    "                                  n_cross_validations=3,\n",
    "                                  **automl_settings)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pipeline\n",
    "We've written a helper script that generates the pipeline steps: one AutoML training step per group, plus an optional deployment step. Calling it returns a list of PipelineSteps that train the groups concurrently and then deploy the resulting models.\n",
    "\n",
    "This step requires an Enterprise workspace to gain access to this feature. To learn more about creating an Enterprise workspace or upgrading to an Enterprise workspace from the Azure portal, please visit our [Workspace page](https://docs.microsoft.com/azure/machine-learning/service/concept-workspace#upgrade).\n",
    "\n",
    "### Call the method to build pipeline steps\n",
    "\n",
    "`build_pipeline_steps()` takes as input:\n",
    "* **automlconfig**: The base configuration used for every AutoML step\n",
    "* **df**: The dataset to be used for training\n",
    "* **target_column**: The target column of the dataset\n",
    "* **compute_target**: The compute to be used for training\n",
    "* **group_column_names**: The columns whose distinct value combinations define the groups\n",
    "* **deploy**: Whether to deploy the models after training; if set to `True`, an extra step is added that deploys a web service hosting all the models (default is `True`)\n",
    "* **service_name**: The service name for the model query endpoint\n",
    "* **time_column_name**: The time column of the data\n",
    "\n",
    "A short illustrative sketch of how such a builder can be structured follows below."
   ]
  },
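  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the helper less of a black box, here is a minimal sketch of how such a builder can be written with the public SDK: split the data by group, materialize each group as its own `TabularDataset`, and wrap a per-group `AutoMLConfig` in an `AutoMLStep`. This is illustrative only; the `build.py` shipped next to this notebook is the authoritative implementation (it also handles the deployment step)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only -- NOT the implementation in build.py.\n",
    "import os\n",
    "from azureml.pipeline.steps import AutoMLStep\n",
    "\n",
    "def sketch_build_steps(settings, df, label, compute, group_cols):\n",
    "    \"\"\"Return one AutoMLStep per distinct combination of group column values.\"\"\"\n",
    "    steps = []\n",
    "    os.makedirs('./group_splits', exist_ok=True)\n",
    "    for key, group_df in df.groupby(group_cols):\n",
    "        key = key if isinstance(key, tuple) else (key,)\n",
    "        tag = '_'.join(str(k) for k in key)\n",
    "        # Materialize this group's rows where remote AutoML runs can read them:\n",
    "        # write a CSV, upload it to the datastore, and load it back as a dataset.\n",
    "        local_path = './group_splits/group_{}.csv'.format(tag)\n",
    "        group_df.to_csv(local_path, index=False)\n",
    "        ds.upload_files([local_path], target_path='group_splits', overwrite=True)\n",
    "        group_data = Dataset.Tabular.from_delimited_files(\n",
    "            path=ds.path('group_splits/group_{}.csv'.format(tag)))\n",
    "        cfg = AutoMLConfig(task='forecasting',\n",
    "                           training_data=group_data,\n",
    "                           label_column_name=label,\n",
    "                           compute_target=compute,\n",
    "                           n_cross_validations=3,\n",
    "                           **settings)\n",
    "        steps.append(AutoMLStep(name='automl_group_{}'.format(tag),\n",
    "                                automl_config=cfg,\n",
    "                                allow_reuse=True))\n",
    "    return steps\n",
    "\n",
    "# Example (not executed here); build.py remains the version actually used below:\n",
    "# sketch_steps = sketch_build_steps(automl_settings, data.to_pandas_dataframe(),\n",
    "#                                   target_column, compute_target, group_column_names)"
   ]
  },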
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core.webservice import Webservice\n",
    "from azureml.exceptions import WebserviceException\n",
    "\n",
    "service_name = 'grouped-model'\n",
    "\n",
    "# ACI service names must be unique within a subscription, so delete any\n",
    "# existing service with this name before the pipeline deploys a new one.\n",
    "try:\n",
    "    service = Webservice(ws, name=service_name)\n",
    "    if service:\n",
    "        service.delete()\n",
    "except WebserviceException:\n",
    "    # No existing service with this name; nothing to delete.\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from build import build_pipeline_steps\n",
    "\n",
    "steps = build_pipeline_steps(\n",
    "    base_configuration,\n",
    "    data,\n",
    "    target_column,\n",
    "    compute_target,\n",
    "    group_column_names=group_column_names,\n",
    "    deploy=True,\n",
    "    service_name=service_name,\n",
    "    time_column_name=time_column_name\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train\n",
    "Use the list of steps generated above to build the pipeline and submit it to your compute for remote training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.pipeline.core import Pipeline\n",
    "\n",
    "pipeline = Pipeline(\n",
    "    description=\"A pipeline with one model per data group using Automated ML.\",\n",
    "    workspace=ws,\n",
    "    steps=steps)\n",
    "\n",
    "pipeline_run = experiment.submit(pipeline)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.widgets import RunDetails\n",
    "RunDetails(pipeline_run).show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline_run.wait_for_completion(show_output=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test\n",
    "\n",
    "Now we can use the holdout set to test our models and ensure the web service is running as expected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core.webservice import AciWebservice\n",
    "service = AciWebservice(ws, service_name)"
   ]
  },
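  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cells below query the endpoint with `service.run`, which wraps an HTTP POST to the scoring URI. For clients without the AzureML SDK, you can send the same JSON payload yourself; the sketch below (an addition, not part of the original notebook) shows the equivalent raw request, assuming the service was deployed without key authentication."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: query the ACI endpoint over raw HTTP instead of service.run.\n",
    "# Assumes auth is disabled on the service; otherwise add an\n",
    "# 'Authorization: Bearer <key>' header using a key from service.get_keys().\n",
    "import requests\n",
    "\n",
    "def query_endpoint(payload_json):\n",
    "    headers = {'Content-Type': 'application/json'}\n",
    "    resp = requests.post(service.scoring_uri, data=payload_json, headers=headers)\n",
    "    resp.raise_for_status()\n",
    "    return resp.text"
   ]
  },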
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test = data_test.to_pandas_dataframe()\n",
    "# Drop the column we are trying to predict (the target column).\n",
    "x_pred = X_test.drop(target_column, inplace=False, axis=1)\n",
    "x_pred.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get predictions from the deployed web service.\n",
    "test_sample = x_pred.to_json()\n",
    "predictions = service.run(input_data=test_sample)\n",
    "print(predictions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert predictions from JSON to a DataFrame.\n",
    "pred_dict = json.loads(predictions)\n",
    "X_pred = pd.read_json(pred_dict['predictions'])\n",
    "X_pred.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rebuild the forecast index: the time column comes back as epoch milliseconds,\n",
    "# so parse it and index by (time, grain) to line up with the test set.\n",
    "PRED = 'pred_target'\n",
    "X_pred[time_column_name] = pd.to_datetime(X_pred[time_column_name], unit='ms')\n",
    "\n",
    "X_pred.set_index([time_column_name] + grain_column_names, inplace=True, drop=True)\n",
    "X_pred.rename({'_automl_target_col': PRED}, inplace=True, axis=1)\n",
    "# Drop all but the target column and index.\n",
    "X_pred.drop(list(set(X_pred.columns.values).difference({PRED})), axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test[time_column_name] = pd.to_datetime(X_test[time_column_name])\n",
    "X_test.set_index([time_column_name] + grain_column_names, inplace=True, drop=True)\n",
    "# Merge predictions with raw features.\n",
    "pred_test = X_test.merge(X_pred, left_index=True, right_index=True)\n",
    "pred_test.head()"
   ]
  },
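  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the MAPE computed below (over the points where the actual value $y_t$ is present and not close to zero) is, with $\\hat{y}_t$ the prediction:\n",
    "\n",
    "$$\\mathrm{MAPE} = \\frac{100}{n}\\sum_{t=1}^{n}\\left|\\frac{y_t - \\hat{y}_t}{y_t}\\right|$$"
   ]
  },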
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_absolute_error, mean_squared_error\n",
    "\n",
    "def MAPE(actual, pred):\n",
    "    \"\"\"\n",
    "    Calculate mean absolute percentage error.\n",
    "    Remove NA and values where actual is close to zero.\n",
    "    \"\"\"\n",
    "    not_na = ~(np.isnan(actual) | np.isnan(pred))\n",
    "    not_zero = ~np.isclose(actual, 0.0)\n",
    "    actual_safe = actual[not_na & not_zero]\n",
    "    pred_safe = pred[not_na & not_zero]\n",
    "    APE = 100 * np.abs((actual_safe - pred_safe) / actual_safe)\n",
    "    return np.mean(APE)\n",
    "\n",
    "def get_metrics(actuals, preds):\n",
    "    return pd.Series(\n",
    "        {\n",
    "            \"RMSE\": np.sqrt(mean_squared_error(actuals, preds)),\n",
    "            \"NormRMSE\": np.sqrt(mean_squared_error(actuals, preds)) / np.abs(actuals.max() - actuals.min()),\n",
    "            \"MAE\": mean_absolute_error(actuals, preds),\n",
    "            \"MAPE\": MAPE(actuals, preds)\n",
    "        }\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note: get_metrics expects (actuals, preds), so pass the true values first.\n",
    "get_metrics(pred_test[target_column].values, pred_test[PRED].values)"
   ]
  }
 ],
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "alyerman"
|
|
}
|
|
],
|
|
"category": "other",
|
|
"compute": [
|
|
"AML Compute"
|
|
],
|
|
"datasets": [
|
|
"Orange Juice Sales"
|
|
],
|
|
"deployment": [
|
|
"Azure Container Instance"
|
|
],
|
|
"exclude_from_index": false,
|
|
"framework": [
|
|
"Scikit-learn",
|
|
"Pytorch"
|
|
],
|
|
"friendly_name": "Automated ML Grouping with Pipeline.",
|
|
"index_order": 10,
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.6"
|
|
},
|
|
"tags": [
|
|
"AutomatedML"
|
|
],
|
|
"task": "Use AzureML Pipeline to trigger multiple Automated ML runs."
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
} |