MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-recipes-univariate/auto-ml-forecasting-univariate-recipe-run-experiment.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
        "\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-recipes-univariate/2_run_experiment.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Running AutoML experiments\n",
        "\n",
        "See the `auto-ml-forecasting-univariate-recipe-experiment-settings` notebook on how to determine settings for seasonal features, target lags and whether the series needs to be differenced or not. To make experimentation user-friendly, the user has to specify several parameters: DIFFERENCE_SERIES, TARGET_LAGS and STL_TYPE. Once these parameters are set, the notebook will generate correct transformations and settings to run experiments, generate forecasts, compute inference set metrics and plot forecast vs actuals. It will also convert the forecast from first differences to levels (original units of measurement) if the DIFFERENCE_SERIES parameter is set to True before calculating inference set metrics.\n",
        "\n",
        "<br/>\n",
        "\n",
        "The output generated by this notebook is saved in the `experiment_output`folder."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Setup"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "import logging\n",
        "import pandas as pd\n",
        "import numpy as np\n",
        "\n",
        "import azureml.automl.runtime\n",
        "from azureml.core.compute import AmlCompute\n",
        "from azureml.core.compute import ComputeTarget\n",
        "import matplotlib.pyplot as plt\n",
        "from helper_functions import ts_train_test_split, compute_metrics\n",
        "\n",
        "import azureml.core\n",
        "from azureml.core.workspace import Workspace\n",
        "from azureml.core.experiment import Experiment\n",
        "from azureml.train.automl import AutoMLConfig\n",
        "\n",
        "\n",
        "# set printing options\n",
        "np.set_printoptions(precision=4, suppress=True, linewidth=100)\n",
        "pd.set_option(\"display.max_columns\", 500)\n",
        "pd.set_option(\"display.width\", 1000)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As part of the setup you have already created a **Workspace**. You will also need to create a [compute target](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.\n",
        "> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "amlcompute_cluster_name = \"recipe-cluster\"\n",
        "\n",
        "found = False\n",
        "# Check if this compute target already exists in the workspace.\n",
        "cts = ws.compute_targets\n",
        "if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == \"AmlCompute\":\n",
        "    found = True\n",
        "    print(\"Found existing compute target.\")\n",
        "    compute_target = cts[amlcompute_cluster_name]\n",
        "\n",
        "if not found:\n",
        "    print(\"Creating a new compute target...\")\n",
        "    provisioning_config = AmlCompute.provisioning_configuration(\n",
        "        vm_size=\"STANDARD_D2_V2\", max_nodes=6\n",
        "    )\n",
        "\n",
        "    # Create the cluster.\\n\",\n",
        "    compute_target = ComputeTarget.create(\n",
        "        ws, amlcompute_cluster_name, provisioning_config\n",
        "    )\n",
        "\n",
        "print(\"Checking cluster status...\")\n",
        "# Can poll for a minimum number of nodes and for a specific timeout.\n",
        "# If no min_node_count is provided, it will use the scale settings for the cluster.\n",
        "compute_target.wait_for_completion(\n",
        "    show_output=True, min_node_count=None, timeout_in_minutes=20\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Data\n",
        "\n",
        "Here, we will load the data from the csv file and drop the Covid period."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "main_data_loc = \"data\"\n",
        "train_file_name = \"S4248SM144SCEN.csv\"\n",
        "\n",
        "TARGET_COLNAME = \"S4248SM144SCEN\"\n",
        "TIME_COLNAME = \"observation_date\"\n",
        "COVID_PERIOD_START = (\n",
        "    \"2020-03-01\"  # start of the covid period. To be excluded from evaluation.\n",
        ")\n",
        "\n",
        "# load data\n",
        "df = pd.read_csv(os.path.join(main_data_loc, train_file_name))\n",
        "df[TIME_COLNAME] = pd.to_datetime(df[TIME_COLNAME], format=\"%Y-%m-%d\")\n",
        "df.sort_values(by=TIME_COLNAME, inplace=True)\n",
        "\n",
        "# remove the Covid period\n",
        "df = df.query('{} <= \"{}\"'.format(TIME_COLNAME, COVID_PERIOD_START))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Set parameters\n",
        "\n",
        "The first set of parameters is based on the analysis performed in the `auto-ml-forecasting-univariate-recipe-experiment-settings` notebook. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# set parameters based on the settings notebook analysis\n",
        "DIFFERENCE_SERIES = True\n",
        "TARGET_LAGS = None\n",
        "STL_TYPE = None"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, define additional parameters to be used in the <a href=\"https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig?view=azure-ml-py\"> AutoML config </a> class.\n",
        "\n",
        "<ul> \n",
        "    <li> FORECAST_HORIZON:  The forecast horizon is the number of periods into the future that the model should predict. Here, we set the horizon to 12 periods (i.e. 12 quarters). For more discussion of forecast horizons and guiding principles for setting them, please see the <a href=\"https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand\"> energy demand notebook </a>. \n",
        "    </li>\n",
        "    <li> TIME_SERIES_ID_COLNAMES: The names of columns used to group a timeseries. It can be used to create multiple series. If time series identifier is not defined, the data set is assumed to be one time-series. This parameter is used with task type forecasting. Since we are working with a single series, this list is empty.\n",
        "    </li>\n",
        "    <li> BLOCKED_MODELS: Optional list of models to be blocked from consideration during model selection stage. At this point we want to consider all ML and Time Series models.\n",
        "        <ul>\n",
        "            <li> See the following <a href=\"https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.constants.supportedmodels.forecasting?view=azure-ml-py\"> link </a> for a list of supported Forecasting models</li>\n",
        "        </ul>\n",
        "    </li>\n",
        "</ul>\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# set other parameters\n",
        "FORECAST_HORIZON = 12\n",
        "TIME_SERIES_ID_COLNAMES = []\n",
        "BLOCKED_MODELS = []"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To run AutoML, you also need to create an **Experiment**. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# choose a name for the run history container in the workspace\n",
        "if isinstance(TARGET_LAGS, list):\n",
        "    TARGET_LAGS_STR = (\n",
        "        \"-\".join(map(str, TARGET_LAGS)) if (len(TARGET_LAGS) > 0) else None\n",
        "    )\n",
        "else:\n",
        "    TARGET_LAGS_STR = TARGET_LAGS\n",
        "\n",
        "experiment_desc = \"diff-{}_lags-{}_STL-{}\".format(\n",
        "    DIFFERENCE_SERIES, TARGET_LAGS_STR, STL_TYPE\n",
        ")\n",
        "experiment_name = \"alcohol_{}\".format(experiment_desc)\n",
        "experiment = Experiment(ws, experiment_name)\n",
        "\n",
        "output = {}\n",
        "output[\"SDK version\"] = azureml.core.VERSION\n",
        "output[\"Subscription ID\"] = ws.subscription_id\n",
        "output[\"Workspace\"] = ws.name\n",
        "output[\"SKU\"] = ws.sku\n",
        "output[\"Resource Group\"] = ws.resource_group\n",
        "output[\"Location\"] = ws.location\n",
        "output[\"Run History Name\"] = experiment_name\n",
        "pd.set_option(\"display.max_colwidth\", None)\n",
        "outputDf = pd.DataFrame(data=output, index=[\"\"])\n",
        "print(outputDf.T)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# create output directory\n",
        "output_dir = \"experiment_output/{}\".format(experiment_desc)\n",
        "if not os.path.exists(output_dir):\n",
        "    os.makedirs(output_dir)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# difference data and test for unit root\n",
        "if DIFFERENCE_SERIES:\n",
        "    df_delta = df.copy()\n",
        "    df_delta[TARGET_COLNAME] = df[TARGET_COLNAME].diff()\n",
        "    df_delta.dropna(axis=0, inplace=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# split the data into train and test set\n",
        "if DIFFERENCE_SERIES:\n",
        "    # generate train/inference sets using data in first differences\n",
        "    df_train, df_test = ts_train_test_split(\n",
        "        df_input=df_delta,\n",
        "        n=FORECAST_HORIZON,\n",
        "        time_colname=TIME_COLNAME,\n",
        "        ts_id_colnames=TIME_SERIES_ID_COLNAMES,\n",
        "    )\n",
        "else:\n",
        "    df_train, df_test = ts_train_test_split(\n",
        "        df_input=df,\n",
        "        n=FORECAST_HORIZON,\n",
        "        time_colname=TIME_COLNAME,\n",
        "        ts_id_colnames=TIME_SERIES_ID_COLNAMES,\n",
        "    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Upload files to the Datastore\n",
        "The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace) is paired with the storage account, which contains the default data store. We will use it to upload the bike share data and create [tabular dataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into tabular representation."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "df_train.to_csv(\"train.csv\", index=False)\n",
        "df_test.to_csv(\"test.csv\", index=False)\n",
        "\n",
        "datastore = ws.get_default_datastore()\n",
        "datastore.upload_files(\n",
        "    files=[\"./train.csv\"],\n",
        "    target_path=\"uni-recipe-dataset/tabular/\",\n",
        "    overwrite=True,\n",
        "    show_progress=True,\n",
        ")\n",
        "datastore.upload_files(\n",
        "    files=[\"./test.csv\"],\n",
        "    target_path=\"uni-recipe-dataset/tabular/\",\n",
        "    overwrite=True,\n",
        "    show_progress=True,\n",
        ")\n",
        "\n",
        "from azureml.core import Dataset\n",
        "\n",
        "train_dataset = Dataset.Tabular.from_delimited_files(\n",
        "    path=[(datastore, \"uni-recipe-dataset/tabular/train.csv\")]\n",
        ")\n",
        "test_dataset = Dataset.Tabular.from_delimited_files(\n",
        "    path=[(datastore, \"uni-recipe-dataset/tabular/test.csv\")]\n",
        ")\n",
        "\n",
        "# print the first 5 rows of the Dataset\n",
        "train_dataset.to_pandas_dataframe().reset_index(drop=True).head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Config AutoML"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "time_series_settings = {\n",
        "    \"time_column_name\": TIME_COLNAME,\n",
        "    \"forecast_horizon\": FORECAST_HORIZON,\n",
        "    \"target_lags\": TARGET_LAGS,\n",
        "    \"use_stl\": STL_TYPE,\n",
        "    \"blocked_models\": BLOCKED_MODELS,\n",
        "    \"time_series_id_column_names\": TIME_SERIES_ID_COLNAMES,\n",
        "}\n",
        "\n",
        "automl_config = AutoMLConfig(\n",
        "    task=\"forecasting\",\n",
        "    debug_log=\"sample_experiment.log\",\n",
        "    primary_metric=\"normalized_root_mean_squared_error\",\n",
        "    experiment_timeout_minutes=20,\n",
        "    iteration_timeout_minutes=5,\n",
        "    enable_early_stopping=True,\n",
        "    training_data=train_dataset,\n",
        "    label_column_name=TARGET_COLNAME,\n",
        "    n_cross_validations=\"auto\",  # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
        "    cv_step_size=\"auto\",\n",
        "    verbosity=logging.INFO,\n",
        "    max_cores_per_iteration=-1,\n",
        "    compute_target=compute_target,\n",
        "    **time_series_settings,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We will now run the experiment, you can go to Azure ML portal to view the run details."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "remote_run = experiment.submit(automl_config, show_output=False)\n",
        "remote_run.wait_for_completion()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Retrieve the Best Run details\n",
        "Below we retrieve the best Run object from among all the runs in the experiment."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "best_run = remote_run.get_best_child()\n",
        "best_run"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Inference\n",
        "\n",
        "We now use the best fitted model from the AutoML Run to make forecasts for the test set. We will do batch scoring on the test dataset which should have the same schema as training dataset.\n",
        "\n",
        "The inference will run on a remote compute. In this example, it will re-use the training compute."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "test_experiment = Experiment(ws, experiment_name + \"_inference\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Retreiving forecasts from the model\n",
        "We have created a function called `run_forecast` that submits the test data to the best model determined during the training run and retrieves forecasts. This function uses a helper script `forecasting_script` which is uploaded and expecuted on the remote compute."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from run_forecast import run_remote_inference\n",
        "\n",
        "remote_run = run_remote_inference(\n",
        "    test_experiment=test_experiment,\n",
        "    compute_target=compute_target,\n",
        "    train_run=best_run,\n",
        "    test_dataset=test_dataset,\n",
        "    target_column_name=TARGET_COLNAME,\n",
        ")\n",
        "remote_run.wait_for_completion(show_output=False)\n",
        "\n",
        "remote_run.download_file(\"outputs/predictions.csv\", f\"{output_dir}/predictions.csv\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Download the prediction result for metrics calcuation\n",
        "The test data with predictions are saved in artifact `outputs/predictions.csv`. We will use it to calculate accuracy metrics and vizualize predictions versus actuals."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "X_trans = pd.read_csv(f\"{output_dir}/predictions.csv\", parse_dates=[TIME_COLNAME])\n",
        "X_trans.head()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# convert forecast in differences to levels\n",
        "def convert_fcst_diff_to_levels(fcst, yt, df_orig):\n",
        "    \"\"\"Convert forecast from first differences to levels.\"\"\"\n",
        "    fcst = fcst.reset_index(drop=False, inplace=False)\n",
        "    fcst[\"predicted_level\"] = fcst[\"predicted\"].cumsum()\n",
        "    fcst[\"predicted_level\"] = fcst[\"predicted_level\"].astype(float) + float(yt)\n",
        "    # merge actuals\n",
        "    out = pd.merge(\n",
        "        fcst, df_orig[[TIME_COLNAME, TARGET_COLNAME]], on=[TIME_COLNAME], how=\"inner\"\n",
        "    )\n",
        "    out.rename(columns={TARGET_COLNAME: \"actual_level\"}, inplace=True)\n",
        "    return out"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "if DIFFERENCE_SERIES:\n",
        "    # convert forecast in differences to the levels\n",
        "    INFORMATION_SET_DATE = max(df_train[TIME_COLNAME])\n",
        "    YT = df.query(\"{} == @INFORMATION_SET_DATE\".format(TIME_COLNAME))[TARGET_COLNAME]\n",
        "\n",
        "    fcst_df = convert_fcst_diff_to_levels(fcst=X_trans, yt=YT, df_orig=df)\n",
        "else:\n",
        "    fcst_df = X_trans.copy()\n",
        "    fcst_df[\"actual_level\"] = y_test\n",
        "    fcst_df[\"predicted_level\"] = y_predictions\n",
        "\n",
        "del X_trans"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Calculate metrics and save output"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# compute metrics\n",
        "metrics_df = compute_metrics(fcst_df=fcst_df, metric_name=None, ts_id_colnames=None)\n",
        "# save output\n",
        "metrics_file_name = \"{}_metrics.csv\".format(experiment_name)\n",
        "fcst_file_name = \"{}_forecst.csv\".format(experiment_name)\n",
        "plot_file_name = \"{}_plot.pdf\".format(experiment_name)\n",
        "\n",
        "metrics_df.to_csv(os.path.join(output_dir, metrics_file_name), index=True)\n",
        "fcst_df.to_csv(os.path.join(output_dir, fcst_file_name), index=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Generate and save visuals"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "plot_df = df.query('{} > \"2010-01-01\"'.format(TIME_COLNAME))\n",
        "plot_df.set_index(TIME_COLNAME, inplace=True)\n",
        "fcst_df.set_index(TIME_COLNAME, inplace=True)\n",
        "\n",
        "# generate and save plots\n",
        "fig, ax = plt.subplots(dpi=180)\n",
        "ax.plot(plot_df[TARGET_COLNAME], \"-g\", label=\"Historical\")\n",
        "ax.plot(fcst_df[\"actual_level\"], \"-b\", label=\"Actual\")\n",
        "ax.plot(fcst_df[\"predicted_level\"], \"-r\", label=\"Forecast\")\n",
        "ax.legend()\n",
        "ax.set_title(\"Forecast vs Actuals\")\n",
        "ax.set_xlabel(TIME_COLNAME)\n",
        "ax.set_ylabel(TARGET_COLNAME)\n",
        "locs, labels = plt.xticks()\n",
        "\n",
        "plt.setp(labels, rotation=45)\n",
        "plt.savefig(os.path.join(output_dir, plot_file_name))"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "vlbejan"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.5"
    },
    "vscode": {
      "interpreter": {
        "hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}