Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Automated Machine Learning
_**Model backtesting**_

## Contents
1. [Introduction](#Introduction)
2. [Setup](#Setup)
3. [Data](#Data)
4. [Prepare remote compute and data.](#prepare_remote)
5. [Create the configuration for AutoML backtesting](#train)
6. [Backtest AutoML](#backtest_automl)
7. [View metrics](#Metrics)
8. [Backtest the best model](#backtest_model)

## Introduction
Model backtesting evaluates a model's performance on historical data. To do that, we step back through the data set by the backtesting period several times and, at each step, split the data into train and test sets. These sets are then used to train and evaluate the model.<br>
This notebook demonstrates backtesting on a single model, which is the best solution for small data sets containing one or a few time series. For scenarios where we would like to choose the best AutoML model for every backtest iteration, please see the [AutoML Forecasting Backtest Many Models Example](../forecasting-backtest-many-models/auto-ml-forecasting-backtest-many-models.ipynb) notebook.
![Backtesting](Backtesting.png)
This notebook demonstrates two ways of backtesting:
- AutoML backtesting: we will train separate AutoML models for historical data
- Model backtesting: from the first run we will select the best model trained on the most recent data, retrain it on the past data and evaluate.
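
The splitting idea described above can be sketched on a toy data set. This is only an illustration under stated assumptions: `make_backtest_folds` is a hypothetical helper, not the code used by the pipeline in this notebook, and it assumes that each iteration steps back from the end of the data and holds out the last `horizon` periods of its fold for testing.

```python
import pandas as pd


def make_backtest_folds(df, time_col, step_size, step_number, horizon):
    """Yield (train, test) pairs, stepping back from the end of the data.

    Hypothetical helper for illustration; not the notebook's pipeline code.
    """
    dates = df[time_col].drop_duplicates().sort_values().reset_index(drop=True)
    for i in range(step_number):
        fold_end = len(dates) - i * step_size  # exclusive end of this fold
        train_end = fold_end - horizon  # exclusive end of the training part
        if train_end <= 0:
            raise ValueError(f"Not enough data for backtest iteration {i}.")
        train_dates = dates.iloc[:train_end]
        test_dates = dates.iloc[train_end:fold_end]
        yield df[df[time_col].isin(train_dates)], df[df[time_col].isin(test_dates)]


# Ten daily observations, two backtest iterations, three-day step, two-day horizon.
toy = pd.DataFrame(
    {"date": pd.date_range("2000-01-01", periods=10, freq="D"), "y": range(10)}
)
folds = list(make_backtest_folds(toy, "date", step_size=3, step_number=2, horizon=2))
for train, test in folds:
    print(len(train), "training rows; test dates:", list(test["date"].dt.date))
```

Because the split is made on unique dates, the same sketch also works when the frame holds several series that share one time axis.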

## Setup

In [None]:
import os
import numpy as np
import pandas as pd
import shutil

import azureml.core
from azureml.core import Experiment, Model, Workspace

This notebook is compatible with Azure ML SDK version 1.35.1 or later.

In [None]:
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

As part of the setup you have already created a <b>Workspace</b>.

In [None]:
ws = Workspace.from_config()

output = {}
output["Subscription ID"] = ws.subscription_id
output["Workspace"] = ws.name
output["SKU"] = ws.sku
output["Resource Group"] = ws.resource_group
output["Location"] = ws.location
output["SDK Version"] = azureml.core.VERSION
pd.set_option("display.max_colwidth", None)
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

## Data
For demonstration purposes we will simulate one year of daily data. To do this we need to specify the following parameters: the time column name, the time series ID column names, and the label column name. Our intention is to forecast two weeks ahead.

In [None]:
TIME_COLUMN_NAME = "date"
TIME_SERIES_ID_COLUMN_NAMES = "time_series_id"
LABEL_COLUMN_NAME = "y"
FORECAST_HORIZON = 14
FREQUENCY = "D"


def simulate_timeseries_data(
    train_len: int,
    test_len: int,
    time_column_name: str,
    target_column_name: str,
    time_series_id_column_name: str,
    time_series_number: int = 1,
    freq: str = "H",
):
    """
    Return simulated time series of the requested length.

    :param train_len: The length of training data (one series).
    :type train_len: int
    :param test_len: The length of testing data (one series).
    :type test_len: int
    :param time_column_name: The desired name of a time column.
    :type time_column_name: str
    :param target_column_name: The desired name of the target (label) column.
    :type target_column_name: str
    :param time_series_id_column_name: The desired name of the time series ID column.
    :type time_series_id_column_name: str
    :param time_series_number: The number of time series in the data set.
    :type time_series_number: int
    :param freq: The frequency string representing pandas offset.
                 see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
    :type freq: str
    :returns: the tuple (train features, train labels, test features, test labels).
    :rtype: tuple

    """
    data_train = []  # type: List[pd.DataFrame]
    data_test = []  # type: List[pd.DataFrame]
    data_length = train_len + test_len
    for i in range(time_series_number):
        X = pd.DataFrame(
            {
                time_column_name: pd.date_range(
                    start="2000-01-01", periods=data_length, freq=freq
                ),
                target_column_name: np.arange(data_length).astype(float)
                + np.random.rand(data_length)
                + i * 5,
                "ext_predictor": np.asarray(range(42, 42 + data_length)),
                time_series_id_column_name: np.repeat("ts{}".format(i), data_length),
            }
        )
        data_train.append(X[:train_len])
        data_test.append(X[train_len:])
    train = pd.concat(data_train)
    label_train = train.pop(target_column_name).values
    test = pd.concat(data_test)
    label_test = test.pop(target_column_name).values
    return train, label_train, test, label_test


n_test_periods = FORECAST_HORIZON
n_train_periods = 365
X_train, y_train, X_test, y_test = simulate_timeseries_data(
    train_len=n_train_periods,
    test_len=n_test_periods,
    time_column_name=TIME_COLUMN_NAME,
    target_column_name=LABEL_COLUMN_NAME,
    time_series_id_column_name=TIME_SERIES_ID_COLUMN_NAMES,
    time_series_number=2,
    freq=FREQUENCY,
)
X_train[LABEL_COLUMN_NAME] = y_train

Let's see what the training data looks like.

In [None]:
X_train.tail()
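
Before uploading the data, it can be useful to confirm that every series is regularly spaced and free of duplicate timestamps, since the forecast horizon is expressed in units of the series frequency. The following is a minimal sketch on a toy frame; the column names mirror the ones used in this notebook, but the frame itself is made up for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame: two daily series sharing the same dates.
dates = pd.date_range("2000-01-01", periods=30, freq="D")
toy = pd.DataFrame(
    {
        "date": np.tile(dates, 2),
        "time_series_id": np.repeat(["ts0", "ts1"], len(dates)),
        "y": np.arange(2 * len(dates), dtype=float),
    }
)

inferred_freq = {}
for series_id, one_series in toy.groupby("time_series_id"):
    stamps = pd.DatetimeIndex(one_series["date"]).sort_values()
    if stamps.has_duplicates:
        raise ValueError(f"Series {series_id} has duplicate timestamps.")
    # infer_freq returns None when the spacing is irregular.
    inferred_freq[series_id] = pd.infer_freq(stamps)

print(inferred_freq)
```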

## Prepare remote compute and data. <a id="prepare_remote"></a>
The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace) is paired with a storage account, which contains the default data store. We will use it to upload the artificial data and create a [tabular dataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into a tabular representation.

In [None]:
from azureml.data.dataset_factory import TabularDatasetFactory

ds = ws.get_default_datastore()
# Upload saved data to the default data store.
train_data = TabularDatasetFactory.register_pandas_dataframe(
    X_train, target=(ds, "data"), name="data_backtest"
)

You will need to create a compute target for backtesting. In this [tutorial](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute), you create AmlCompute as your training compute resource.

> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
amlcompute_cluster_name = "backtest-cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_DS12_V2", max_nodes=6
    )
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

## Create the configuration for AutoML backtesting <a id="train"></a>

This dictionary defines the AutoML settings. For this forecasting task we need to define several settings, including the name of the time column, the maximum forecast horizon, and the time series ID column names.

| Property                           | Description|
| :---------------                   | :------------------- |
| **task**                           | forecasting |
| **primary_metric**                 | This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>normalized_root_mean_squared_error</i><br><i>normalized_mean_absolute_error</i> |
| **iteration_timeout_minutes**      | Maximum amount of time in minutes that the model can train. This is optional but provides customers with greater control on exit criteria. |
| **iterations**                     | Number of models to train. This is optional but provides customers with greater control on exit criteria. |
| **experiment_timeout_hours**       | Maximum amount of time in hours that the experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |
| **label_column_name**              | The name of the label column. |
| **max_horizon**               | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |
| **n_cross_validations**            | Number of cross validation splits. The default value is "auto", in which case AutoML determines the number of cross-validations automatically, if a validation set is not provided. Alternatively, users can specify an integer value. Rolling Origin Validation is used to split time-series in a temporally consistent way. |
| **cv_step_size**                   | Number of periods between two consecutive cross-validation folds. The default value is "auto", in which case AutoML determines the cross-validation step size automatically, if a validation set is not provided. Alternatively, users can specify an integer value. |
| **time_column_name**               | The name of your time column. |
| **grain_column_names**     | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |
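
To make the primary metric more concrete, here is a minimal sketch of a normalized root mean squared error. It assumes the common definition in which the RMSE is divided by the range of the actual values; the exact normalization AutoML applies (for example, per time series) may differ, and `normalized_rmse` is a hypothetical helper, not an AutoML function.

```python
import numpy as np


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range of the actual values (assumed normalization)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    value_range = y_true.max() - y_true.min()
    if value_range == 0:
        raise ValueError("The actual values are constant; the range is zero.")
    return rmse / value_range


nrmse = normalized_rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(nrmse)
```

Dividing by the range makes the value unitless, so errors of series measured on different scales become roughly comparable.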

In [None]:
automl_settings = {
    "task": "forecasting",
    "primary_metric": "normalized_root_mean_squared_error",
    "iteration_timeout_minutes": 10,  # Adjust to the data set: check how long training takes before setting this value.
    "iterations": 15,
    "experiment_timeout_hours": 1,  # Also depends on the data set: larger data sets need a larger value.
    "label_column_name": LABEL_COLUMN_NAME,
    "n_cross_validations": "auto",  # Feel free to set to a small integer (>=2) if runtime is an issue.
    "cv_step_size": "auto",
    "time_column_name": TIME_COLUMN_NAME,
    "max_horizon": FORECAST_HORIZON,
    "track_child_runs": False,
    "grain_column_names": TIME_SERIES_ID_COLUMN_NAMES,
}

## Backtest AutoML <a id="backtest_automl"></a>
First we set the backtesting parameters: we will step back by 30 days and make 5 such steps; for each step we will forecast the next two weeks.

In [None]:
# The number of periods to step back on each backtest iteration.
BACKTESTING_PERIOD = 30
# The number of times we will back test the model.
NUMBER_OF_BACKTESTS = 5
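
A quick way to check that these parameters leave enough history for the earliest iteration is to tabulate the fold sizes. The sketch below copies the values used in this notebook (365 training periods, a 14-period horizon) and assumes, for illustration only, that each iteration steps back from the end of the data and holds out the last 14 periods of its fold; the layout produced by the actual pipeline may differ.

```python
import pandas as pd

# Values copied from this notebook; the fold layout itself is an assumption.
series_length = 365
step = 30
n_backtests = 5
horizon = 14

rows = []
for i in range(n_backtests):
    fold_length = series_length - i * step  # periods visible to iteration i
    rows.append(
        {
            "iteration": i,
            "train_periods": fold_length - horizon,
            "test_periods": horizon,
        }
    )
layout = pd.DataFrame(rows)
print(layout)
```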

To train AutoML on the backtesting folds we will use the [Azure Machine Learning pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines). It will generate the backtest folds, train a model for each of them, and calculate the accuracy metrics. To run the pipeline, you also need to create an <b>Experiment</b>. An Experiment corresponds to a prediction problem you are trying to solve (here, forecasting), while a Run corresponds to a specific approach to the problem.

In [None]:
from uuid import uuid1

from pipeline_helper import get_backtest_pipeline

pipeline_exp = Experiment(ws, "automl-backtesting")

# We will create the unique identifier to mark our models.
model_uid = str(uuid1())

pipeline = get_backtest_pipeline(
    experiment=pipeline_exp,
    dataset=train_data,
    # The STANDARD_DS12_V2 has 4 vCPUs per node; we will set 2 processes per node to be safe.
    process_per_node=2,
    # The maximum number of nodes for our compute is 6.
    node_count=6,
    compute_target=compute_target,
    automl_settings=automl_settings,
    step_size=BACKTESTING_PERIOD,
    step_number=NUMBER_OF_BACKTESTS,
    model_uid=model_uid,
    forecast_quantiles=[0.025, 0.975],  # Optional
)

Run the pipeline and wait for results.

In [None]:
pipeline_run = pipeline_exp.submit(pipeline)
pipeline_run.wait_for_completion(show_output=False)

After the run is complete, we can download the results. 

In [None]:
metrics_output = pipeline_run.get_pipeline_output("results")
metrics_output.download("backtest_metrics")

## View metrics<a id="Metrics"></a>
To distinguish these metrics from the model backtest metrics, which we will obtain in the next section, we will move the directory with metrics out of backtest_metrics and remove the parent folder. We will create a utility function for that.

In [None]:
def copy_scoring_directory(new_name):
    scores_path = os.path.join("backtest_metrics", "azureml")
    directory_list = [os.path.join(scores_path, d) for d in os.listdir(scores_path)]
    latest_file = max(directory_list, key=os.path.getctime)
    print(
        f"The output directory {latest_file} was created on {pd.Timestamp(os.path.getctime(latest_file), unit='s')} GMT."
    )
    shutil.move(os.path.join(latest_file, "results"), new_name)
    shutil.rmtree("backtest_metrics")

Move the directory and list its contents.

In [None]:
copy_scoring_directory("automl_backtest")
pd.DataFrame({"File": os.listdir("automl_backtest")})

The directory contains a set of files with results:
- forecast.csv contains forecasts for all backtest iterations. The backtest_iteration column contains the iteration identifier with the last training date as a suffix.
- scores.csv contains all metrics. If the data set contains several time series, metrics are reported for every combination of time series ID and iteration; scores aggregated over all iterations and time series IDs are marked as "all_sets".
- plots_fcst_vs_actual.pdf contains the forecast vs. actuals plots for each iteration and time series.

For demonstration purposes we will display the table of metrics for one of the time series, with ID "ts0". Again, we will create a utility function, which will be reused in model backtesting.

In [None]:
def get_metrics_for_ts(all_metrics, ts):
    """
    Get the metrics for the time series with ID ts and return it as pandas data frame.

    :param all_metrics: The table with all the metrics.
    :param ts: The ID of a time series of interest.
    :return: The pandas DataFrame with metrics for one time series.
    """
    results_df = None
    for ts_id, one_series in all_metrics.groupby("time_series_id"):
        if not ts_id.startswith(ts):
            continue
        iteration = ts_id.split("|")[-1]
        # Build a new frame instead of modifying a slice in place,
        # which avoids pandas' SettingWithCopyWarning.
        df = (
            one_series[["metric_name", "metric"]]
            .rename(columns={"metric": iteration})
            .set_index("metric_name")
        )
        if results_df is None:
            results_df = df
        else:
            results_df = results_df.merge(
                df, how="inner", left_index=True, right_index=True
            )
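    # Fail with a clear message if no series matched the requested ID.
    if results_df is None:
        raise ValueError(f"No metrics found for time series ID '{ts}'.")
    return results_df.sort_index(axis=1)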


metrics_df = pd.read_csv(os.path.join("automl_backtest", "scores.csv"))
ts_id = "ts0"
get_metrics_for_ts(metrics_df, ts_id)
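
The function above relies on a particular layout of the scores table: the `time_series_id` column holds the series ID and the iteration joined by `|`, with one row per metric. The sketch below reproduces that layout with made-up values and iteration labels, and builds the same kind of table with a pivot. It is an alternative illustration, not a replacement for the function above.

```python
import pandas as pd

# Hypothetical scores: "<series id>|<iteration>" in time_series_id, one row per metric.
toy_scores = pd.DataFrame(
    {
        "time_series_id": [
            "ts0|iter_1", "ts0|iter_1",
            "ts0|iter_2", "ts0|iter_2",
            "ts1|iter_1", "ts1|iter_1",
        ],
        "metric_name": ["mae", "rmse"] * 3,
        "metric": [1.0, 1.5, 0.8, 1.2, 2.0, 2.5],
    }
)

# Split the combined identifier into its two parts.
parts = toy_scores["time_series_id"].str.split("|")
toy_scores["series"] = parts.str[0]
toy_scores["iteration"] = parts.str[-1]

# One row per metric, one column per backtest iteration, for a single series.
table = (
    toy_scores[toy_scores["series"] == "ts0"]
    .pivot(index="metric_name", columns="iteration", values="metric")
    .sort_index(axis=1)
)
print(table)
```

Matching the series ID exactly, rather than by prefix, avoids mixing up IDs such as "ts1" and "ts10" when the data set contains many series.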

Forecast vs actuals plots.

In [None]:
from IPython.display import IFrame

IFrame("./automl_backtest/plots_fcst_vs_actual.pdf", width=800, height=300)

## Backtest the best model <a id="backtest_model"></a>

For model backtesting we will use the same parameters we used to backtest AutoML. All the models we obtained in the previous run were registered in our workspace. To identify the models, each was assigned a tag with the last training date.

In [None]:
model_list = Model.list(ws, tags=[["experiment", "automl-backtesting"]])
model_data = {"name": [], "last_training_date": []}
for model in model_list:
    if (
        "last_training_date" not in model.tags
        or "model_uid" not in model.tags
        or model.tags["model_uid"] != model_uid
    ):
        continue
    model_data["name"].append(model.name)
    model_data["last_training_date"].append(
        pd.Timestamp(model.tags["last_training_date"])
    )
df_models = pd.DataFrame(model_data)
df_models.sort_values(["last_training_date"], inplace=True)
df_models.reset_index(inplace=True, drop=True)
df_models

We will backtest the model trained on the most recent data.

In [None]:
model_name = df_models["name"].iloc[-1]
model_name

### Retrain the models.
Assemble the pipeline, which will retrain the best model from the AutoML run on historical data.

In [None]:
pipeline_exp = Experiment(ws, "model-backtesting")

pipeline = get_backtest_pipeline(
    experiment=pipeline_exp,
    dataset=train_data,
    # The STANDARD_DS12_V2 has 4 vCPUs per node; we will set 2 processes per node to be safe.
    process_per_node=2,
    # The maximum number of nodes for our compute is 6.
    node_count=6,
    compute_target=compute_target,
    automl_settings=automl_settings,
    step_size=BACKTESTING_PERIOD,
    step_number=NUMBER_OF_BACKTESTS,
    model_name=model_name,
    forecast_quantiles=[0.025, 0.975],
)

Launch the backtesting pipeline.

In [None]:
pipeline_run = pipeline_exp.submit(pipeline)
pipeline_run.wait_for_completion(show_output=False)

The metrics are stored in the pipeline output named "results". The next code will download the table with metrics.

In [None]:
metrics_output = pipeline_run.get_pipeline_output("results")
metrics_output.download("backtest_metrics")

Again, we will copy the data files from the downloaded directory, but in this case we will call the folder "model_backtest"; it will contain the same files as the one for AutoML backtesting.

In [None]:
copy_scoring_directory("model_backtest")

Finally, we will display the metrics.

In [None]:
model_metrics_df = pd.read_csv(os.path.join("model_backtest", "scores.csv"))
get_metrics_for_ts(model_metrics_df, ts_id)
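
To compare the two backtests, the two metric tables can be placed side by side. The sketch below uses small made-up tables in the shape returned by `get_metrics_for_ts` (metrics as rows, iterations as columns); the values and iteration labels are hypothetical.

```python
import pandas as pd

index = pd.Index(["mae", "rmse"], name="metric_name")
# Hypothetical stand-ins for the AutoML and model backtest tables.
automl_table = pd.DataFrame({"iter_1": [1.0, 1.5], "iter_2": [0.75, 1.25]}, index=index)
model_table = pd.DataFrame({"iter_1": [1.25, 1.5], "iter_2": [1.0, 1.75]}, index=index)

# Side-by-side view with a top-level column label for each backtest.
comparison = pd.concat({"automl": automl_table, "model": model_table}, axis=1)
# Positive values mean the retrained model scored worse on that iteration.
difference = model_table - automl_table

print(comparison)
print(difference)
```

In the notebook, the two toy tables would be replaced by the results of `get_metrics_for_ts(metrics_df, ts_id)` and `get_metrics_for_ts(model_metrics_df, ts_id)`.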

Forecast vs actuals plots.

In [None]:
from IPython.display import IFrame

IFrame("./model_backtest/plots_fcst_vs_actual.pdf", width=800, height=300)