# Training and Inferencing AutoML Forecasting Model Using Pipelines

## Introduction

In this notebook, we demonstrate how to use pipelines to train an AutoML forecasting model and run inference with it. Two pipelines will be created: one for training the AutoML model and one for running inference with it. We'll also demonstrate how to schedule the inference pipeline so you can get inference results periodically (with a refreshed test dataset). Make sure you have executed the configuration notebook before running this notebook. In this notebook you will learn how to:

- Configure AutoML using AutoMLConfig for forecasting tasks using pipeline AutoMLSteps.
- Create and register an AutoML model using an AzureML pipeline.
- Run inference with the registered model and schedule the inference pipeline.

## Setup

As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [None]:
import json
import logging
import os

from matplotlib import pyplot as plt
import pandas as pd

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig

This sample notebook may use features that are not available in previous versions of the Azure ML SDK.

In [None]:
print("This notebook was created using version 1.38.0 of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

Accessing the Azure ML workspace requires authentication with Azure.

The default authentication is interactive authentication using the default tenant. Executing the `ws = Workspace.from_config()` line in the cell below will prompt for authentication the first time that it is run.

If you have multiple Azure tenants, you can specify the tenant by replacing the `ws = Workspace.from_config()` line in the cell below with the following:
```
from azureml.core.authentication import InteractiveLoginAuthentication
auth = InteractiveLoginAuthentication(tenant_id = 'mytenantid')
ws = Workspace.from_config(auth = auth)
```
If you need to run in an environment where interactive login is not possible, you can use Service Principal authentication by replacing the `ws = Workspace.from_config()` line in the cell below with the following:
```
from azureml.core.authentication import ServicePrincipalAuthentication
auth = ServicePrincipalAuthentication('mytenantid', 'myappid', 'mypassword')
ws = Workspace.from_config(auth = auth)
```
For more details, see aka.ms/aml-notebook-auth

In [None]:
ws = Workspace.from_config()
dstor = ws.get_default_datastore()

# Choose a name for the run history container in the workspace.
experiment_name = "forecasting-pipeline"
experiment = Experiment(ws, experiment_name)

output = {}
output["Subscription ID"] = ws.subscription_id
output["Workspace"] = ws.name
output["Resource Group"] = ws.resource_group
output["Location"] = ws.location
output["Run History Name"] = experiment_name
pd.set_option("display.max_colwidth", None)
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

## Compute

#### Create or Attach existing AmlCompute

You will need to create a compute target for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.

> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

#### Creation of AmlCompute takes approximately 5 minutes. 
If the AmlCompute with that name is already in your workspace this code will skip the creation process.
As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
amlcompute_cluster_name = "forecast-step-cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_DS12_V2", max_nodes=4
    )
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)
compute_target.wait_for_completion(show_output=True)

## Data
You are now ready to load the historical orange juice sales data. For demonstration purposes, we extract sales time-series for just a few of the stores. We will load the CSV file into a plain pandas DataFrame; the time column in the CSV is called _WeekStarting_, so it will be specially parsed into the datetime type.

In [None]:
time_column_name = "WeekStarting"
train = pd.read_csv("oj-train.csv", parse_dates=[time_column_name])

train.head()

Each row in the DataFrame holds a quantity of weekly sales for an OJ brand at a single store. The data also includes the sales price, a flag indicating if the OJ brand was advertised in the store that week, and some customer demographic information based on the store location. For historical reasons, the data also includes the logarithm of the sales quantity. The Dominick's grocery data is commonly used to illustrate econometric modeling techniques where logarithms of quantities are generally preferred.

The task is now to build a time-series model for the _Quantity_ column. It is important to note that this dataset is comprised of many individual time-series - one for each unique combination of _Store_ and _Brand_. To distinguish the individual time-series, we define the **time_series_id_column_names** - the columns whose values determine the boundaries between time-series: 

In [None]:
time_series_id_column_names = ["Store", "Brand"]
nseries = train.groupby(time_series_id_column_names).ngroups
print("Data contains {0} individual time-series.".format(nseries))

### Test Splitting
We now split the data into a training and a testing set for later forecast prediction. The test set contains the final 4 weeks of observed sales for each time-series. The splits should be stratified by series, for example by using a group-by statement on the time series identifier columns. In this notebook the split has already been made, so the test set is loaded directly from `oj-test.csv`.

In [None]:
n_test_periods = 4

test = pd.read_csv("oj-test.csv", parse_dates=[time_column_name])
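
If you need to produce such a split yourself from a single DataFrame, one possible approach is sketched below. It is only an illustration: the helper `split_last_n_by_series` and the tiny demo frame are made up for this example, and the column names simply mirror the ones used in this notebook.

```python
import pandas as pd


def split_last_n_by_series(df, time_col, id_cols, n):
    """Return (train, test) where test holds the last n rows of each series."""
    df_sorted = df.sort_values(id_cols + [time_col])
    # tail(n) on a group-by keeps the final n rows of every series.
    test_part = df_sorted.groupby(id_cols).tail(n)
    train_part = df_sorted.drop(test_part.index)
    return train_part, test_part


# Made-up data: two series with six weekly observations each.
weeks = pd.date_range("1990-06-14", periods=6, freq="W-THU")
demo = pd.DataFrame(
    {
        "WeekStarting": list(weeks) * 2,
        "Store": [2] * 6 + [5] * 6,
        "Brand": ["tropicana"] * 12,
        "Quantity": range(12),
    }
)
demo_train, demo_test = split_last_n_by_series(
    demo, "WeekStarting", ["Store", "Brand"], n=4
)
print(len(demo_train), len(demo_test))
```

Splitting per series rather than on a single global cutoff row count keeps every series represented in both sets.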

### Upload data to datastore
The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace), is paired with the storage account, which contains the default data store. We will use it to upload the train and test data and create [tabular datasets](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training and testing. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into tabular representation.

In [None]:
from azureml.data.dataset_factory import TabularDatasetFactory

datastore = ws.get_default_datastore()
train_dataset = TabularDatasetFactory.register_pandas_dataframe(
    train, target=(datastore, "dataset/"), name="dominicks_OJ_train"
)

test_dataset = TabularDatasetFactory.register_pandas_dataframe(
    test, target=(datastore, "dataset/"), name="dominicks_OJ_test"
)

## Training

## Modeling

For forecasting tasks, AutoML uses pre-processing and estimation steps that are specific to time-series. AutoML will undertake the following pre-processing steps:
* Detect time-series sample frequency (e.g. hourly, daily, weekly) and create new records for absent time points to make the series regular. A regular time series has a well-defined frequency and has a value at every sample point in a contiguous time span 
* Impute missing values in the target (via forward-fill) and feature columns (using median column values) 
* Create features based on time series identifiers to enable fixed effects across different series
* Create time-based features to assist in learning seasonal patterns
* Encode categorical variables to numeric quantities

In this notebook, AutoML will train a single, regression-type model across **all** time-series in a given training set. This allows the model to generalize across related series. If you're looking for training multiple models for different time-series, please see the many-models notebook.

You are almost ready to start an AutoML training job. First, we need to define the target column.

In [None]:
target_column_name = "Quantity"

## Forecasting Parameters
To define forecasting parameters for your experiment training, you can leverage the ForecastingParameters class. The table below details the forecasting parameter we will be passing into our experiment.


|Property|Description|
|-|-|
|**time_column_name**|The name of your time column.|
|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|
|**time_series_id_column_names**|The column names used to uniquely identify the time series in data that has multiple rows with the same timestamp. If the time series identifiers are not defined, the data set is assumed to be one time series.|
|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.|

In [None]:
from azureml.automl.core.forecasting_parameters import ForecastingParameters

forecasting_parameters = ForecastingParameters(
    time_column_name=time_column_name,
    forecast_horizon=n_test_periods,
    time_series_id_column_names=time_series_id_column_names,
    freq="W-THU",  # Set the forecast frequency to be weekly (start on each Thursday)
)

automl_config = AutoMLConfig(
    task="forecasting",
    debug_log="automl_oj_sales_errors.log",
    primary_metric="normalized_mean_absolute_error",
    experiment_timeout_hours=0.25,
    training_data=train_dataset,
    label_column_name=target_column_name,
    compute_target=compute_target,
    enable_early_stopping=True,
    n_cross_validations=5,
    verbosity=logging.INFO,
    max_cores_per_iteration=-1,
    forecasting_parameters=forecasting_parameters,
)
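
Because `forecast_horizon` is expressed in units of the series frequency, a horizon of 4 on weekly data covers the 4 weeks after the last training timestamp. The sketch below shows that arithmetic with the standard library only; the helper `forecast_dates` and the last training date are hypothetical and are not taken from the dataset.

```python
from datetime import date, timedelta


def forecast_dates(last_train_date, horizon, step=timedelta(weeks=1)):
    """Return the `horizon` timestamps that follow the last training timestamp."""
    return [last_train_date + step * (i + 1) for i in range(horizon)]


# Hypothetical last training week (a Thursday); horizon matches n_test_periods = 4.
dates = forecast_dates(date(1992, 9, 3), horizon=4)
print([d.isoformat() for d in dates])
# Monday is 0 in datetime.weekday(), so Thursday is 3.
print(all(d.weekday() == 3 for d in dates))
```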

In [None]:
from azureml.pipeline.core import PipelineData, TrainingOutput
from azureml.pipeline.steps import AutoMLStep
from azureml.pipeline.core import Pipeline, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

metrics_output_name = "metrics_output"
best_model_output_name = "best_model_output"
model_file_name = "model_file"
metrics_data_name = "metrics_data"

metrics_data = PipelineData(
    name=metrics_data_name,
    datastore=datastore,
    pipeline_output_name=metrics_output_name,
    training_output=TrainingOutput(type="Metrics"),
)
model_data = PipelineData(
    name=model_file_name,
    datastore=datastore,
    pipeline_output_name=best_model_output_name,
    training_output=TrainingOutput(type="Model"),
)

automl_step = AutoMLStep(
    name="automl_module",
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    allow_reuse=False,
)

### Register Model Step

#### Run Configuration and Environment
To have a pipeline step run, we first need an environment to run the jobs. The environment can be built using the following code.

In [None]:
from azureml.core.runconfig import CondaDependencies, RunConfiguration

# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set compute target to AmlCompute
conda_run_config.target = compute_target

conda_run_config.docker.use_docker = True

cd = CondaDependencies.create(
    pip_packages=[
        "azureml-sdk[automl]",
        "applicationinsights",
        "azureml-opendatasets",
        "azureml-defaults",
    ],
    conda_packages=["numpy==1.19.5"],
    pin_sdk_version=False,
)
conda_run_config.environment.python.conda_dependencies = cd

print("run config is ready")

#### Step to register the model.
The following code creates a step that registers the model produced by the previous step in the workspace.

In [None]:
from azureml.pipeline.core import PipelineData

# The model name with which to register the trained model in the workspace.
model_name_str = "ojmodel"
model_name = PipelineParameter("model_name", default_value=model_name_str)


register_model_step = PythonScriptStep(
    script_name="register_model.py",
    name="register_model",
    source_directory="scripts",
    allow_reuse=False,
    arguments=[
        "--model_name",
        model_name,
        "--model_path",
        model_data,
        "--ds_name",
        "dominicks_OJ_train",
    ],
    inputs=[model_data],
    compute_target=compute_target,
    runconfig=conda_run_config,
)
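
The `scripts/register_model.py` script itself is not shown in this notebook. The sketch below is an assumption about what such a script might look like: the flag names mirror the `arguments` list of the step above, while the body of `main` (looking up the dataset and calling `Model.register`) is a guess at the registration logic rather than the actual script.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the register_model step passes on the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", required=True)
    parser.add_argument("--model_path", required=True)
    parser.add_argument("--ds_name", required=True)
    return parser.parse_args(argv)


def main(argv=None):
    # Imported lazily so the parsing logic can be tried without the SDK installed.
    from azureml.core import Dataset, Model, Run

    args = parse_args(argv)
    run = Run.get_context()  # the pipeline step run this script executes in
    ws = run.experiment.workspace
    train_ds = Dataset.get_by_name(ws, args.ds_name)
    Model.register(
        workspace=ws,
        model_path=args.model_path,  # the PipelineData argument arrives as a path
        model_name=args.model_name,
        datasets=[(Dataset.Scenario.TRAINING, train_ds)],
    )


# Argument list as the script might receive it (the path is made up).
example = parse_args(
    ["--model_name", "ojmodel", "--model_path", "/tmp/model_file",
     "--ds_name", "dominicks_OJ_train"]
)
print(example.model_name, example.ds_name)
```

A real script would call `main()` at module level; it is left out here so that the parsing part can be run on its own.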

### Build the Pipeline

In [None]:
training_pipeline = Pipeline(
    description="training_pipeline",
    workspace=ws,
    steps=[automl_step, register_model_step],
)

### Submit Pipeline Run

In [None]:
training_pipeline_run = experiment.submit(training_pipeline)

In [None]:
training_pipeline_run.wait_for_completion(show_output=True)

### Get metrics for each run

In [None]:
output_dir = "train_output"
pipeline_output = training_pipeline_run.get_pipeline_output("metrics_output")
pipeline_output.download(output_dir)

In [None]:
file_path = os.path.join(output_dir, pipeline_output.path_on_datastore)
with open(file_path) as f:
    metrics = json.load(f)
# Use a separate loop variable so the loaded `metrics` dict is not overwritten.
for run_id, run_metrics in metrics.items():
    print("{}: {}".format(run_id, run_metrics["normalized_root_mean_squared_error"][0]))
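
The loop above reads the downloaded file as a mapping from run id to metric name to a list of values. Assuming that shape, a small helper can pick the run with the lowest value of an error metric. The helper `best_run` and the sample numbers are made up for illustration.

```python
def best_run(metrics_by_run, metric_name):
    """Return (run_id, value) for the run with the smallest metric_name value."""
    scores = {
        run_id: run_metrics[metric_name][0]
        for run_id, run_metrics in metrics_by_run.items()
        if metric_name in run_metrics  # skip runs that did not log this metric
    }
    best_id = min(scores, key=scores.get)
    return best_id, scores[best_id]


# Made-up metrics in the same shape the loop above reads.
sample = {
    "run_0": {"normalized_root_mean_squared_error": [0.12]},
    "run_1": {"normalized_root_mean_squared_error": [0.08]},
    "run_2": {"normalized_root_mean_squared_error": [0.10]},
}
print(best_run(sample, "normalized_root_mean_squared_error"))
```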

## Inference

There are several ways to run inference. Here we demonstrate how to use the registered model and a pipeline to do so. (For how to register a model, see https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model.model?view=azure-ml-py).

### Get Inference Pipeline Environment
To trigger an inference pipeline run, we first need an environment for the run that contains all the packages required to unpickle the model. This environment can either be accessed from the training run or built from the `yml` file that comes with the model.

In [None]:
from azureml.core import Model

model = Model(ws, model_name_str)
download_path = model.download(model_name_str, exist_ok=True)

After all the files are downloaded, we can generate the run config for inference runs.

In [None]:
from azureml.core import Environment, RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

env_file = os.path.join(download_path, "conda_env_v_1_0_0.yml")
inference_env = Environment("oj-inference-env")
inference_env.python.conda_dependencies = CondaDependencies(
    conda_dependencies_file_path=env_file
)

[Optional] The environment can also be accessed from the training run using the `get_environment()` API.

After we have the environment for inference, we can build the run config based on this environment.

In [None]:
run_config = RunConfiguration()
run_config.environment = inference_env

### Build and submit the inference pipeline

The inference pipeline creates two kinds of output: 1) a tabular dataset that contains the predictions and 2) an `OutputFileDatasetConfig` that can be used by subsequent pipeline steps.

In [None]:
from azureml.data import OutputFileDatasetConfig

output_data = OutputFileDatasetConfig(name="prediction_result")

output_ds_name = "oj-output"

inference_step = PythonScriptStep(
    name="infer-results",
    source_directory="scripts",
    script_name="infer.py",
    arguments=[
        "--model_name",
        model_name_str,
        "--ouput_dataset_name",
        output_ds_name,
        "--test_dataset_name",
        test_dataset.name,
        "--target_column_name",
        target_column_name,
        "--output_path",
        output_data,
    ],
    compute_target=compute_target,
    allow_reuse=False,
    runconfig=run_config,
)

In [None]:
inference_pipeline = Pipeline(ws, [inference_step])
inference_run = experiment.submit(inference_pipeline)

In [None]:
inference_run.wait_for_completion(show_output=True)

### Get the predicted data

In [None]:
from azureml.core import Dataset

inference_ds = Dataset.get_by_name(ws, output_ds_name)
inference_df = inference_ds.to_pandas_dataframe()
inference_df.tail(5)

## Schedule Pipeline

This section shows how to schedule a pipeline for periodic predictions. For more information about pipeline schedules and pipeline endpoints, see this [notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-setup-schedule-for-a-published-pipeline.ipynb).

In [None]:
inference_published_pipeline = inference_pipeline.publish(
    name="OJ Inference Test", description="OJ Inference Test"
)
print("Newly published pipeline id: {}".format(inference_published_pipeline.id))

If `test_dataset` is refreshed every 4 weeks before Friday 16:00 and we want to predict every 4 weeks (the forecast_horizon), we can schedule our pipeline to run every 4 weeks on Friday at 16:00 to get the inference results. You can refresh your test dataset (a newer version will be created) periodically when new data is available (i.e. the target column in the test dataset has values at the beginning as context data, followed by NaNs to be predicted). The inference pipeline will pick up the context to further improve the forecast accuracy.
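
The layout of such a refreshed test set (known target values first, then NaNs for the periods to predict) can be illustrated with a few lines of plain Python. This is only a sketch of the data shape: the helper `build_test_rows` and the sample quantities are hypothetical, and a real test dataset would also carry the identifier and feature columns.

```python
import math
from datetime import date, timedelta


def build_test_rows(known, horizon, step=timedelta(weeks=1)):
    """Append `horizon` future rows whose target is NaN to the known context rows."""
    rows = list(known)
    last_week = rows[-1][0]
    for i in range(1, horizon + 1):
        rows.append((last_week + step * i, float("nan")))
    return rows


# Two made-up context weeks with observed quantities.
context = [(date(1992, 8, 27), 8000.0), (date(1992, 9, 3), 9500.0)]
rows = build_test_rows(context, horizon=4)
to_predict = [week for week, qty in rows if math.isnan(qty)]
print(len(rows), len(to_predict))
```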

In [None]:
# schedule

from azureml.pipeline.core.schedule import ScheduleRecurrence, Schedule

recurrence = ScheduleRecurrence(
    frequency="Week", interval=4, week_days=["Friday"], hours=[16], minutes=[0]
)

schedule = Schedule.create(
    workspace=ws,
    name="OJ_Inference_schedule",
    pipeline_id=inference_published_pipeline.id,
    experiment_name="Schedule-run-OJ",
    recurrence=recurrence,
    wait_for_provisioning=True,
    description="Schedule Run",
)

# You may want to make sure that the schedule is provisioned properly
# before making any further changes to the schedule

print("Created schedule with id: {}".format(schedule.id))
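
To get a feel for when a recurrence of "every 4 weeks on Friday at 16:00" would fire, the sketch below computes the run times with the standard library. It only approximates the intent of the recurrence: the helper `next_runs` and the start time are hypothetical, and the actual scheduler's handling of start time and time zone may differ.

```python
from datetime import datetime, timedelta


def next_runs(start, count, interval_weeks=4, weekday=4, hour=16):
    """List `count` run times: every `interval_weeks` weeks on the given weekday and hour."""
    # Move forward to the first matching weekday (Monday is 0, Friday is 4).
    first = start.replace(hour=hour, minute=0, second=0, microsecond=0)
    first += timedelta(days=(weekday - first.weekday()) % 7)
    if first < start:
        # The matching time today has already passed, so use next week's.
        first += timedelta(weeks=1)
    return [first + timedelta(weeks=interval_weeks * i) for i in range(count)]


# Hypothetical starting point: Monday 2022-01-03 at 09:00.
runs = next_runs(datetime(2022, 1, 3, 9, 0), count=3)
print([r.isoformat() for r in runs])
```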

### [Optional] Disable schedule

In [None]:
schedule.disable()