Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Automated Machine Learning

_**Forecasting with grouping using Pipelines**_

## Contents

1. [Introduction](#Introduction)
2. [Setup](#Setup)
3. [Data](#Data)
4. [Compute](#Compute)
5. [AutoMLConfig](#AutoMLConfig)
6. [Pipeline](#Pipeline)
7. [Train](#Train)
8. [Test](#Test)


## Introduction
In this example we use Automated ML and Pipelines to train, select, and operationalize forecasting models for multiple time-series.

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) first if you haven't already to establish your connection to the AzureML Workspace.

In this notebook you will learn how to:

* Create an Experiment in an existing Workspace.
* Configure AutoML using AutoMLConfig.
* Use our helper script to generate pipeline steps to split, train, and deploy the models.
* Explore the results.
* Test the models.

We recommend that your cluster have at least one node per group so that the groups can train concurrently.

An Enterprise workspace is required for this notebook. To learn more about creating an Enterprise workspace or upgrading to an Enterprise workspace from the Azure portal, please visit our [Workspace page.](https://docs.microsoft.com/azure/machine-learning/service/concept-workspace#upgrade)

## Setup
As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments. 

In [None]:
import json
import logging
import warnings

import numpy as np
import pandas as pd

import azureml.core

from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig

Accessing the Azure ML workspace requires authentication with Azure.

The default authentication is interactive authentication using the default tenant. Executing the `ws = Workspace.from_config()` line in the cell below will prompt for authentication the first time it is run.

If you have multiple Azure tenants, you can specify the tenant by replacing the `ws = Workspace.from_config()` line in the cell below with the following:
```
from azureml.core.authentication import InteractiveLoginAuthentication
auth = InteractiveLoginAuthentication(tenant_id = 'mytenantid')
ws = Workspace.from_config(auth = auth)
```
If you need to run in an environment where interactive login is not possible, you can use Service Principal authentication by replacing the `ws = Workspace.from_config()` line in the cell below with the following:
```
from azureml.core.authentication import ServicePrincipalAuthentication
auth = ServicePrincipalAuthentication('mytenantid', 'myappid', 'mypassword')
ws = Workspace.from_config(auth = auth)
```
For more details, see https://aka.ms/aml-notebook-auth.
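The Service Principal snippet above hard-codes its credentials for brevity. A minimal sketch of one alternative is to read them from environment variables. The variable names `AML_TENANT_ID`, `AML_PRINCIPAL_ID`, and `AML_PRINCIPAL_PASSWORD` and the helper `load_service_principal_credentials` are hypothetical, chosen only for this illustration.

```python
import os

# Hypothetical environment variable names; use whatever your environment defines.
ENV_TO_ARG = {
    "AML_TENANT_ID": "tenant_id",
    "AML_PRINCIPAL_ID": "service_principal_id",
    "AML_PRINCIPAL_PASSWORD": "service_principal_password",
}


def load_service_principal_credentials(env=None):
    """Collect the three credentials, raising KeyError if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_ARG if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {arg: env[name] for name, arg in ENV_TO_ARG.items()}


# Demonstration with an explicit mapping instead of the real environment.
example_env = {
    "AML_TENANT_ID": "mytenantid",
    "AML_PRINCIPAL_ID": "myappid",
    "AML_PRINCIPAL_PASSWORD": "mypassword",
}
creds = load_service_principal_credentials(example_env)
print(sorted(creds))

# In the notebook, the result could then be passed on as keyword arguments:
#   auth = ServicePrincipalAuthentication(**load_service_principal_credentials())
#   ws = Workspace.from_config(auth=auth)
```

Failing early with the list of missing variables tends to be easier to diagnose than an authentication error raised later by the service.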

In [None]:
ws = Workspace.from_config()
ds = ws.get_default_datastore()

# choose a name for the run history container in the workspace
experiment_name = 'automl-grouping-oj'
# project folder
project_folder = './sample_projects/{}'.format(experiment_name)

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Run History Name'] = experiment_name
# None disables column truncation (pandas >= 1.0; older versions used -1)
pd.set_option('display.max_colwidth', None)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

## Data
Upload the data to your default datastore and then load it as a `TabularDataset`.

In [None]:
from azureml.core.dataset import Dataset

In [None]:
# upload data to your default datastore
ds = ws.get_default_datastore()
ds.upload(src_dir='./data', target_path='groupdata', overwrite=True, show_progress=True)

In [None]:
# load data from your datastore
data = Dataset.Tabular.from_delimited_files(path=ds.path('groupdata/dominicks_OJ_2_5_8_train.csv'))
data_test = Dataset.Tabular.from_delimited_files(path=ds.path('groupdata/dominicks_OJ_2_5_8_test.csv'))

data.take(5).to_pandas_dataframe()

## Compute 

#### Create or Attach existing AmlCompute

You will need to create a compute target for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.

**Creation of AmlCompute takes approximately 5 minutes.** If an AmlCompute with that name is already in your workspace, this code will skip the creation process.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read the Azure documentation on the default limits and how to request more quota.

In [None]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

# Choose a name for your cluster.
amlcompute_cluster_name = "cpu-cluster-11"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets
if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing compute target.')
    compute_target = cts[amlcompute_cluster_name]
    
if not found:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2", # for GPU, use "STANDARD_NC6"
                                                                #vm_priority = 'lowpriority', # optional
                                                                max_nodes = 6)

    # Create the cluster.
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
print('Checking cluster status...')
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
compute_target.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)
    
# For a more detailed view of current AmlCompute status, use get_status().

## AutoMLConfig
#### Create a base AutoMLConfig
This configuration will be used for all the groups in the pipeline.

In [None]:
target_column = 'Quantity'
time_column_name = 'WeekStarting'
grain_column_names = ['Brand']
group_column_names = ['Store']
max_horizon = 20

In [None]:
automl_settings = {
    "iteration_timeout_minutes" : 5,
    "experiment_timeout_minutes" : 15,
    "primary_metric" : 'normalized_mean_absolute_error',
    "time_column_name": time_column_name,
    "grain_column_names": grain_column_names,
    "max_horizon": max_horizon,
    "drop_column_names": ['logQuantity'],
    "max_concurrent_iterations": 2,
    "max_cores_per_iteration": -1
}
base_configuration = AutoMLConfig(task = 'forecasting',
                             path = project_folder,
                             n_cross_validations=3,
                             **automl_settings
                            )
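To make the roles of the two column lists concrete, the sketch below counts groups and series on a small made-up frame. It assumes, as the settings above suggest, that each distinct `Store` value forms one group (trained separately) and that each `Brand` within a store is one time series. The sample rows are invented for illustration, and `max_nodes` simply repeats the cluster size from the Compute section.

```python
import pandas as pd

# Made-up sample rows; the real data comes from the uploaded CSV files.
sample = pd.DataFrame({
    "Store": [2, 2, 2, 5, 5, 8],
    "Brand": ["dominicks", "tropicana", "dominicks",
              "dominicks", "tropicana", "minute.maid"],
    "WeekStarting": pd.to_datetime(
        ["1990-06-14", "1990-06-14", "1990-06-21",
         "1990-06-14", "1990-06-14", "1990-06-14"]),
    "Quantity": [100, 80, 110, 90, 70, 60],
})

group_column_names = ["Store"]
grain_column_names = ["Brand"]
max_nodes = 6  # cluster size requested in the Compute section

# Each distinct combination of the group columns is one group.
n_groups = sample.groupby(group_column_names).ngroups

# Within each group, each distinct grain value is one time series.
series_per_group = (
    sample[group_column_names + grain_column_names]
    .drop_duplicates()
    .groupby(group_column_names)
    .size()
)

print(f"groups: {n_groups}")
print(series_per_group)
if n_groups > max_nodes:
    print("More groups than nodes: some groups may have to queue.")
```

Running the same count on the real training data is a quick way to check the one-node-per-group recommendation from the introduction.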

## Pipeline
We've written a script to generate the individual pipeline steps used to create each automl step. Calling this script will return a list of PipelineSteps that will train multiple groups concurrently and then deploy these models.

This step requires an Enterprise workspace. To learn more about creating an Enterprise workspace or upgrading to one from the Azure portal, please visit our [Workspace page](https://docs.microsoft.com/azure/machine-learning/service/concept-workspace#upgrade).

### Call the method to build pipeline steps

`build_pipeline_steps()` takes as input:
* **automlconfig**: The configuration used for every automl step
* **df**: The dataset to be used for training (the `TabularDataset` loaded above)
* **target_column**: The target column of the dataset
* **compute_target**: The compute to be used for training
* **group_column_names**: The columns used to split the data into groups, with one model trained per group
* **deploy**: Whether to deploy the models after training. If set to `True`, an extra step is added that deploys a webservice with all the models (default is `True`)
* **service_name**: The service name for the model query endpoint
* **time_column_name**: The time column of the data

In [None]:
from azureml.core.webservice import Webservice
from azureml.exceptions import WebserviceException

service_name = 'grouped-model'
try:
    # The ACI service name must be unique, so look up any existing
    # service with this name and delete it before the pipeline deploys
    # a new one.
    service = Webservice(ws, name=service_name)
    if service:
        service.delete()
except WebserviceException as e:
    # No existing service with this name; nothing to delete.
    pass

In [None]:
from build import build_pipeline_steps

steps = build_pipeline_steps(
    base_configuration, 
    data, 
    target_column,
    compute_target, 
    group_column_names=group_column_names, 
    deploy=True, 
    service_name=service_name, 
    time_column_name=time_column_name
)

## Train
Use the list of steps generated from above to build the pipeline and submit it to your compute for remote training.

In [None]:
from azureml.pipeline.core import Pipeline
pipeline = Pipeline(
    description="A pipeline with one model per data group using Automated ML.",
    workspace=ws,    
    steps=steps)

pipeline_run = experiment.submit(pipeline)

In [None]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

In [None]:
pipeline_run.wait_for_completion(show_output=False)

## Test

Now we can use the holdout set to test our models and ensure our web-service is running as expected.

In [None]:
from azureml.core.webservice import AciWebservice
service = AciWebservice(ws, service_name)

In [None]:
X_test = data_test.to_pandas_dataframe()
# Drop the column we are trying to predict (target column)
x_pred = X_test.drop(target_column, inplace=False, axis=1)
x_pred.head()

In [None]:
# Get Predictions
test_sample = X_test.drop(target_column, inplace=False, axis=1).to_json()
predictions = service.run(input_data=test_sample)
print(predictions)

In [None]:
# Convert predictions from JSON to DataFrame
pred_dict = json.loads(predictions)
X_pred = pd.read_json(pred_dict['predictions'])
X_pred.head()

In [None]:
# Fix the index
PRED = 'pred_target'
X_pred[time_column_name] = pd.to_datetime(X_pred[time_column_name], unit='ms')

X_pred.set_index([time_column_name] + grain_column_names, inplace=True, drop=True)
X_pred.rename({'_automl_target_col': PRED}, inplace=True, axis=1)
# Drop all but the target column and index
X_pred.drop(list(set(X_pred.columns.values).difference({PRED})), axis=1, inplace=True)
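The `unit='ms'` conversion above is needed because `DataFrame.to_json()` writes datetime columns as integer milliseconds since the Unix epoch by default. The sketch below shows that round trip on made-up rows. It assumes the service returns timestamps in this same default encoding, which the notebook's own conversion implies.

```python
import json

import pandas as pd

# Made-up rows standing in for the test data.
frame = pd.DataFrame({
    "WeekStarting": pd.to_datetime(["1992-07-02", "1992-07-09"]),
    "Brand": ["dominicks", "tropicana"],
})

# Default to_json() settings encode datetimes as epoch milliseconds.
payload = frame.to_json()
raw = json.loads(payload)
print(raw["WeekStarting"])  # integers, not date strings

# Rebuild the frame and convert the integers back to timestamps.
restored = pd.DataFrame(raw)
restored["WeekStarting"] = pd.to_datetime(restored["WeekStarting"], unit="ms")
print(restored)
```

Without the `unit="ms"` argument, `pd.to_datetime` would interpret the integers as nanoseconds and produce dates close to 1970.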

In [None]:
X_test[time_column_name] = pd.to_datetime(X_test[time_column_name])
X_test.set_index([time_column_name] + grain_column_names, inplace=True, drop=True)
# Merge predictions with raw features
pred_test = X_test.merge(X_pred, left_index=True, right_index=True)
pred_test.head()
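`merge` with `left_index=True, right_index=True` performs an inner join by default, so test rows that have no matching prediction are dropped without any warning. A quick check like the one below, shown on made-up data with hypothetical variable names, can reveal this before metrics are computed.

```python
import pandas as pd

index_columns = ["WeekStarting", "Brand"]

# Three actual observations (made-up values).
actuals = pd.DataFrame({
    "WeekStarting": pd.to_datetime(["1992-07-02", "1992-07-02", "1992-07-09"]),
    "Brand": ["dominicks", "tropicana", "dominicks"],
    "Quantity": [10, 20, 30],
}).set_index(index_columns)

# Predictions for only two of them.
preds = pd.DataFrame({
    "WeekStarting": pd.to_datetime(["1992-07-02", "1992-07-02"]),
    "Brand": ["dominicks", "tropicana"],
    "pred_target": [11.0, 19.0],
}).set_index(index_columns)

# Inner join on the shared index: the unmatched row disappears.
merged = actuals.merge(preds, left_index=True, right_index=True)

# Index entries present in the actuals but absent from the predictions.
missing = actuals.index.difference(preds.index)

print(f"{len(actuals)} actual rows, {len(merged)} merged rows")
print(f"rows without a prediction: {len(missing)}")
```

If `missing` is not empty for the real data, the reported metrics cover only part of the test set.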

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
def MAPE(actual, pred):
    """
    Calculate mean absolute percentage error.
    Remove NA and values where actual is close to zero
    """
    not_na = ~(np.isnan(actual) | np.isnan(pred))
    not_zero = ~np.isclose(actual, 0.0)
    actual_safe = actual[not_na & not_zero]
    pred_safe = pred[not_na & not_zero]
    APE = 100*np.abs((actual_safe - pred_safe)/actual_safe)
    return np.mean(APE)

def get_metrics(actuals, preds):
    return pd.Series(
    {
        "RMSE": np.sqrt(mean_squared_error(actuals, preds)),
        "NormRMSE": np.sqrt(mean_squared_error(actuals, preds))/np.abs(actuals.max()-actuals.min()),
        "MAE": mean_absolute_error(actuals, preds),
        "MAPE": MAPE(actuals, preds)},
    )

In [None]:
# get_metrics expects the actual values first, then the predictions
get_metrics(pred_test[target_column].values, pred_test[PRED].values)
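Argument order matters for these metrics: MAPE divides by the actual values and NormRMSE divides by the range of the actual values, so neither is symmetric. The small worked example below uses made-up numbers and a compact function, `mape`, that mirrors the masking logic of the `MAPE` helper above.

```python
import numpy as np


def mape(actual, pred):
    """Mean absolute percentage error, skipping NaNs and near-zero actuals."""
    keep = ~(np.isnan(actual) | np.isnan(pred)) & ~np.isclose(actual, 0.0)
    return np.mean(100 * np.abs((actual[keep] - pred[keep]) / actual[keep]))


actual = np.array([100.0, 0.0, 200.0, np.nan])
pred = np.array([110.0, 5.0, 180.0, 50.0])

# Only the first and third points survive the mask; each is off by 10%.
correct_order = mape(actual, pred)

# Swapping the arguments divides by the predictions instead and keeps
# a different set of points, so the result changes.
swapped_order = mape(pred, actual)

print(correct_order, swapped_order)
```

RMSE and MAE are unaffected by the order, which is why a swapped call can go unnoticed if only those two are inspected.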