Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Automated Machine Learning
_**Regression on remote compute using Computer Hardware dataset with model explanations**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Train](#Train)
1. [Results](#Results)
1. [Explanations](#Explanations)

## Introduction

In this example we use the Hardware Performance Dataset to showcase how you can use AutoML for a simple regression problem. After training AutoML models on this regression dataset, we show how you can compute model explanations on your remote compute using a sample explainer script.

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration](../../../configuration.ipynb) notebook first, if you haven't already, to establish your connection to the AzureML Workspace.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Configure AutoML using `AutoMLConfig`.
3. Train the model using remote compute.
4. Explore the results.
5. Set up remote compute for computing the model explanations for a given AutoML model.
6. Start an AzureML experiment on your remote compute to compute explanations for an AutoML model.
7. Download the feature importance for engineered features and visualize the explanations for engineered features. 
8. Download the feature importance for raw features and visualize the explanations for raw features. 


## Setup

As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [None]:
import logging

from matplotlib import pyplot as plt
import pandas as pd
import os

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
from azureml.train.automl import AutoMLConfig

In [None]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'automl-regression-computer-hardware'

experiment=Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

### Create or Attach existing AmlCompute
You will need to create a compute target for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.
#### Creation of AmlCompute takes approximately 5 minutes. 
If an AmlCompute with that name is already in your workspace, this code will skip the creation process.
As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. See the Azure Machine Learning documentation on the default limits and how to request more quota.

In [None]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

# Choose a name for your cluster.
amlcompute_cluster_name = "automlcl"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets
if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing compute target.')
    compute_target = cts[amlcompute_cluster_name]
    
if not found:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2", # for GPU, use "STANDARD_NC6"
                                                                #vm_priority = 'lowpriority', # optional
                                                                max_nodes = 6)

    # Create the cluster.
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
print('Checking cluster status...')
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
compute_target.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)
    
# For a more detailed view of current AmlCompute status, use get_status().
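
The `wait_for_completion` call above polls the cluster until it is ready or the timeout expires. As a rough illustration of that pattern only (this is not the SDK's implementation), the sketch below uses a hypothetical helper `wait_until` and made-up state strings to show polling with a deadline.

```python
import time


def wait_until(get_state, done_states, timeout_seconds, poll_seconds=5.0):
    """Poll get_state() until it returns a state in done_states.

    Hypothetical helper for illustration; raises TimeoutError once the
    deadline has passed without reaching a final state.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in done_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError('Still in state %r after %s seconds' % (state, timeout_seconds))
        time.sleep(poll_seconds)


# Simulated sequence of states (made-up strings, not real SDK output).
states = iter(['Creating', 'Resizing', 'Succeeded'])
final_state = wait_until(lambda: next(states), {'Succeeded', 'Failed'},
                         timeout_seconds=60, poll_seconds=0)
print(final_state)
```

A generous timeout such as the 20 minutes used above leaves room for the cluster creation time mentioned earlier.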

### Conda Dependencies for AutoML training experiment

Create the conda dependencies for running the AutoML experiment on remote compute.

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
import pkg_resources

# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set compute target to AmlCompute
conda_run_config.target = compute_target
conda_run_config.environment.docker.enabled = True

cd = CondaDependencies.create(conda_packages=['numpy','py-xgboost<=0.80'])
conda_run_config.environment.python.conda_dependencies = cd

### Setup Training and Test Data for AutoML experiment

Here we create the train and test datasets for the hardware performance dataset. We also register the datasets in your workspace under a name so that they can be accessed from the remote compute.

In [None]:
# Data source
data = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/machineData.csv"

# Create dataset from the url
dataset = Dataset.Tabular.from_delimited_files(data)

# Split the dataset into train and test datasets
train_dataset, test_dataset = dataset.random_split(percentage=0.8, seed=223)

# Register the train dataset with your workspace
train_dataset.register(workspace = ws, name = 'hardware_performance_train_dataset',
                       description = 'hardware performance training data',
                       create_new_version=True)

# Register the test dataset with your workspace
test_dataset.register(workspace = ws, name = 'hardware_performance_test_dataset',
                      description = 'hardware performance test data',
                      create_new_version=True)

# Separate the features from the label column (ERP) in the train dataset
X_train = train_dataset.drop_columns(columns=['ERP'])
y_train = train_dataset.keep_columns(columns=['ERP'], validate=True)

# Drop the label column from the test dataset
X_test = test_dataset.drop_columns(columns=['ERP'])

# Display the top rows in the train dataset
X_train.take(5).to_pandas_dataframe()
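
To build intuition for what a seeded random split does, here is a minimal pure-Python sketch. It assumes each row is assigned independently to the first split with probability `percentage`, so the resulting sizes are only approximately 80/20. The function name and the row data are hypothetical, and the SDK's own splitting logic may differ.

```python
import random


def seeded_random_split(rows, percentage, seed):
    """Assign each row to the first split with probability `percentage`."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        if rng.random() < percentage:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))  # stand-in for dataset records
train_rows, test_rows = seeded_random_split(rows, percentage=0.8, seed=223)
same_train, same_test = seeded_random_split(rows, percentage=0.8, seed=223)

print(len(train_rows), len(test_rows))
print(train_rows == same_train)  # same seed, same split
```

Fixing the seed matters here because the explanation run later loads the registered train and test datasets by name, so both runs should see the same partition.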

## Train

Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|classification or regression|
|**primary_metric**|This is the metric that you want to optimize. Regression supports the following primary metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|
|**n_cross_validations**|Number of cross validation splits.|
|**X**|(sparse) array-like, shape = [n_samples, n_features]|
|**y**|(sparse) array-like, shape = [n_samples, ], target values.|

**_You can find more information about primary metrics_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#primary-metric)
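
To make the regression metrics listed above more concrete, the following sketch computes them by hand in plain Python on a small made-up sample. It assumes that "normalized" means dividing the error by the range of the true values; AutoML's own implementation may differ in details, so treat this as an illustration rather than a reference.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_correlation(y_true, y_pred):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(average_ranks(y_true), average_ranks(y_pred))


def r2_score(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def normalized_rmse(y_true, y_pred):
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
    return rmse / (max(y_true) - min(y_true))  # assumption: normalize by range


def normalized_mae(y_true, y_pred):
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / (max(y_true) - min(y_true))  # assumption: normalize by range


# Made-up values for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]

print('spearman_correlation:', spearman_correlation(y_true, y_pred))
print('r2_score:', r2_score(y_true, y_pred))
print('normalized_rmse:', normalized_rmse(y_true, y_pred))
print('normalized_mae:', normalized_mae(y_true, y_pred))
```

In this sample the predictions are not exact, yet they preserve the ordering of the true values, so the Spearman correlation is perfect while the error-based metrics are not. That is worth keeping in mind when choosing `spearman_correlation` as the primary metric.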

In [None]:
automl_settings = {
    "iteration_timeout_minutes": 5,
    "iterations": 10,
    "n_cross_validations": 2,
    "primary_metric": 'spearman_correlation',
    "preprocess": True,
    "max_concurrent_iterations": 1,
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(task = 'regression',
                             debug_log = 'automl_errors_model_exp.log',
                             run_configuration=conda_run_config,
                             X = X_train,
                             y = y_train,
                             **automl_settings
                            )
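
The `n_cross_validations` setting above controls how many cross-validation splits are used. As a simplified picture of what a split looks like, the sketch below produces plain, unshuffled k-fold index sets. The helper `k_fold_indices` is hypothetical, and AutoML may shuffle or otherwise partition the data differently.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=7, n_splits=2))
for train, validation in folds:
    print('train:', train, 'validation:', validation)
```

Each sample lands in the validation set exactly once, so the metric reported for a pipeline reflects every row of the training data.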

Call the `submit` method on the experiment object and pass the `AutoMLConfig` object. Depending on the data and the number of iterations, this can run for a while.
In this example, we specify `show_output = True` to print currently running iterations to the console.

In [None]:
remote_run = experiment.submit(automl_config, show_output = True)

In [None]:
remote_run

## Results

#### Widget for Monitoring Runs

The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

In [None]:
from azureml.widgets import RunDetails
RunDetails(remote_run).show() 

## Explanations
This section will walk you through the workflow to compute model explanations for an AutoML model on your remote compute.

### Retrieve any AutoML Model for explanations

Below we select an AutoML pipeline from our iterations. The `get_output` method returns an AutoML run and the fitted model for the last invocation. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [None]:
automl_run, fitted_model = remote_run.get_output(iteration=5)

### Set up model explanation run on the remote compute
The following section provides details on how to set up an AzureML experiment to run model explanations for an AutoML model on your remote compute.

#### Sample script used for computing explanations
View the sample script for computing the model explanations for your AutoML model on remote compute.

In [None]:
with open('train_explainer.py', 'r') as cefr:
    print(cefr.read())

#### Substitute values in your sample script
The following cell shows how to replace the placeholder values in the sample script so that it matches your experiment and dataset.

In [None]:
import shutil

# create script folder
script_folder = './sample_projects/automl-regression-computer-hardware'
if not os.path.exists(script_folder):
    os.makedirs(script_folder)

# Copy the sample script to script folder.
shutil.copy('train_explainer.py', script_folder)

# Create the explainer script that will run on the remote compute.
script_file_name = script_folder + '/train_explainer.py'

# Open the sample script for modification
with open(script_file_name, 'r') as cefr:
    content = cefr.read()

# Replace the values in train_explainer.py file with the appropriate values
content = content.replace('<<experimnet_name>>', automl_run.experiment.name) # your experiment name.
content = content.replace('<<run_id>>', automl_run.id) # Run-id of the AutoML run for which you want to explain the model.
content = content.replace('<<target_column_name>>', 'ERP') # Your target column name
content = content.replace('<<task>>', 'regression') # Training task type
# Name of your training dataset registered with your workspace
content = content.replace('<<train_dataset_name>>', 'hardware_performance_train_dataset')
# Name of your test dataset registered with your workspace
content = content.replace('<<test_dataset_name>>', 'hardware_performance_test_dataset')

# Write sample file into your script folder.
with open(script_file_name, 'w') as cefw:
    cefw.write(content)
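
A misspelled or forgotten placeholder would be silently left in the script and only surface when the remote run fails. One way to guard against that is to check for leftover `<<...>>` tokens after substitution. The sketch below uses a hypothetical helper `fill_placeholders` and a made-up two-line template; it is not the actual content of `train_explainer.py`.

```python
import re

# Matches tokens of the form <<name>>.
PLACEHOLDER = re.compile(r'<<\w+>>')


def fill_placeholders(template, values):
    """Replace each <<name>> token and fail loudly if any token is left over."""
    for name, value in values.items():
        template = template.replace('<<' + name + '>>', value)
    leftover = PLACEHOLDER.findall(template)
    if leftover:
        raise ValueError('Unfilled placeholders: ' + ', '.join(sorted(set(leftover))))
    return template


# Made-up template for illustration only.
template = "task = '<<task>>'\ntarget = '<<target_column_name>>'\n"
filled = fill_placeholders(template, {'task': 'regression',
                                      'target_column_name': 'ERP'})
print(filled)
```

The same check could be applied to `content` before it is written back to the script folder.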

#### Create conda configuration for model explanations experiment
We need `azureml-explain-model`, `azureml-train-automl` and `azureml-core` packages for computing model explanations for your AutoML model on remote compute.

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
import pkg_resources

# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set compute target to AmlCompute
conda_run_config.target = compute_target
conda_run_config.environment.docker.enabled = True
azureml_pip_packages = [
    'azureml-train-automl', 'azureml-core', 'azureml-explain-model'
]

# specify CondaDependencies obj
conda_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['scikit-learn', 'numpy','py-xgboost<=0.80'],
    pip_packages=azureml_pip_packages)

#### Submit the experiment for model explanations
Submit the experiment with the above run configuration (`conda_run_config`) and the sample script for computing explanations.

In [None]:
# Now submit a run on AmlCompute for model explanations
from azureml.core.script_run_config import ScriptRunConfig

script_run_config = ScriptRunConfig(source_directory=script_folder,
                                    script='train_explainer.py',
                                    run_config=conda_run_config)

run = experiment.submit(script_run_config)

# Show run details
run

In [None]:
%%time
# Shows output of the run on stdout.
run.wait_for_completion(show_output=True)

### Feature importance and explanation dashboard
In this section we describe how you can download the explanation results from the explanations experiment and visualize the feature importance for your AutoML model. 

#### Setup for visualizing the model explanation results
To visualize the explanation results for the *fitted_model*, we need to perform the following step:
1. Featurize test data samples.

The *explainer_setup_class* object created below contains the structures from the above list.

In [None]:
from azureml.train.automl.automl_explain_utilities import AutoMLExplainerSetupClass, automl_setup_model_explanations
explainer_setup_class = automl_setup_model_explanations(fitted_model, 'regression', X_test=X_test)

#### Download engineered feature importance from artifact store
You can use *ExplanationClient* to download the engineered feature explanations from the artifact store of the *automl_run*. You can also use ExplanationDashboard to view the dashboard visualization of the feature importance values of the engineered features.

In [None]:
from azureml.explain.model._internal.explanation_client import ExplanationClient
from azureml.contrib.explain.model.visualize import ExplanationDashboard
client = ExplanationClient.from_run(automl_run)
engineered_explanations = client.download_model_explanation(raw=False)
print(engineered_explanations.get_feature_importance_dict())
ExplanationDashboard(engineered_explanations, explainer_setup_class.automl_estimator, explainer_setup_class.X_test_transform)

#### Download raw feature importance from artifact store
You can use *ExplanationClient* to download the raw feature explanations from the artifact store of the *automl_run*. You can also use ExplanationDashboard to view the dashboard visualization of the feature importance values of the raw features.

In [None]:
raw_explanations = client.download_model_explanation(raw=True)
print(raw_explanations.get_feature_importance_dict())
ExplanationDashboard(raw_explanations, explainer_setup_class.automl_pipeline, explainer_setup_class.X_test_raw)
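
When the printed dictionary is long, it can help to look only at the most influential features. The sketch below assumes the dictionary maps feature names to numeric importance values and ranks them by magnitude. The helper `top_features` and the example values are made up for illustration and are not results from this experiment.

```python
def top_features(importance, k):
    """Return the k (feature, importance) pairs with the largest magnitude."""
    ranked = sorted(importance.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Made-up values for illustration only; not output of this notebook.
example_importance = {
    'feature_a': 0.12,
    'feature_b': 0.57,
    'feature_c': 0.03,
    'feature_d': 0.31,
}

for name, value in top_features(example_importance, k=3):
    print('%s: %.2f' % (name, value))
```

With a real run you could pass the dictionary returned by `get_feature_importance_dict()` in place of `example_importance`.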