Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.


# Azure Machine Learning Pipeline with AutoMLStep
This notebook demonstrates the use of AutoMLStep in Azure Machine Learning Pipeline.

## Introduction
In this example we use scikit-learn's [digits dataset](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) to show how AutoML can be used for a simple classification problem.

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Create a new AmlCompute target or attach an existing one to a workspace.
3. Configure AutoML using `AutoMLConfig`.
4. Use `AutoMLStep` in a pipeline.
5. Train the model using AmlCompute.
6. Explore the results.
7. Test the best fitted model.

In addition, this notebook showcases the following features:
- **Parallel** executions for iterations
- **Asynchronous** tracking of progress
- Retrieving models for any iteration or logged metric
- Specifying AutoML settings as `**kwargs`

## Azure Machine Learning and Pipeline SDK-specific imports

In [None]:
import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

from azureml.train.automl import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

## Initialize Workspace
Initialize a workspace object from the persisted configuration. Make sure the config file is present at `.\config.json`.

In [None]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')
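
`Workspace.from_config()` reads a small JSON file. As a minimal sketch, assuming the file holds the three keys `subscription_id`, `resource_group`, and `workspace_name`, the snippet below writes such a file to a temporary directory and checks that the keys are present. It uses only the standard library; the helper `missing_config_keys` and all values are hypothetical placeholders, not part of the SDK.

```python
import json
import os
import tempfile

# Keys a workspace config.json is assumed to contain.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(path):
    """Return the required keys that are absent or empty in the JSON file at `path`."""
    with open(path) as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Placeholder values for illustration only.
example_config = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "my-resource-group",
    "workspace_name": "my-workspace",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")
    with open(config_path, "w") as f:
        json.dump(example_config, f, indent=4)
    missing = missing_config_keys(config_path)

print("Missing keys:", missing)
```

A check like this can be run before `Workspace.from_config()` to get a clearer message when the file is incomplete.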

## Create an Azure ML experiment
Let's create an experiment named "automlstep-classification" and a folder to hold the training scripts. The script runs will be recorded under this experiment in Azure.


In [None]:
# Choose a name for the run history container in the workspace.
experiment_name = 'automlstep-classification'
project_folder = './project'

experiment = Experiment(ws, experiment_name)
experiment

### Create or Attach existing AmlCompute
You will need a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you reuse the `AmlCompute` cluster named `cpu-cluster` if it already exists in the workspace, and create it otherwise.

In [None]:
# Choose a name for your cluster.
amlcompute_cluster_name = "cpu-cluster"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets
if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing compute target.')
    compute_target = cts[amlcompute_cluster_name]
    
if not found:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2", # for GPU, use "STANDARD_NC6"
                                                                #vm_priority = 'lowpriority', # optional
                                                                max_nodes = 4)

    # Create the cluster.
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min_node_count is provided, it will use the scale settings for the cluster.
    compute_target.wait_for_completion(show_output = True, min_node_count = 1, timeout_in_minutes = 10)
    
    # For a more detailed view of current AmlCompute status, use get_status().

## Prepare and Point to Data
For remote executions, the data must be accessible from the remote compute.
This can be done by uploading the data to a datastore; here we use the workspace's default datastore.
In this example, we upload scikit-learn's [load_digits](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) data.

In [None]:
data_train = datasets.load_digits()

if not os.path.isdir('data'):
    os.mkdir('data')
    
if not os.path.exists(project_folder):
    os.makedirs(project_folder)
    
pd.DataFrame(data_train.data).to_csv("data/X_train.tsv", index=False, header=False, quoting=csv.QUOTE_ALL, sep="\t")
pd.DataFrame(data_train.target).to_csv("data/y_train.tsv", index=False, header=False, sep="\t")

ds = ws.get_default_datastore()
ds.upload(src_dir='./data', target_path='bai_data', overwrite=True, show_progress=True)

from azureml.data.data_reference import DataReference      
input_data = DataReference(datastore=ds, 
                           data_reference_name="input_data_reference",
                           path_on_datastore='bai_data',
                           mode='download',
                           path_on_compute='/tmp/azureml_runs',
                           overwrite=False)
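
The `get_data.py` script in this notebook reads its files from `/tmp/azureml_runs/bai_data/`, which is `path_on_compute` followed by `path_on_datastore`. The sketch below spells out that composition with the standard library, assuming this layout holds for your SDK version. The helper `path_on_compute_for` is hypothetical and only illustrates how the two settings relate.

```python
import posixpath


def path_on_compute_for(path_on_compute, path_on_datastore, file_name):
    """Join the download root, the datastore folder, and a file name (POSIX style)."""
    return posixpath.join(path_on_compute, path_on_datastore, file_name)


# Values taken from the DataReference defined in this notebook.
x_train_path = path_on_compute_for("/tmp/azureml_runs", "bai_data", "X_train.tsv")
y_train_path = path_on_compute_for("/tmp/azureml_runs", "bai_data", "y_train.tsv")

print(x_train_path)
print(y_train_path)
```

If you change `path_on_compute` or `path_on_datastore`, the paths hard-coded in `get_data.py` need to change with them.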

In [None]:
# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set compute target to AmlCompute
#conda_run_config.target = compute_target

conda_run_config.environment.docker.enabled = True
conda_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE

cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], 
                              conda_packages=['numpy', 'py-xgboost'], 
                              pin_sdk_version=False)
conda_run_config.environment.python.conda_dependencies = cd

print('run config is ready')

In [None]:
%%writefile $project_folder/get_data.py

import pandas as pd

def get_data():
    X_train = pd.read_csv("/tmp/azureml_runs/bai_data/X_train.tsv", delimiter="\t", header=None, quotechar='"')
    y_train = pd.read_csv("/tmp/azureml_runs/bai_data/y_train.tsv", delimiter="\t", header=None, quotechar='"')

    return { "X" : X_train.values, "y" : y_train.values.flatten() }
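
The feature file is written with `quoting=csv.QUOTE_ALL`, so every value is wrapped in double quotes, which is why `get_data()` passes `quotechar='"'` when reading it back. The following sketch reproduces that round trip on two made-up rows. It uses the standard-library `csv` module as a stand-in for pandas, so it illustrates the file format rather than the exact calls used in this notebook.

```python
import csv
import io

# Two made-up feature rows, for illustration only.
rows = [[0.0, 5.0, 13.0], [0.0, 0.0, 12.0]]

# Write tab-separated values with every field quoted.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
raw_text = buffer.getvalue()

# Read the text back; the reader strips the quotes, leaving strings to convert.
reader = csv.reader(io.StringIO(raw_text), delimiter="\t", quotechar='"')
parsed = [[float(value) for value in row] for row in reader]

print(repr(raw_text))
print(parsed)
```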


## Set up AutoMLConfig for Training

You can specify `automl_settings` as `**kwargs` as well. Also note that you can use a `get_data()` function for local executions too.

**Note:** When using AmlCompute, you can't pass NumPy arrays directly to the fit method.

|Property|Description|
|-|-|
|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|
|**n_cross_validations**|Number of cross validation splits.|
|**max_concurrent_iterations**|Maximum number of iterations that are executed in parallel. For an AmlCompute cluster, this should not exceed the cluster's maximum number of nodes.|

In [None]:
automl_settings = {
    "iteration_timeout_minutes": 5,
    "iterations": 20,
    "n_cross_validations": 5,
    "primary_metric": 'AUC_weighted',
    "preprocess": False,
    "max_concurrent_iterations": 3,
    "verbosity": logging.INFO
}
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             path = project_folder,
                             compute_target=compute_target,
                             run_configuration=conda_run_config,
                             data_script = project_folder + "/get_data.py",
                             **automl_settings
                            )
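
Before submitting, it can help to sanity-check the settings. The sketch below is a hypothetical helper, not part of the SDK. It assumes that concurrency should not exceed the number of cluster nodes (`max_nodes = 4` in this notebook). It also computes a rough budget for the case where every iteration hits its timeout, ignoring setup and queueing overhead.

```python
import math


def check_automl_settings(settings, max_nodes):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if settings["iterations"] < 1:
        problems.append("iterations must be at least 1")
    if settings["iteration_timeout_minutes"] <= 0:
        problems.append("iteration_timeout_minutes must be positive")
    if settings["n_cross_validations"] < 2:
        problems.append("n_cross_validations must be at least 2")
    # Assumption: concurrency should not exceed the number of cluster nodes.
    if settings["max_concurrent_iterations"] > max_nodes:
        problems.append("max_concurrent_iterations exceeds max_nodes")
    return problems


def rough_timeout_budget_minutes(settings):
    """Batches of concurrent iterations multiplied by the per-iteration timeout."""
    batches = math.ceil(settings["iterations"] / settings["max_concurrent_iterations"])
    return batches * settings["iteration_timeout_minutes"]


# Same numeric values as the automl_settings dictionary in this notebook.
example_settings = {
    "iteration_timeout_minutes": 5,
    "iterations": 20,
    "n_cross_validations": 5,
    "max_concurrent_iterations": 3,
}

problems = check_automl_settings(example_settings, max_nodes=4)
budget = rough_timeout_budget_minutes(example_settings)
print("Problems:", problems)
print("Rough timeout budget (minutes):", budget)
```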

In this notebook the configuration is not submitted on its own. It is wrapped in an `AutoMLStep`, and the resulting pipeline is passed to the `submit` method on the experiment object. For remote runs the execution is asynchronous, so iterations are populated as they complete. You can interact with the widgets and models while the experiment is still running to retrieve the best model up to that point. Once you are satisfied with the model, you can cancel a particular iteration or the whole run.

## Define AutoMLStep

In [None]:
from azureml.pipeline.core import PipelineData, TrainingOutput

metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                            datastore=ds,
                            pipeline_output_name=metrics_output_name,
                            training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                          datastore=ds,
                          pipeline_output_name=best_model_output_name,
                          training_output=TrainingOutput(type='Model'))

In [None]:
automl_step = AutoMLStep(
    name='automl_module',
    experiment=experiment,
    automl_config=automl_config,
    inputs=[input_data],
    outputs=[metrics_data, model_data],
    allow_reuse=True)

In [None]:
from azureml.pipeline.core import Pipeline
pipeline = Pipeline(
    description="pipeline_with_automlstep",
    workspace=ws,    
    steps=[automl_step])

In [None]:
pipeline_run = experiment.submit(pipeline)

In [None]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

In [None]:
pipeline_run.wait_for_completion()

## Examine Results

### Retrieve the metrics of all child runs
The outputs of the run above can be used as inputs to other steps in a pipeline. In this tutorial, we examine the outputs by retrieving the output data and running some tests.

In [None]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

In [None]:
import json
with open(metrics_output._path_on_datastore) as f:  
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df
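
Once the metrics are loaded, a common next step is to pick the child run with the best primary metric. The sketch below assumes the JSON maps each child run ID to a dictionary of metric names, each holding a single-element list. That layout is an assumption, so inspect your own file first. The run IDs and scores are made up for illustration, and `best_run` is a hypothetical helper.

```python
import json

# Made-up metrics in the assumed layout: run ID -> metric name -> [value].
sample_metrics_json = json.dumps({
    "run_0": {"AUC_weighted": [0.981], "accuracy": [0.902]},
    "run_1": {"AUC_weighted": [0.994], "accuracy": [0.951]},
    "run_2": {"AUC_weighted": [0.989], "accuracy": [0.960]},
})


def best_run(metrics, metric_name):
    """Return (run_id, score) for the run with the highest value of `metric_name`."""
    scores = {
        run_id: values[metric_name][0]
        for run_id, values in metrics.items()
        if values.get(metric_name)  # skip runs that did not log this metric
    }
    if not scores:
        raise ValueError("no run logged the metric %r" % metric_name)
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]


metrics = json.loads(sample_metrics_json)
best_id, best_score = best_run(metrics, "AUC_weighted")
print(best_id, best_score)
```

Taking the maximum suits metrics where higher is better, such as `AUC_weighted`. An error-style metric would need `min` instead.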

### Retrieve the Best Model

In [None]:
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)

In [None]:
import pickle

# Load the downloaded best model from the local copy of the pipeline output.
with open(best_model_output._path_on_datastore, "rb") as f:
    best_model = pickle.load(f)
best_model

### Test the Model
#### Load Test Data

In [None]:
digits = datasets.load_digits()
X_test = digits.data[:10, :]
y_test = digits.target[:10]
images = digits.images[:10]

#### Testing Best Model

In [None]:
# Randomly select digits and test.
for index in np.random.choice(len(y_test), 3, replace = False):
   print(index)
   predicted = best_model.predict(X_test[index:index + 1])[0]
   label = y_test[index]
   title = "Label value = %d  Predicted value = %d " % (label, predicted)
   fig = plt.figure(1, figsize=(3,3))
   ax1 = fig.add_axes((0,0,.8,.8))
   ax1.set_title(title)
   plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')
   plt.show()
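
Beyond inspecting three digits visually, you can summarize the predictions as a single accuracy figure. The sketch below is a minimal, dependency-free version of that calculation. The `accuracy` helper is hypothetical, and the labels and predictions are made up for illustration rather than produced by the model.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Made-up labels and predictions: nine of ten match.
labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
predictions = [0, 1, 2, 3, 4, 9, 6, 7, 8, 9]

score = accuracy(labels, predictions)
print("Accuracy:", score)
```

With the objects from this notebook, the call would look like `accuracy(list(y_test), list(best_model.predict(X_test)))`. Note that the ten test rows here are also part of the training data, so the result is not an unbiased estimate of model quality.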