Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Auto ML 04: Remote Execution with Text Data from Azure Blob Storage

In this example we use the [Burning Man 2016 dataset](https://innovate.burningman.org/datasets-page/) to showcase how you can use AutoML to handle text data from Azure Blob Storage.

Make sure you have run the [00.configuration](00.configuration.ipynb) notebook before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Attach an existing DSVM to a workspace.
3. Configure AutoML using `AutoMLConfig`.
4. Train the model using the DSVM.
5. Explore the results.
6. Test the best fitted model.

In addition, this notebook showcases the following features:
- **Parallel** executions for iterations
- **Asynchronous** tracking of progress
- **Cancellation** of individual iterations or the entire run
- Retrieving models for any iteration or logged metric
- Specifying AutoML settings as `**kwargs`
- Handling **text** data using the `preprocess` flag


## Create an Experiment

As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [None]:
import logging
import os
import random

from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import numpy as np
import pandas as pd
from sklearn import datasets

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun

In [None]:
ws = Workspace.from_config()

# Choose a name for the run history container in the workspace.
experiment_name = 'automl-remote-dsvm-blobstore'
project_folder = './sample_projects/automl-remote-dsvm-blobstore'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data=output, index=['']).T

## Diagnostics

Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [None]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics = True)

## Attach a Remote Linux DSVM
To use a remote Docker compute target:
1. Create a Linux DSVM in Azure, following these [quick instructions](https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/how-to-create-dsvm-hdi). Use the Ubuntu flavor (not CentOS). Make sure that disk space is available under `/tmp`, because AutoML creates files under `/tmp/azureml_run`s. The DSVM should have more cores than the number of parallel runs that you plan to enable, and at least 4 GB of memory per core.
2. Enter the IP address, user name and password below.

**Note:** By default, SSH runs on port 22 and you don't need to change the port number below. If you've configured SSH to use a different port, change `dsvm_ssh_port` accordingly. [Read more](https://render.githubusercontent.com/documentation/sdk/ssh-issue.md) on changing SSH ports for security reasons.

In [None]:
from azureml.core.compute import RemoteCompute
import time

# Add your VM information below
# If a compute with the specified compute_name already exists, it will be used and the dsvm_ip_addr, dsvm_ssh_port, 
# dsvm_username and dsvm_password will be ignored.
compute_name  = 'mydsvmb'
dsvm_ip_addr  = '<<ip_addr>>'
dsvm_ssh_port = 22
dsvm_username = '<<username>>'
dsvm_password = '<<password>>'

if compute_name in ws.compute_targets:
    print('Using existing compute.')
    dsvm_compute = ws.compute_targets[compute_name]
else:
    RemoteCompute.attach(workspace=ws, name=compute_name, address=dsvm_ip_addr, username=dsvm_username, password=dsvm_password, ssh_port=dsvm_ssh_port)

    while ws.compute_targets[compute_name].provisioning_state == 'Creating':
        time.sleep(1)

    dsvm_compute = ws.compute_targets[compute_name]
    
    if dsvm_compute.provisioning_state == 'Failed':
        print('Attach failed.')
        print(dsvm_compute.provisioning_errors)
        dsvm_compute.delete()
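
If the attach step fails, one common cause is that the DSVM is not reachable on the configured SSH port. The sketch below uses only the Python standard library to test whether a TCP connection can be opened. The helper name `is_port_open` is hypothetical and is not part of the Azure ML SDK; it only checks reachability, not whether the user name and password are valid.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `is_port_open(dsvm_ip_addr, dsvm_ssh_port)` could be called before `RemoteCompute.attach` to catch a wrong address or port early.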

## Create Get Data File
For remote executions you should author a `get_data.py` file containing a `get_data()` function. This file should be in the root directory of the project. You can encapsulate code to read data from either blob storage or a local disk in this file.
In this example, the `get_data()` function returns a [dictionary](README.md#getdata).

In [None]:
if not os.path.exists(project_folder):
    os.makedirs(project_folder)

In [None]:
%%writefile $project_folder/get_data.py

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

def get_data():
    # Load Burning Man 2016 data.
    df = pd.read_csv("https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv",
                     delimiter="\t", quotechar='"')
    # Get integer labels.
    le = LabelEncoder()
    le.fit(df["Label"].values)
    y = le.transform(df["Label"].values)
    X = df.drop(["Label"], axis=1)

    X_train, _, y_train, _ = train_test_split(X, y, test_size = 0.1, random_state = 42)

    return { "X" : X_train, "y" : y_train }
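
The `get_data()` function above converts the string labels to integers with scikit-learn's `LabelEncoder`. As a rough illustration of what that step does, the plain-Python sketch below assigns integer codes to the distinct labels in sorted order and maps them back again. The function names and the category names are hypothetical and chosen only for this example; this is not the scikit-learn implementation.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return classes, {label: index for index, label in enumerate(classes)}


def transform(labels, mapping):
    """Replace each label with its integer code."""
    return [mapping[label] for label in labels]


def inverse_transform(codes, classes):
    """Recover the original labels from integer codes."""
    return [classes[code] for code in codes]


# Hypothetical category names, for illustration only.
labels = ["Workshop", "Party", "Workshop", "Food"]
classes, mapping = fit_label_encoder(labels)
codes = transform(labels, mapping)
print(codes)                              # [2, 1, 2, 0]
print(inverse_transform(codes, classes))  # the original labels
```

Because the codes depend only on the set of distinct labels, fitting the encoder again on the same `Label` column yields the same mapping, which is what allows predictions to be converted back to strings in the testing step.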

### View data

You can execute the `get_data()` function locally to view the training data.

In [None]:
%run  $project_folder/get_data.py
data_dict = get_data()
df = data_dict["X"]
y = data_dict["y"]
pd.set_option('display.max_colwidth', 15)
df['Label'] = pd.Series(y, index=df.index)
df.head()

## Configure AutoML <a class="anchor" id="Instatiate-AutoML-Remote-DSVM"></a>

You can specify `automl_settings` as `**kwargs` as well. Also note that you can use a `get_data()` function for local executions too.

**Note:** When using a remote DSVM, you can't pass NumPy arrays directly to the fit method.

|Property|Description|
|-|-|
|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>balanced_accuracy</i><br><i>average_precision_score_weighted</i><br><i>precision_score_weighted</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|
|**n_cross_validations**|Number of cross validation splits.|
|**max_concurrent_iterations**|Maximum number of iterations that would be executed in parallel. This should be less than the number of cores on the DSVM.|
|**preprocess**|Setting this to *True* enables AutoML to perform preprocessing on the input to handle *missing data*, and to perform some common *feature extraction*.|
|**max_cores_per_iteration**|Indicates how many cores on the compute target would be used to train a single pipeline.<br>Default is *1*; you can set it to *-1* to use all cores.|

In [None]:
automl_settings = {
    "iteration_timeout_minutes": 60,
    "iterations": 4,
    "n_cross_validations": 5,
    "primary_metric": 'AUC_weighted',
    "preprocess": True,
    "max_cores_per_iteration": 2
}

automl_config = AutoMLConfig(task = 'classification',
                             path = project_folder,
                             compute_target = dsvm_compute,
                             data_script = project_folder + "/get_data.py",
                             **automl_settings
                            )
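
The sizing guidance in this notebook (fewer parallel iterations than cores, and at least 4 GB of memory per core) can be checked before submitting a run. The helper below is a hypothetical sketch, not part of the Azure ML SDK, and it assumes you already know the core count and memory of your DSVM.

```python
def check_dsvm_capacity(cores, memory_gb, max_concurrent_iterations):
    """Return a list of warnings; an empty list means no rule was violated."""
    warnings = []
    # Rule from the settings table: parallel iterations < number of cores.
    if max_concurrent_iterations >= cores:
        warnings.append(
            "max_concurrent_iterations should be less than the number of cores")
    # Rule from the DSVM setup notes: at least 4 GB of memory per core.
    if memory_gb < 4 * cores:
        warnings.append("the DSVM should have at least 4 GB of memory per core")
    return warnings
```

For instance, `check_dsvm_capacity(cores=8, memory_gb=32, max_concurrent_iterations=4)` returns an empty list, while requesting 8 parallel iterations on the same machine produces a warning.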


## Train the Models <a class="anchor" id="Training-the-model-Remote-DSVM"></a>

Call the `submit` method on the experiment object and pass the run configuration. For remote runs the execution is asynchronous, so you will see the iterations get populated as they complete. You can interact with the widgets and models even when the experiment is running to retrieve the best model up to that point. Once you are satisfied with the model, you can cancel a particular iteration or the whole run.

In [None]:
remote_run = experiment.submit(automl_config)

## Exploring the Results <a class="anchor" id="Exploring-the-Results-Remote-DSVM"></a>
#### Widget for Monitoring Runs

The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

You can click on a pipeline to see run properties and output logs. Logs are also available on the DSVM under `/tmp/azureml_run/{iterationid}/azureml-logs`.

**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

In [None]:
from azureml.widgets import RunDetails
RunDetails(remote_run).show() 

In [None]:
# Wait until the run finishes.
remote_run.wait_for_completion(show_output = True)


#### Retrieve All Child Runs
You can also use SDK methods to fetch all the child runs and see individual metrics that we log. 

In [None]:
children = list(remote_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
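
The `metricslist` dictionary built above maps each iteration number to a dictionary of metric values. A minimal sketch of how such a structure could be searched for the best iteration on a given metric follows. The helper `best_iteration` is hypothetical, and the numbers are placeholders rather than results from a real run.

```python
def best_iteration(metrics_by_iteration, metric, higher_is_better=True):
    """Return (iteration, value) with the best value of metric, or None."""
    scored = [(iteration, metrics[metric])
              for iteration, metrics in metrics_by_iteration.items()
              if metric in metrics]
    if not scored:
        return None
    # Scores such as accuracy are maximized; losses are minimized.
    pick = max if higher_is_better else min
    return pick(scored, key=lambda pair: pair[1])


# Placeholder values, not results from a real run.
example = {
    0: {"accuracy": 0.50, "log_loss": 1.20},
    1: {"accuracy": 0.75, "log_loss": 0.90},
    2: {"accuracy": 0.70},
}
print(best_iteration(example, "accuracy"))                          # (1, 0.75)
print(best_iteration(example, "log_loss", higher_is_better=False))  # (1, 0.9)
```

Iterations that did not log the requested metric are skipped, which matters when a run is cancelled before every iteration completes.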

## Cancelling Runs
You can cancel ongoing remote runs using the `cancel` and `cancel_iteration` functions.

In [None]:
# Cancel the ongoing experiment and stop scheduling new iterations.
remote_run.cancel()

# Cancel iteration 1 and move onto iteration 2.
# remote_run.cancel_iteration(1)

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [None]:
best_run, fitted_model = remote_run.get_output()
print(best_run)
print(fitted_model)

#### Best Model Based on Any Other Metric
Show the run and the model with the best (highest) `accuracy` value:

In [None]:
# lookup_metric = "accuracy"
# best_run, fitted_model = remote_run.get_output(metric = lookup_metric)

#### Model from a Specific Iteration

In [None]:
iteration = 0
zero_run, zero_model = remote_run.get_output(iteration = iteration)

### Testing the Fitted Model <a class="anchor" id="Testing-the-Fitted-Model-Remote-DSVM"></a>


In [None]:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from pandas_ml import ConfusionMatrix

# Reload the full dataset (the same file that get_data.py reads).
df = pd.read_csv("https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv",
                 delimiter="\t", quotechar='"')

# Get integer labels.
le = LabelEncoder()
le.fit(df["Label"].values)
y = le.transform(df["Label"].values)
X = df.drop(["Label"], axis=1)

# Repeat the split from get_data.py (same test_size and random_state) and keep the held-out 10%.
_, X_test, _, y_test = train_test_split(X, y, test_size=0.1, random_state=42)


ypred = fitted_model.predict(X_test.values)


ypred_strings = le.inverse_transform(ypred)
ytest_strings = le.inverse_transform(y_test)

cm = ConfusionMatrix(ytest_strings, ypred_strings)

print(cm)

cm.plot()
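
If `pandas_ml` is not available in your environment, the same counts can be obtained with the standard library alone. The sketch below is an assumption-labeled alternative, not the method used in this notebook: the helper names `confusion_counts` and `accuracy` are hypothetical, and they simply tally (true label, predicted label) pairs.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    if len(y_true) == 0:
        raise ValueError("y_true must not be empty")
    counts = confusion_counts(y_true, y_pred)
    correct = sum(n for (true, pred), n in counts.items() if true == pred)
    return correct / len(y_true)
```

With the variables from the cell above, `confusion_counts(ytest_strings, ypred_strings)` would give the number of test rows for each pair of true and predicted category. scikit-learn's `sklearn.metrics.confusion_matrix` is another option that returns the counts as an array.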