Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# AutoML 03:  Remote Execution using DSVM (Ubuntu)

In this example we use scikit-learn's [digits dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) to showcase how you can use AutoML for a simple classification problem.

Make sure you have executed the [00.configuration](00.configuration.ipynb) before running this notebook.

In this notebook you will see:
1. Creating an Experiment using an existing Workspace
2. Attaching an existing DSVM to a workspace
3. Instantiating AutoMLConfig 
4. Training the Model using the DSVM
5. Exploring the results
6. Testing the fitted model

In addition, this notebook showcases the following features:
- **Parallel** execution of iterations
- Asynchronous tracking of progress
- **Canceling** individual iterations or the entire run
- Retrieving models for any iteration or logged metric
- Specifying AutoML settings as **kwargs**


## Create Experiment

As part of the setup you have already created a workspace. For AutoML you need to create an <b>Experiment</b>. An <b>Experiment</b> is a named object in a <b>Workspace</b> that is used to run experiments.

In [None]:
import logging
import os
import random

from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import numpy as np
import pandas as pd
from sklearn import datasets

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun

In [None]:
ws = Workspace.from_config()

# choose a name for the run history container in the workspace
experiment_name = 'automl-remote-dsvm4'
# local folder that holds the files sent to the remote compute target
project_folder = './sample_projects/automl-remote-dsvm4'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data = output, index = ['']).T

## Diagnostics

Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [None]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics=True)

## Create a Remote Linux DSVM
**Note**: If creation fails with a message about Marketplace purchase eligibility, go to portal.azure.com, start creating a DSVM there, and select "Want to create programmatically" to enable programmatic creation. Once you have enabled it, you can exit without actually creating the VM.

**Note**: By default SSH runs on port 22, so you don't need to specify it. If for security reasons you switch to a different port (such as 5022), append the port number to the address. [Read more](https://render.githubusercontent.com/documentation/sdk/ssh-issue.md) on this.
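
As a minimal sketch of the address format described in the note, the helper below appends a non-default port to a host address and leaves the address unchanged for port 22. `with_ssh_port` is a hypothetical helper, not part of the Azure ML SDK, and the host name is a placeholder.

```python
def with_ssh_port(address, port=22):
    """Return the address with ':port' appended unless port is the SSH default."""
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    # Port 22 is the SSH default, so it does not need to be spelled out.
    if port == 22:
        return address
    return "{}:{}".format(address, port)


# Placeholder host name, for illustration only.
print(with_ssh_port("mydsvm.example.com"))
print(with_ssh_port("mydsvm.example.com", 5022))
```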

In [None]:
from azureml.core.compute import DsvmCompute

dsvm_name = 'mydsvm'
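try:
    # Reuse the DSVM if it is already attached to the workspace
    dsvm_compute = DsvmCompute(ws, dsvm_name)
    print('found existing dsvm.')
except Exception:
    # The lookup failed, so provision a new DSVM
    print('creating new dsvm.')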
    dsvm_config = DsvmCompute.provisioning_configuration(vm_size = "Standard_D2_v2")
    dsvm_compute = DsvmCompute.create(ws, name = dsvm_name, provisioning_configuration = dsvm_config)
    dsvm_compute.wait_for_completion(show_output = True)

## Create Get Data File
For remote executions you should author a get_data.py file containing a get_data() function. This file should be in the root directory of the project. In this file you can encapsulate the code that reads the data, whether from blob storage or from local disk.

In [None]:
if not os.path.exists(project_folder):
    os.makedirs(project_folder)

In [None]:
%%writefile $project_folder/get_data.py

from sklearn import datasets
from scipy import sparse
import numpy as np

def get_data():
    
    digits = datasets.load_digits()
    X_digits = digits.data[100:,:]
    y_digits = digits.target[100:]

    return { "X" : X_digits, "y" : y_digits }

## Instantiate AutoML <a class="anchor" id="Instatiate-AutoML-Remote-DSVM"></a>

You can specify automl_settings as **kwargs** as well. Also note that you can use the get_data() semantics for local executions too.

<i>Note: For Remote DSVM and Batch AI you cannot pass Numpy arrays directly to the fit method.</i>

|Property|Description|
|-|-|
|**primary_metric**|This is the metric that you want to optimize.<br> Classification supports the following primary metrics <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>balanced_accuracy</i><br><i>average_precision_score_weighted</i><br><i>precision_score_weighted</i>|
|**max_time_sec**|Time limit in seconds for each iteration|
|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data|
|**n_cross_validations**|Number of cross validation splits|
|**concurrent_iterations**|Maximum number of iterations executed in parallel. This should be less than the number of cores on the DSVM.|
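
The guidance for **concurrent_iterations** can be sketched as a small clamp. `cap_concurrency` is a hypothetical helper, not part of the SDK, and it assumes you already know the core count of the DSVM (the local machine's core count is not relevant here). The numbers in the example are illustrative.

```python
def cap_concurrency(requested, dsvm_cores):
    """Clamp a requested concurrency to fewer than the number of DSVM cores."""
    if requested < 1 or dsvm_cores < 1:
        raise ValueError("requested and dsvm_cores must both be at least 1")
    # Stay strictly below the core count, but never drop below one iteration.
    return max(1, min(requested, dsvm_cores - 1))


# Illustrative values: a request for 8 parallel iterations on a 4-core VM.
print(cap_concurrency(requested=8, dsvm_cores=4))
```

The result could then be passed as the `concurrent_iterations` entry of the settings dictionary.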

In [None]:
automl_settings = {
    "max_time_sec": 600,
    "iterations": 20,
    "n_cross_validations": 5,
    "primary_metric": 'AUC_weighted',
    "preprocess": False,
    "concurrent_iterations": 2,
    "verbosity": logging.INFO
}

automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             path=project_folder, 
                             compute_target = dsvm_compute,
                             data_script = project_folder + "/get_data.py",
                             **automl_settings
                            )


<b>Note</b> that the first run on a new DSVM may take several minutes while the environment is prepared.

In [None]:
remote_run = experiment.submit(automl_config, show_output=False)

## Exploring the Results

#### Loading executed runs
If you need to load a previously executed run, you can reconstruct it from its run ID using the `AutoMLRun` class imported earlier.

#### Widget for monitoring runs

The widget will sit on "loading" until the first iteration completes; then you will see an auto-updating graph and table. It refreshes once per minute, so you should see the graph update as child runs complete.

You can click on a pipeline to see run properties and output logs. Logs are also available on the DSVM under /tmp/azureml_run/{iterationid}/azureml-logs.

NOTE: The widget displays a link at the bottom. This links to a web UI where you can explore the individual run details.

In [None]:
from azureml.train.widgets import RunDetails
RunDetails(remote_run).show() 

In [None]:
# wait till the run finishes
remote_run.wait_for_completion(show_output = True)


#### Retrieve All Child Runs
You can also use SDK methods to fetch all the child runs and see the individual metrics that are logged.

In [None]:
children = list(remote_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}    
    metricslist[int(properties['iteration'])] = metrics

# Sort the columns (one per iteration) by iteration number
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata

## Canceling Runs

You can cancel ongoing remote runs using the *cancel()* and *cancel_iteration()* functions.

In [None]:
# Cancel the ongoing experiment and stop scheduling new iterations
# remote_run.cancel()

# Cancel iteration 1 and move onto iteration 2
# remote_run.cancel_iteration(1)

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The *get_output* method on remote_run returns the best run and the fitted model for the last invocation. There are overloads on *get_output* that allow you to retrieve the best run and fitted model for *any* logged metric or a particular *iteration*.

In [None]:
best_run, fitted_model = remote_run.get_output()
print(best_run)
print(fitted_model)

#### Best Model based on any other metric
Show the run/model which has the smallest `log_loss` value.

In [None]:
lookup_metric = "log_loss"
best_run, fitted_model = remote_run.get_output(metric = lookup_metric)
print(best_run)
print(fitted_model)
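
To make the selection rule concrete, here is a minimal sketch that runs on made-up metric values and does not touch the SDK. `best_iteration` is a hypothetical helper, not the implementation behind *get_output*; it assumes the caller states whether the metric should be minimized (as for `log_loss`) or maximized (as for `AUC_weighted`).

```python
def best_iteration(metrics_by_iteration, metric, minimize):
    """Return the iteration number with the best value for the given metric."""
    # Skip iterations that did not log the metric (for example, canceled ones).
    candidates = {
        iteration: metrics[metric]
        for iteration, metrics in metrics_by_iteration.items()
        if metric in metrics
    }
    if not candidates:
        raise KeyError("no iteration logged the metric {!r}".format(metric))
    pick = min if minimize else max
    return pick(candidates, key=candidates.get)


# Made-up values, for illustration only.
example_metrics = {
    0: {"log_loss": 0.42, "AUC_weighted": 0.95},
    1: {"log_loss": 0.31, "AUC_weighted": 0.97},
    2: {"AUC_weighted": 0.99},
}
print(best_iteration(example_metrics, "log_loss", minimize=True))
print(best_iteration(example_metrics, "AUC_weighted", minimize=False))
```

The input has the same shape as the `metricslist` dictionary built earlier: iteration number mapped to a dictionary of metric values.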

#### Model from a specific iteration
Show the run and model from the 3rd iteration.

In [None]:
iteration = 3
third_run, third_model = remote_run.get_output(iteration=iteration)
print(third_run)
print(third_model)

### Testing the Fitted Model <a class="anchor" id="Testing-the-Fitted-Model-Remote-DSVM"></a>

#### Load Test Data

In [None]:
digits = datasets.load_digits()
X_digits = digits.data[:10, :]
y_digits = digits.target[:10]
images = digits.images[:10]

#### Testing our best pipeline

In [None]:
# Randomly select digits and test
for index in np.random.choice(len(y_digits), 2):
    print(index)
    predicted = fitted_model.predict(X_digits[index:index + 1])[0]
    label = y_digits[index]
    title = "Label value = %d  Predicted value = %d " % (label, predicted)
    fig = plt.figure(1, figsize=(3,3))
    ax1 = fig.add_axes((0,0,.8,.8))
    ax1.set_title(title)
    plt.imshow(images[index], cmap=plt.cm.gray_r, interpolation='nearest')
    plt.show()
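
The loop above spot-checks two randomly chosen digits. As a complementary sketch, the fraction of correct predictions over the whole held-out set can be computed as shown below. `accuracy` is a hypothetical helper and the labels are made up; in this notebook you would presumably pass `y_digits` and `fitted_model.predict(X_digits)` instead.

```python
def accuracy(labels, predictions):
    """Return the fraction of predictions that match their labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one sample is required")
    correct = sum(1 for label, predicted in zip(labels, predictions)
                  if label == predicted)
    return correct / len(labels)


# Made-up labels and predictions, for illustration only.
print(accuracy([0, 1, 2, 3], [0, 1, 2, 4]))
```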