Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# AutoML 08b: Remote Execution with DataPrep

This sample accesses a data file on a remote DSVM through Datastore using DataPrep. Advantages of using DataPrep are:
1. DataPrep supports reading from and writing to datastores.
2. DataPrep supports automatic file type and column type detection.
3. DataPrep makes passing data into AutoML really simple.

More DataPrep documentation and examples can be found [here](https://github.com/Microsoft/AMLDataPrepDocs).

Make sure you have executed the [00.configuration](00.configuration.ipynb) notebook before running this one.

In this notebook you will see how to:
1. Store data in a Datastore.
2. Do some basic data preparation using DataPrep and pass the prepared data (Dataflow) to AutoML for training (classification).



## Create Experiment

As part of the setup you have already created a <b>Workspace</b>. For AutoML you also need to create an <b>Experiment</b>. An <b>Experiment</b> is a named object in a <b>Workspace</b> that is used to run experiments.

In [None]:
import logging
import os
import random
import time

from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import numpy as np
import pandas as pd
from sklearn import datasets

import azureml.core
from azureml.core.compute import DsvmCompute
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun

In [None]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'automl-remote-datastore-file'
# project folder
project_folder = './sample_projects/automl-remote-dsvm-file'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data=output, index=['']).T

## Diagnostics

Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [None]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics=True)

## Create a Remote Linux DSVM
**Note**: If creation fails with a message about Marketplace purchase eligibility, go to portal.azure.com, start creating a DSVM there, and select "Want to create programmatically" to enable programmatic creation. Once you've enabled it, you can exit without actually creating the VM.

**Note**: By default SSH runs on port 22 and you don't need to specify it. If, for security reasons, you switch to a different port (such as 5022), append the port number to the address. [Read more](https://render.githubusercontent.com/documentation/sdk/ssh-issue.md) on this.
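
As a minimal sketch of the port note above, the helper below builds an address string that leaves the default port implicit and appends any other port. The function name `with_ssh_port`, the placeholder host name, and the `host:port` form are assumptions made for illustration; they are not part of the SDK.

```python
def with_ssh_port(address, port=22):
    """Return the address unchanged for port 22, otherwise 'address:port'."""
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid TCP port: {port}")
    # Port 22 is the SSH default, so it does not need to be spelled out.
    return address if port == 22 else f"{address}:{port}"


# 'dsvm.example.com' is a placeholder host name, not a real machine.
print(with_ssh_port("dsvm.example.com"))        # default port, nothing appended
print(with_ssh_port("dsvm.example.com", 5022))  # non-default port appended
```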

In [None]:
compute_target_name = 'automl-dataprep'

try:
    # Wait for any in-progress creation of this target to finish before attaching to it.
    while ws.compute_targets[compute_target_name].provisioning_state == 'Creating':
        time.sleep(1)

    dsvm_compute = DsvmCompute(workspace=ws, name=compute_target_name)
    print('found existing:', dsvm_compute.name)
except Exception:
    # The target does not exist yet (or could not be loaded), so provision a new DSVM.
    dsvm_config = DsvmCompute.provisioning_configuration(vm_size="Standard_D2_v2")
    dsvm_compute = DsvmCompute.create(ws, name=compute_target_name, provisioning_configuration=dsvm_config)
    dsvm_compute.wait_for_completion(show_output=True)
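
The wait loop in the cell above polls without an upper bound. One possible variation, sketched here with only the standard library, is a polling helper that gives up after a deadline. The name `wait_while` and the simulated provisioning states are hypothetical; in the notebook the condition would be the check on `provisioning_state`.

```python
import time


def wait_while(condition, timeout_s=600.0, interval_s=1.0):
    """Poll until condition() is false; raise TimeoutError after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while condition():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition still true after {timeout_s} s")
        time.sleep(interval_s)


# Simulated state sequence: 'Creating' twice, then 'Succeeded'.
states = iter(["Creating", "Creating", "Succeeded"])
polls = []


def still_creating():
    state = next(states)
    polls.append(state)
    return state == "Creating"


wait_while(still_creating, timeout_s=5.0, interval_s=0.0)
print(polls)
```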

## Copy data file to local

We will download a 1MB simple random sample of the Chicago Crime data into a local temporary directory.

In [None]:
import tempfile
import requests

temp_folder = tempfile.mkdtemp()
temp_csv = os.path.join(temp_folder, 'crime0.csv')

request = requests.get('https://dprepdata.blob.core.windows.net/demo/crime0-random.csv')
request.raise_for_status()  # stop here if the download failed
# Write the raw bytes so the file is stored exactly as served.
with open(temp_csv, 'wb') as f:
    f.write(request.content)
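
Before uploading, it can be worth confirming that the downloaded file actually contains the label column used later (`Primary Type`). The sketch below is one way to do that with the standard `csv` module, assuming the header is the first row. The helper name `has_column` and the tiny demo file are made up for illustration.

```python
import csv
import os
import tempfile


def has_column(csv_path, column):
    """Return True if the first row of the CSV file contains the column name."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return column in header


# Tiny hypothetical file used only to demonstrate the check.
check_dir = tempfile.mkdtemp()
check_csv = os.path.join(check_dir, "demo.csv")
with open(check_csv, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["ID", "Primary Type", "FBI Code"],
                             ["1", "THEFT", "06"]])

print(has_column(check_csv, "Primary Type"))
```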

## Upload data to the cloud

Now let's make the data available in your datastore. A Datastore is a construct associated with your workspace that references a cloud storage location (e.g. Azure Blob Containers, Azure File Shares, Azure Data Lake Stores, etc.). You register a datastore once, then access it by name without exposing secrets in your code. When you first create a workspace, a default datastore is registered for you; it references the Azure Blob Container that was provisioned with the workspace. Let's upload the data we just downloaded from the public location to the default datastore.

The `csv` file is uploaded into a directory named `datasets` at the root of the datastore.

In [None]:
from azureml.core import Workspace, Datastore

ds = ws.get_default_datastore()
print(ds.datastore_type, ds.account_name, ds.container_name)

In [None]:
ds.upload(src_dir=temp_folder, target_path='datasets', overwrite=True, show_progress=True)
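
The next section reads the file back as `datasets/crime0.csv`. The sketch below shows how that path follows from the upload arguments, under the assumption that each file keeps its path relative to `src_dir` beneath `target_path`. The helper `expected_datastore_paths` is hypothetical and does not call the SDK.

```python
import os
import posixpath
import tempfile


def expected_datastore_paths(src_dir, target_path):
    """List datastore-relative paths for the files under src_dir.

    Assumption: the upload keeps each file's path relative to src_dir
    and places it beneath target_path.
    """
    paths = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), src_dir)
            # Datastore paths use forward slashes regardless of the local OS.
            paths.append(posixpath.join(target_path, *rel.split(os.sep)))
    return sorted(paths)


# Stand-in for the temporary folder that holds the downloaded file.
upload_demo_dir = tempfile.mkdtemp()
open(os.path.join(upload_demo_dir, "crime0.csv"), "w").close()

print(expected_datastore_paths(upload_demo_dir, "datasets"))
```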

## Create Dataflow using DataPrep
Let's use DataPrep to read the `csv` file from the datastore we just uploaded to and get the data profile to make sure our data looks good. We will predict the type of the offense (`Primary Type`).

In [None]:
import azureml.dataprep as dprep

dflow = dprep.read_csv(path=ds.path('datasets/crime0.csv'))
dflow.get_profile()

Let's also take a look at the first 5 rows of the data to give ourselves an idea of what the data looks like.

In [None]:
dflow.head(5)

From the first 5 rows, we see that there are some rows that have no value in the label column (`Primary Type`). Let's remove those rows.

In [None]:
dflow = dflow.drop_nulls('Primary Type')
dflow.head(5)

Now that we've removed those rows, let's split the dataflow into a features dataflow and a label dataflow.

In [None]:
X = dflow.drop_columns(columns=['Primary Type', 'FBI Code'])
y = dflow.keep_columns(columns=['Primary Type'])
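
To make the two preparation steps concrete, here is a plain-Python sketch of the same idea on a few hypothetical rows: drop the rows that have no label, then separate the label column from the features. It only illustrates the concept and is not how DataPrep implements `drop_nulls`, `drop_columns`, or `keep_columns`. Treating both `None` and the empty string as missing is an assumption.

```python
LABEL = "Primary Type"
EXCLUDED_FROM_FEATURES = {"Primary Type", "FBI Code"}

# Hypothetical rows; None stands in for a missing label value.
rows = [
    {"ID": 1, "Primary Type": "THEFT", "FBI Code": "06", "Arrest": False},
    {"ID": 2, "Primary Type": None, "FBI Code": "", "Arrest": False},
    {"ID": 3, "Primary Type": "BATTERY", "FBI Code": "08B", "Arrest": True},
]

# Step 1: keep only rows that have a label.
labelled = [r for r in rows if r[LABEL] not in (None, "")]

# Step 2: features exclude the label and 'FBI Code'; labels keep only the label.
X_rows = [{k: v for k, v in r.items() if k not in EXCLUDED_FROM_FEATURES}
          for r in labelled]
y_rows = [{LABEL: r[LABEL]} for r in labelled]

print(X_rows)
print(y_rows)
```

Because both lists are derived from the same filtered rows, the features and labels stay aligned row by row.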

## Instantiate AutoML <a class="anchor" id="Instatiate-AutoML-Remote-DSVM"></a>

You can also pass the `automl_settings` dictionary as `**kwargs`. Note that you can use the `get_data()` semantic for local executions too.

<i>Note: For Remote DSVM and Batch AI you cannot pass Numpy arrays directly to AutoMLConfig.</i>

|Property|Description|
|-|-|
|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration|
|**iterations**|Number of iterations. In each iteration Auto ML trains a specific pipeline with the data|
|**n_cross_validations**|Number of cross validation splits|
|**max_concurrent_iterations**|Max number of iterations that would be executed in parallel. This should be less than the number of cores on the DSVM.|
|**preprocess**| *True/False* <br>Setting this to *True* enables Auto ML to perform preprocessing <br>on the input to handle *missing data*, and perform some common *feature extraction*|
|**enable_cache**|Setting this to *True* runs preprocessing once and reuses the same preprocessed data for all iterations. Default value is True.|
|**max_cores_per_iteration**| Indicates how many cores on the compute target would be used to train a single pipeline.<br> Default is *1*, you can set it to *-1* to use all cores|
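
The guidance for **max_concurrent_iterations** can be checked before submitting a run. The sketch below assumes you know the core count of the compute target; the value 2 is an assumption for the `Standard_D2_v2` size used in this notebook, and `check_concurrency` is a hypothetical helper, not an SDK function.

```python
def check_concurrency(max_concurrent_iterations, cores):
    """Return the setting if it is at least 1 and below the core count."""
    if max_concurrent_iterations < 1:
        raise ValueError("max_concurrent_iterations must be at least 1")
    if max_concurrent_iterations >= cores:
        raise ValueError(
            f"max_concurrent_iterations={max_concurrent_iterations} "
            f"should be less than the number of cores ({cores})"
        )
    return max_concurrent_iterations


# Assumption: the DSVM has 2 cores.
dsvm_cores = 2
print(check_concurrency(1, dsvm_cores))
```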

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

conda_run_config = RunConfiguration(framework="python")

conda_run_config.target = dsvm_compute

cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]==0.1.0.1918169'], conda_packages=['numpy'], pin_sdk_version=False, pip_indexurl='https://azuremlsdktestpypi.azureedge.net/sdk-release/master/588E708E0DF342C4A80BD954289657CF')
conda_run_config.environment.python.conda_dependencies = cd

In [None]:
automl_settings = {
    "iteration_timeout_minutes": 60,
    "iterations": 4,
    "n_cross_validations": 5,
    "primary_metric": 'accuracy',
    "preprocess": True,
    "max_cores_per_iteration": 1,
    "verbosity": logging.INFO
}
automl_config = AutoMLConfig(task='classification',
                             debug_log='automl_errors.log',
                             path=project_folder,
                             run_configuration=conda_run_config,
                             X=X,
                             y=y,
                             **automl_settings)
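
The `**automl_settings` syntax unpacks the dictionary into keyword arguments. The sketch below shows the equivalence with a hypothetical stand-in function, `make_config`, which only records what it receives; it is not `AutoMLConfig`.

```python
def make_config(task, **settings):
    """Hypothetical stand-in for a config constructor; it only records its arguments."""
    return {"task": task, **settings}


settings = {
    "iterations": 4,
    "n_cross_validations": 5,
    "primary_metric": "accuracy",
}

# Unpacking the dictionary is equivalent to writing each keyword out by hand.
via_dict = make_config("classification", **settings)
via_keywords = make_config("classification",
                           iterations=4,
                           n_cross_validations=5,
                           primary_metric="accuracy")

print(via_dict == via_keywords)
```

Keeping the settings in a dictionary makes it easy to reuse or tweak them across several configurations.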

## Training the Models <a class="anchor" id="Training-the-model-Remote-DSVM"></a>

For remote runs the execution is asynchronous, so you will see the iterations get populated as they complete. You can interact with the widgets/models even while the experiment is running to retrieve the best model up to that point. Once you are satisfied with the model, you can cancel a particular iteration or the whole run.

In [None]:
remote_run = experiment.submit(automl_config, show_output=False)

## Exploring the Results <a class="anchor" id="Exploring-the-Results-Remote-DSVM"></a>
#### Widget for monitoring runs

The widget will sit on "loading" until the first iteration completes, then you will see an auto-updating graph and table show up. It refreshes once per minute, so you should see the graph update as child runs complete.

You can click on a pipeline to see run properties and output logs. Logs are also available on the DSVM under /tmp/azureml_run/{iterationid}/azureml-logs.

NOTE: The widget displays a link at the bottom. This links to a web UI where you can explore the individual run details.

In [None]:
from azureml.widgets import RunDetails
RunDetails(remote_run).show() 

In [None]:
# Wait until the run finishes.
remote_run.wait_for_completion(show_output=True)

In [None]:
remote_run


#### Retrieve All Child Runs
You can also use SDK methods to fetch all the child runs and see the individual metrics that are logged.

In [None]:
children = list(remote_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}    
    metricslist[int(properties['iteration'])] = metrics

rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
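
The loop above does two things that are easy to miss: it converts the `iteration` property from a string to an integer, and it keeps only metrics whose value is a Python `float`. The sketch below replays that logic on made-up data (the metric names and values are hypothetical) so the resulting structure is visible without a remote run.

```python
# Hypothetical child-run data; property values arrive as strings.
child_runs = [
    {"iteration": "1", "metrics": {"accuracy": 0.71, "AUC_weighted": 0.80, "note": "text value"}},
    {"iteration": "0", "metrics": {"accuracy": 0.62, "AUC_weighted": 0.75, "count": 3}},
]

metrics_by_iteration = {}
for run in child_runs:
    # Keep float-valued metrics only; strings and ints are filtered out.
    numeric = {k: v for k, v in run["metrics"].items() if isinstance(v, float)}
    metrics_by_iteration[int(run["iteration"])] = numeric

for iteration in sorted(metrics_by_iteration):
    print(iteration, metrics_by_iteration[iteration])
```

Note that `isinstance(v, float)` also drops integer values such as `count` above, so a metric logged as an `int` would not appear in the table.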

## Canceling Runs
You can cancel ongoing remote runs using the *cancel()* and *cancel_iteration()* functions.

In [None]:
# Cancel the ongoing experiment and stop scheduling new iterations
# remote_run.cancel()

# Cancel iteration 1 and move onto iteration 2
# remote_run.cancel_iteration(1)

## Pre-process cache cleanup
The preprocessed data is cached in the user's default file store. Once the run has completed, you can clean up the cache by running the cell below.

In [None]:
remote_run.clean_preprocessor_cache()

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The *get_output* method returns the best run and the fitted model. There are overloads on *get_output* that allow you to retrieve the best run and fitted model for *any* logged metric or a particular *iteration*.

In [None]:
best_run, fitted_model = remote_run.get_output()

#### Best Model based on any other metric

In [None]:
# lookup_metric = "accuracy"
# best_run, fitted_model = remote_run.get_output(metric=lookup_metric)

#### Model from a specific iteration

In [None]:
# iteration = 1
# best_run, fitted_model = remote_run.get_output(iteration=iteration)
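
To illustrate what "best for a given metric" means, the sketch below picks the iteration with the highest value of a chosen metric from a dictionary of per-iteration metrics. The helper `best_iteration` and the scores are hypothetical. It shows the selection idea only and is not how *get_output* is implemented.

```python
def best_iteration(metrics_by_iteration, metric, higher_is_better=True):
    """Return the iteration number with the best value for the given metric."""
    candidates = {i: m[metric] for i, m in metrics_by_iteration.items() if metric in m}
    if not candidates:
        raise KeyError(f"no iteration logged {metric!r}")
    pick = max if higher_is_better else min
    return pick(candidates, key=candidates.get)


# Hypothetical accuracy values for three iterations.
scores = {0: {"accuracy": 0.62}, 1: {"accuracy": 0.71}, 2: {"accuracy": 0.68}}
print(best_iteration(scores, "accuracy"))
```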

### Testing the Best Fitted Model <a class="anchor" id="Testing-the-Fitted-Model-Remote-DSVM"></a>


In [None]:
dflow = dprep.read_csv(path='https://dprepdata.blob.core.windows.net/demo/crime0-test.csv')
dflow.head(5)

In [None]:
from pandas_ml import ConfusionMatrix

y_test = dflow.keep_columns(columns=['Primary Type']).to_pandas_dataframe()
X_test = dflow.drop_columns(columns=['Primary Type', 'FBI Code']).to_pandas_dataframe()

ypred = fitted_model.predict(X_test.values)

cm = ConfusionMatrix(y_test['Primary Type'], ypred)

print(cm)

cm.plot()
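
A confusion matrix tabulates how often each actual label was predicted as each label. The sketch below computes the same counts with the standard library on a few made-up labels, which can help when reading the printed matrix. The helper `confusion_counts` and the sample labels are hypothetical, and it does not depend on `pandas_ml`.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs; pairs with equal labels are correct."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


# Made-up labels for illustration.
actual = ["THEFT", "THEFT", "BATTERY", "BATTERY"]
predicted = ["THEFT", "BATTERY", "BATTERY", "BATTERY"]

counts = confusion_counts(actual, predicted)
# Accuracy is the share of pairs on the diagonal (actual == predicted).
accuracy = sum(n for (a, p), n in counts.items() if a == p) / len(actual)

print(counts)
print(accuracy)
```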