Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Training and Hyperparameter Tuning of a TensorFlow Model

In this tutorial, we demonstrate how to use the Azure ML Python SDK to train a Convolutional Neural Network (CNN) in TensorFlow to perform handwritten digit recognition on the popular MNIST dataset. We will demonstrate how to perform hyperparameter tuning of the model using AML's HyperDrive service.   

We will cover the following concepts:
* Create a Batch AI GPU cluster
* (To do): DataStore
* Train a TensorFlow model on a single node
* Logging metrics to Run History
* Set up a hyperparameter sweep with HyperDrive
* Select the best model for download

## Prerequisites
Make sure you go through the [00. Installation and Configuration](00.configuration.ipynb) notebook first if you haven't already. In addition, to run through this notebook, you will need to install a few additional packages by running `pip install pillow tensorflow matplotlib pandas tqdm`.

### Authorize HyperDrive Service Principal

The HyperDrive service is in preview, so you need to explicitly grant it permissions. In the Azure portal, add `vienna-test-westus` as a `Contributor` to your resource group. Alternatively, you can do this from the Azure CLI:
```sh
# find the ARM id of your resource group. Copy into memory.
$ az group show -n <rg_name> -o json

# check if https://vienna-test-westus-cluster.sp.azureml.net is a Contributor.
$ az role assignment list --scope <rg_arm_id> -o table

# if not, add it. you will need to be a resource group owner to do this.
$ az role assignment create --role Contributor --scope <rg_arm_id> --assignee https://vienna-test-westus-cluster.sp.azureml.net
```

## 1. Set Up a Workspace
A workspace is the top-level Azure resource for Azure ML services.

In [None]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

In [None]:
import os

from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

## Create An Experiment
**Experiment** is a logical container in an Azure ML Workspace. It hosts run records which can include run metrics and output artifacts from your experiments.

In [None]:
from azureml.core import Experiment
experiment_name = 'hyperdrive-with-tf'
experiment = Experiment(workspace = ws, name = experiment_name)

Create a folder to store the training script.

In [None]:
import os
script_folder = './samples/hyperdrive-with-tf'
os.makedirs(script_folder, exist_ok = True)

## 2. Provision a New Batch AI Cluster
Training machine learning models is often a compute-intensive process. Azure's [Batch AI](https://docs.microsoft.com/en-us/azure/batch-ai/overview) service allows data scientists to leverage the power of compute clusters of CPU or GPU-enabled VMs for training their models. Using the Python SDK, we can easily provision a Batch AI cluster with the specifications we want.

In [None]:
from azureml.core.compute import BatchAiCompute
from azureml.core.compute import ComputeTarget

# choose a name for your cluster
batchai_cluster_name = ws.name + "gpu"

found = False
# see if this compute target already exists in the workspace
for ct in ws.compute_targets():
    print(ct.name, ct.type)
    if ct.name == batchai_cluster_name and type(ct) is BatchAiCompute:
        found = True
        print('found compute target. just use it.')
        compute_target = ct
        break
        
if not found:
    print('creating a new compute target...')
    provisioning_config = BatchAiCompute.provisioning_configuration(vm_size = "STANDARD_NC6", # NC6 is GPU-enabled
                                                                #vm_priority = 'lowpriority', # optional
                                                                autoscale_enabled = True,
                                                                cluster_min_nodes = 1, 
                                                                cluster_max_nodes = 4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, batchai_cluster_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of the current Batch AI cluster status, use the 'status' property
    print(compute_target.status.serialize())

Here, we specify the following parameters for the `provisioning_config`:
* `vm_size`: the family and size of the VM to use. For this tutorial we want to leverage GPU nodes, so we specify the `STANDARD_NC6` VM, which has one NVIDIA K80 GPU
* `vm_priority`: `'lowpriority'` or `'dedicated'`
* `autoscale_enabled`: when set to `True`, Batch AI automatically resizes the cluster based on the demands of your workload. The default is `False`, which creates a cluster with a fixed number of nodes
* `cluster_min_nodes`: minimum number of VMs for autoscaling
* `cluster_max_nodes`: maximum number of VMs for autoscaling

## 3. Train TensorFlow MNIST
Now let's train a CNN on the MNIST dataset to recognize handwritten digits. The training script `tf_mnist_train.py` is adapted from TensorFlow's [MNIST](https://www.tensorflow.org/versions/r1.4/get_started/mnist/pros) tutorial. The changes to the original concern logging some metrics about the training run to the AML run history. See the adapted file here: [tf_mnist_train.py](tf_mnist_train.py) -- search for 'run_logger' to find the added lines of code.

In [None]:
from shutil import copyfile

training_script = 'tf_mnist_train.py'
# copy the training script (tf_mnist_train.py) to the project folder
copyfile(training_script, os.path.join(script_folder, training_script))

In [None]:
# take a look at the training script
!more $training_script

### a. Run a single-node TensorFlow experiment

To facilitate ML training, the Python SDK provides a high-level abstraction called Estimators that allows users to train CNTK, TensorFlow, or custom scripts in the Azure ML ecosystem. Let's instantiate an AML TensorFlow Estimator (not to be conflated with the [`tf.estimator.Estimator`](https://www.tensorflow.org/programmers_guide/estimators) class).

In [None]:
from azureml.train.dnn import TensorFlow

script_params = {
    '--minibatch_size': 64,
    '--learning_rate': 0.001,
    '--keep_probability': 0.5,
    '--output_dir': 'outputs',
    '--num_iterations': 1000
}

tf_estimator = TensorFlow(source_directory = script_folder, 
                          script_params = script_params, 
                          compute_target = compute_target, 
                          entry_script = training_script, 
                          node_count = 1,
                          use_gpu = True)

We specify the following parameters to the TensorFlow constructor:
* `script_params`: a dictionary specifying the command-line arguments to your `entry_script`
* `compute_target`: the compute target object. Can be a local, DSVM, or Batch AI compute target
* `entry_script`: the path, relative to `source_directory`, of the file to be executed during training
* `node_count`: the number of nodes to use for the training job. Defaults to `1`
* `use_gpu`: to leverage the GPU for training, set this flag to `True`. Defaults to `False`
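
Because each key in `script_params` is passed to the entry script as a command-line flag, the script has to parse those flags itself. The following is a minimal sketch of that parsing using `argparse`. The flag names come from the `script_params` dictionary above, while the types and defaults are assumptions made for illustration and may differ from the actual `tf_mnist_train.py`.

```python
import argparse


def parse_training_args(argv=None):
    """Parse the flags that `script_params` passes on the command line."""
    parser = argparse.ArgumentParser(description="MNIST training arguments (sketch)")
    # Types and defaults below are illustrative assumptions, not the real script's.
    parser.add_argument("--minibatch_size", type=int, default=64)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--keep_probability", type=float, default=0.5)
    parser.add_argument("--output_dir", type=str, default="outputs")
    parser.add_argument("--num_iterations", type=int, default=1000)
    return parser.parse_args(argv)


# Simulate the command line built from a script_params dictionary.
args = parse_training_args(["--learning_rate", "0.01", "--minibatch_size", "128"])
print(args.learning_rate, args.minibatch_size, args.output_dir)
```

Flags that are not supplied fall back to their defaults, so the same script can be run locally without any arguments.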

**Note on the `outputs` folder:**

When running an experiment using the Python SDK, you can write files out to a folder named `outputs` that is relative to the root directory. This folder is specially tracked by AML in the sense that any files written to that folder during script execution will be picked up by Run History; these files (known as *artifacts*) will be available as part of the run history record.
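
As a rough illustration of what this means inside a training script, the sketch below creates the `outputs` folder and writes one file into it. The file name `training_summary.txt` and its contents are hypothetical placeholders. The real script writes TensorFlow checkpoint files instead.

```python
import os

# 'outputs' is the specially tracked folder described above.
output_dir = "outputs"
os.makedirs(output_dir, exist_ok=True)

# 'training_summary.txt' is a hypothetical file name used only for illustration.
summary_path = os.path.join(output_dir, "training_summary.txt")
with open(summary_path, "w") as f:
    f.write("final_accuracy=0.0\n")  # placeholder value, not a real result

print("wrote", summary_path)
```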

In [None]:
run = experiment.submit(tf_estimator)

### b. Monitoring the training run
There are several ways to monitor the details and status of the training run.

Browse to the run history report (use Chrome please, for now)

In [None]:
run

Print out the current run status

You can also use a widget to monitor the progress of your submitted run, which allows you to do so without blocking your notebook execution:

In [None]:
from azureml.train.widgets import RunDetails

RunDetails(run).show()

![img](../images/hd_tf1.png)

In [None]:
# to block and wait for training to complete 
run.wait_for_completion(show_output = True)

You can also check on the Batch AI cluster and job status using `az-cli` commands:
```shell
# check cluster status. You can see how many nodes are running.
$ az batchai cluster list

# check job status. You can see how many jobs are running
$ az batchai job list
```

### c. Log metrics to Run History

Another useful feature of the Python SDK is the ability to log metrics for each run. These metrics are persisted in the run history by AML. In addition, they are automatically displayed and visualized by the RunDetails widget. (Logging run metrics is also required in order to use the HyperDrive service, which we will go over in more detail in section 4.)

The code snippet below, taken from `tf_mnist_train.py`, shows how we can log the script parameters for a training run by specifying a key for the metric and the corresponding value:
```python
from azureml.core.run import Run

run_logger = Run.get_submitted_run()
run_logger.log("learning_rate", args.learning_rate)
run_logger.log("minibatch_size", args.minibatch_size)
run_logger.log("keep_probability", args.keep_probability)
run_logger.log("num_iterations", args.num_iterations)
```

In [None]:
run.get_metrics()
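
When developing the training script locally, it can help to see what shape of data these calls produce. The sketch below uses a hypothetical in-memory stand-in (`LocalRunLogger`, which is not part of the Azure ML SDK) that mimics the behavior this notebook relies on later: a key logged once comes back as a single value, and a key logged repeatedly comes back as a list. That behavior is an assumption inferred from how the metrics are indexed in section 4c.

```python
class LocalRunLogger:
    """Hypothetical stand-in for the run logger; keeps metrics in memory."""

    def __init__(self):
        self._metrics = {}

    def log(self, name, value):
        # Record every value logged under the same name, in order.
        self._metrics.setdefault(name, []).append(value)

    def get_metrics(self):
        # Single values are returned bare; repeated values as a list.
        return {name: values[0] if len(values) == 1 else list(values)
                for name, values in self._metrics.items()}


logger = LocalRunLogger()
logger.log("learning_rate", 0.001)
for i, acc in [(100, 0.91), (200, 0.95), (300, 0.97)]:  # made-up numbers
    logger.log("Accuracy", acc)
    logger.log("Iterations", i)

metrics = logger.get_metrics()
print(metrics["learning_rate"])    # scalar
print(metrics["Accuracy"][-1])     # most recent accuracy
print(max(metrics["Iterations"]))  # iterations completed
```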

## 4. Hyperparameter Tuning with HyperDrive

Now that we've seen how to do a simple TensorFlow training run using the Python SDK, let's see if we can further improve the accuracy of our model.

Hyperparameter tuning is a key part of machine learning experimentation, in which the data scientist tries different configurations of hyperparameters in order to find a set of values that optimizes a specific target metric, such as the accuracy of the model. To this end, Azure ML provides the **HyperDrive service** to facilitate the hyperparameter tuning process.

### a. Start a HyperDrive run

Using HyperDrive, we specify the hyperparameter space to sweep over, the primary metric to optimize, and an early termination policy. HyperDrive will kick off multiple child runs with different hyperparameter configurations, and terminate underperforming runs according to the early termination policy provided.

In [None]:
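from azureml.train.hyperdrive import (BanditPolicy, HyperDriveRunConfig, PrimaryMetricGoal,
                                      RandomParameterSampling, loguniform, uniform)

param_sampling = RandomParameterSampling( {
        "learning_rate": loguniform(-10, -3),
        # uniform takes (min_value, max_value), so the smaller bound goes first
        "keep_probability": uniform(0.1, 0.5)
    }
)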

early_termination_policy = BanditPolicy(slack_factor = 0.15, evaluation_interval=2)

hyperdrive_run_config = HyperDriveRunConfig(estimator = tf_estimator, 
                                           hyperparameter_sampling = param_sampling, 
                                           policy = early_termination_policy,
                                           primary_metric_name = "Accuracy",
                                           primary_metric_goal = PrimaryMetricGoal.MAXIMIZE,
                                           max_total_runs = 20,
                                           max_concurrent_runs = 4)

In the above cell, we first define a sampling space for the hyperparameters we want to sweep over, specifically the `learning_rate` and `keep_probability`. In this case we are using `RandomParameterSampling`, which allows us to specify the parameter values as either a choice among discrete values or as a distribution over a continuous range (here, we use a log-uniform distribution for `learning_rate` and a uniform distribution for `keep_probability`). You can run `help(RandomParameterSampling)` for more API details on this class.
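
To get a feel for the ranges involved, the two distributions can be imitated with the standard library. The sketch below assumes that `loguniform(a, b)` draws a value whose natural logarithm is uniformly distributed on `[a, b]`, that is, `exp(uniform(a, b))`, and it uses the `keep_probability` bounds 0.1 and 0.5 in ascending order. It is an illustration only and does not call HyperDrive.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def sample_loguniform(low, high):
    """Assumed semantics: exp of a uniform draw on [low, high]."""
    return math.exp(rng.uniform(low, high))


def sample_uniform(low, high):
    """Plain uniform draw on [low, high]."""
    return rng.uniform(low, high)


# Bounds implied by loguniform(-10, -3): roughly 4.5e-05 to 5.0e-02.
print("learning_rate range: {:.1e} to {:.1e}".format(math.exp(-10), math.exp(-3)))

for _ in range(5):
    config = {
        "learning_rate": sample_loguniform(-10, -3),
        "keep_probability": sample_uniform(0.1, 0.5),
    }
    print(config)
```

Sampling the learning rate on a logarithmic scale spreads the trials evenly across orders of magnitude instead of concentrating them near the upper bound.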

Then, we specify the early termination policy to use. If no policy is specified, all training runs are run to completion. Here we use the `BanditPolicy`, which terminates any run whose primary metric does not fall within the slack factor of the best-performing run. Run `help(BanditPolicy)` for more details on this policy.
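
The slack-factor rule can be written down in a few lines. The sketch below assumes the rule as documented for a metric that is being maximized: a run is terminated when its metric is lower than the best metric seen so far divided by `1 + slack_factor`. The function name and the sample values are hypothetical, and the real service applies the check only at the configured evaluation intervals.

```python
def should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the slack allowed by the bandit rule.

    Assumes the primary metric is maximized.
    """
    threshold = best_metric / (1.0 + slack_factor)
    return run_metric < threshold


best_accuracy = 0.98  # made-up value for the best run so far
slack_factor = 0.15   # same value as in the cell above

print("threshold: {:.4f}".format(best_accuracy / (1.0 + slack_factor)))
for accuracy in (0.97, 0.90, 0.80):
    print(accuracy, should_terminate(accuracy, best_accuracy, slack_factor))
```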

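`evaluation_interval` sets how often the policy is applied, counted in the number of times the training script logs the primary metric. Our script logs `'Accuracy'` every 100 training iterations, so with `evaluation_interval=2` the policy is applied at every second logged value, that is, roughly every 200 iterations.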

We specify the following parameters to the `HyperDriveRunConfig` constructor:
* explain input_paths?
* `estimator`: the estimator that will be called with the sampled hyperparameters
* `hyperparameter_sampling`: the sampling space to use
* `policy`: the early termination policy
* `primary_metric_name`: the name of the metric logged to the AML Run that HyperDrive will use to evaluate runs. Here, we are using the test accuracy (logged as 'Accuracy' in our training script)
* `primary_metric_goal`: the optimization goal of the primary metric (either `PrimaryMetricGoal.MAXIMIZE` or `PrimaryMetricGoal.MINIMIZE`)
* `max_total_runs`: the maximum number of runs HyperDrive will kick off
* `max_concurrent_runs`: the maximum number of runs to run concurrently
* `compute_target`: the compute target. In our case, the Batch AI cluster we provisioned

**Note on logging metrics for HyperDrive:**

In order to use HyperDrive, we need to log the metric that the service uses for evaluating run performance (`primary_metric_name`). In our script, we use the accuracy of the model evaluated on the MNIST test dataset as our primary metric. Every 100 training iterations, we calculate and log this test accuracy (`'Accuracy'`). We also log an additional utility metric, `'Iterations'`, which records the number of training iterations completed at the time each `'Accuracy'` value was logged (see `tf_mnist_train.py` for more details). This is useful for seeing how many iterations were trained for jobs that were terminated early.

```python
run_logger.log("Accuracy", float(test_acc))
run_logger.log("Iterations", i)
```

In [None]:
# start the HyperDrive run
hyperdrive_run = experiment.submit(hyperdrive_run_config)

In [None]:
hyperdrive_run

### b. Use a widget to visualize details of the HyperDrive runs

Runs will automatically start to show in the following widget once rendered.

In [None]:
from azureml.train.widgets import RunDetails

RunDetails(hyperdrive_run).show()

![img](../images/hd_tf2.png)

In [None]:
# check cluster status, pay attention to the # of running nodes
# !az batchai cluster list -o table

In [None]:
# check the Batch AI job queue. Notice the Job name is the run history ID. 
# Pay attention to the state of the job.
# !az batchai job list -o table

### c. Find the best run

Once all of the HyperDrive runs have completed, we can find the run that achieved the highest accuracy and its corresponding hyperparameters.

In [None]:
hyperdrive_run.wait_for_completion(show_output = True)

In [None]:
from tqdm import tqdm

import helpers  # assumption: the local helper module that accompanies this notebook

table = helpers.ListTable()
table.append(['Accuracy', 'Run', 'Iterations', 'learning_rate', 'keep_probability'])

run_metrics = {}
# use a separate loop variable so the earlier `run` object is not overwritten
for child_run in tqdm(hyperdrive_run.get_children()):
    metrics = child_run.get_metrics()
    # skip child runs that never logged the primary metric
    if 'Accuracy' in metrics:
        metrics['Accuracy'] = metrics['Accuracy'][-1] # final test accuracy
        metrics['Iterations'] = max(metrics['Iterations']) # number of iterations the run ran for
        
        table.append([metrics['Accuracy'], 
                      child_run.id, 
                      metrics['Iterations'], 
                      metrics['learning_rate'], 
                      metrics['keep_probability']])
        run_metrics[child_run.id] = metrics
table

In [None]:
from azureml.core.run import Run

best_run_id = max(run_metrics, key = lambda k: run_metrics[k]['Accuracy'])
best_run_metrics = run_metrics[best_run_id]
experiment = Experiment(ws, experiment_name)
best_run = Run(experiment, best_run_id)

print('Best Run is:\n  Accuracy: {0:.6f} \n  Learning rate: {1:.6f} \n  Keep probability: {2}'.format(
        best_run_metrics['Accuracy'],
        best_run_metrics['learning_rate'],
        best_run_metrics['keep_probability']
    ))

print(helpers.get_run_history_url(best_run))

### Plot the runs [Optional] 
Note you will need to install `matplotlib` for this.

In [None]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

plot_data = np.array([[run_metrics[i]['keep_probability'], 
                       run_metrics[i]['learning_rate'], 
                       run_metrics[i]['Accuracy']] for i in run_metrics.keys()])
area = np.array([[run_metrics[i]['Iterations']/5] for i in run_metrics.keys()])

plt.figure(figsize = (15,5))
plt.scatter(plot_data[:,0], plot_data[:,1], s = area, c = plot_data[:,2], alpha = 0.4)
plt.xlabel("keep_probability")
plt.ylabel("learning_rate")
plt.yscale('log')
plt.ylim(0.00001,0.06)
plt.colorbar()
plt.clim(0.95, max(plot_data[:,2]))
plt.show()

### d. Download model from the best run
Once we've identified the best run from HyperDrive, we can download the model files to our local machine.  

The final trained model checkpoint files are located in the `outputs` directory picked up by AML. We can run the below line of code to confirm that those files are present:

In [None]:
best_run.get_file_names()

Finally, we can download the relevant checkpoint files. Note that there is currently a bug in uploading files when executing on a Batch AI cluster, so the code below doesn't work yet.

In [None]:
import os

output_dir = 'outputs'
target_dir = os.path.join('sample_projects', 'outputs')
# make sure the local destination folder exists before downloading
os.makedirs(target_dir, exist_ok = True)

model_files_to_download = ['checkpoint', 'model.ckpt.data-00000-of-00001', 'model.ckpt.index', 'model.ckpt.meta']
for file_name in model_files_to_download:
    model_src_path = os.path.join(output_dir, file_name)
    model_dest_path = os.path.join(target_dir, file_name)
    print('downloading ' + file_name)
    best_run.download_file(name = model_src_path, output_file_path = model_dest_path)

### e. Test the model locally
Now that we have downloaded the best-performing model, we can use it locally to score images of handwritten digits. For this we have prepared a scoring file, [tf_mnist_score.py](tf_mnist_score.py), which we import below. `tf_mnist_score.py` provides a function `run(input_data)` that accepts a base64-encoded image in a JSON dict format (this format is friendly for the deployment of a web service, which we will do later).

Note that this scoring code requires tensorflow and PIL (`pip install tensorflow pillow`).

First, we will create a base64-encoded image in a json structure based on one of the test images provided in the folder `mnist_test_images`:

In [None]:
import os, json, base64
from PIL import Image 
import tf_mnist_score
from io import BytesIO

def imgToBase64(img):
    imgio = BytesIO()
    img.save(imgio, 'JPEG')
    img_str = base64.b64encode(imgio.getvalue())
    return img_str.decode('utf-8')

# Generate JSON Base64-encoded image from sample test input
test_img_path = os.path.join('mnist_test_images', 'img_3.jpg')
base64Img = imgToBase64(Image.open(test_img_path))
data = json.dumps({'data': base64Img})
print(data)
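
For orientation, the receiving side has to reverse these steps. The following is a minimal sketch of such a decoder, assuming the payload has the shape `{"data": "<base64 string>"}` produced above. `decode_payload` is a hypothetical helper, not the actual code in `tf_mnist_score.py`, and raw bytes stand in for the image so that the example needs no image file.

```python
import base64
import json


def decode_payload(payload):
    """Turn a JSON string of the form {"data": <base64 text>} back into bytes."""
    encoded = json.loads(payload)["data"]
    return base64.b64decode(encoded)


# Round trip with placeholder bytes instead of a real JPEG.
original_bytes = b"\xff\xd8 placeholder image bytes \xff\xd9"
payload = json.dumps({"data": base64.b64encode(original_bytes).decode("utf-8")})

assert decode_payload(payload) == original_bytes
print("round trip OK,", len(original_bytes), "bytes")
```

In a real scoring function, the decoded bytes would typically be wrapped in a `BytesIO` object and opened with `PIL.Image.open` before being passed to the model.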

Then we will call `tf_mnist_score.run()` with the JSON data structure we created above. We also display the image that we are scoring, so we can compare the label returned by the model with the actual handwritten digit.

In [None]:
from IPython.display import Image as IPImage
tf_mnist_score.init()
result = tf_mnist_score.run(data)
print(result)
IPImage(filename=test_img_path, width=200)