MachineLearningNotebooks/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-keras/train-hyperparameter-tune-deploy-with-keras.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
        "\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-keras/train-hyperparameter-tune-deploy-with-keras.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nbpresent": {
          "id": "bf74d2e9-2708-49b1-934b-e0ede342f475"
        }
      },
      "source": [
        "# Training, hyperparameter tune, and deploy with Keras\n",
        "\n",
        "## Introduction\n",
        "This tutorial shows how to train a simple deep neural network using the MNIST dataset and Keras on Azure Machine Learning. MNIST is a popular dataset consisting of 70,000 grayscale images. Each image is a handwritten digit of `28x28` pixels, representing number from 0 to 9. The goal is to create a multi-class classifier to identify the digit each image represents, and deploy it as a web service in Azure.\n",
        "\n",
        "For more information about the MNIST dataset, please visit [Yan LeCun's website](http://yann.lecun.com/exdb/mnist/).\n",
        "\n",
        "## Prerequisite:\n",
        "* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning\n",
        "* If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) to:\n",
        "    * install the AML SDK\n",
        "    * create a workspace and its configuration file (`config.json`)\n",
        "* For local scoring test, you will also need to have `tensorflow` and `keras` installed in the current Jupyter kernel."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's get started. First let's import some Python libraries."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "nbpresent": {
          "id": "c377ea0c-0cd9-4345-9be2-e20fb29c94c3"
        }
      },
      "outputs": [],
      "source": [
        "%matplotlib inline\n",
        "import numpy as np\n",
        "import os\n",
        "import matplotlib.pyplot as plt"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "nbpresent": {
          "id": "edaa7f2f-2439-4148-b57a-8c794c0945ec"
        }
      },
      "outputs": [],
      "source": [
        "import azureml\n",
        "from azureml.core import Workspace\n",
        "\n",
        "# check core SDK version number\n",
        "print(\"Azure ML SDK Version: \", azureml.core.VERSION)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Initialize workspace\n",
        "Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "print('Workspace name: ' + ws.name, \n",
        "      'Azure region: ' + ws.location, \n",
        "      'Subscription id: ' + ws.subscription_id, \n",
        "      'Resource group: ' + ws.resource_group, sep='\\n')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nbpresent": {
          "id": "59f52294-4a25-4c92-bab8-3b07f0f44d15"
        }
      },
      "source": [
        "## Create an Azure ML experiment\n",
        "Let's create an experiment named \"keras-mnist\" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "nbpresent": {
          "id": "bc70f780-c240-4779-96f3-bc5ef9a37d59"
        }
      },
      "outputs": [],
      "source": [
        "from azureml.core import Experiment\n",
        "\n",
        "script_folder = './keras-mnist'\n",
        "os.makedirs(script_folder, exist_ok=True)\n",
        "\n",
        "exp = Experiment(workspace=ws, name='keras-mnist')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Explore data\n",
        "\n",
        "Before you train a model, you need to understand the data that you are using to train it. In this section you learn how to:\n",
        "\n",
        "* Download the MNIST dataset\n",
        "* Display some sample images\n",
        "\n",
        "### Download the MNIST dataset\n",
        "\n",
        "Download the MNIST dataset and save the files into a `data` directory locally.  Images and labels for both training and testing are downloaded."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import urllib.request\n",
        "\n",
        "data_folder = os.path.join(os.getcwd(), 'data')\n",
        "os.makedirs(data_folder, exist_ok=True)\n",
        "\n",
        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz', filename=os.path.join(data_folder, 'train-images.gz'))\n",
        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz', filename=os.path.join(data_folder, 'train-labels.gz'))\n",
        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', filename=os.path.join(data_folder, 'test-images.gz'))\n",
        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', filename=os.path.join(data_folder, 'test-labels.gz'))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Display some sample images\n",
        "\n",
        "Load the compressed files into `numpy` arrays. Then use `matplotlib` to plot 30 random images from the dataset with their labels above them. Note this step requires a `load_data` function that's included in an `utils.py` file. This file is included in the sample folder. Please make sure it is placed in the same folder as this notebook. The `load_data` function simply parses the compressed files into numpy arrays."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# make sure utils.py is in the same directory as this code\n",
        "from utils import load_data, one_hot_encode\n",
        "\n",
        "# note we also shrink the intensity values (X) from 0-255 to 0-1. This helps the model converge faster.\n",
        "X_train = load_data(os.path.join(data_folder, 'train-images.gz'), False) / 255.0\n",
        "X_test = load_data(os.path.join(data_folder, 'test-images.gz'), False) / 255.0\n",
        "y_train = load_data(os.path.join(data_folder, 'train-labels.gz'), True).reshape(-1)\n",
        "y_test = load_data(os.path.join(data_folder, 'test-labels.gz'), True).reshape(-1)\n",
        "\n",
        "# now let's show some randomly chosen images from the training set.\n",
        "count = 0\n",
        "sample_size = 30\n",
        "plt.figure(figsize = (16, 6))\n",
        "for i in np.random.permutation(X_train.shape[0])[:sample_size]:\n",
        "    count = count + 1\n",
        "    plt.subplot(1, sample_size, count)\n",
        "    plt.axhline('')\n",
        "    plt.axvline('')\n",
        "    plt.text(x=10, y=-10, s=y_train[i], fontsize=18)\n",
        "    plt.imshow(X_train[i].reshape(28, 28), cmap=plt.cm.Greys)\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now you have an idea of what these images look like and the expected prediction outcome."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nbpresent": {
          "id": "defe921f-8097-44c3-8336-8af6700804a7"
        }
      },
      "source": [
        "## Create a FileDataset\n",
        "A FileDataset references one or multiple files in your datastores or public urls. The files can be of any format. FileDataset provides you with the ability to download or mount the files to your compute. By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred. [Learn More](https://aka.ms/azureml/howto/createdatasets)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.dataset import Dataset\n",
        "\n",
        "web_paths = [\n",
        "            'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',\n",
        "            'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',\n",
        "            'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',\n",
        "            'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'\n",
        "            ]\n",
        "dataset = Dataset.File.from_files(path = web_paths)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Use the `register()` method to register datasets to your workspace so they can be shared with others, reused across various experiments, and referred to by name in your training script."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dataset = dataset.register(workspace = ws,\n",
        "                           name = 'mnist dataset',\n",
        "                           description='training and test dataset',\n",
        "                           create_new_version=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Create or Attach existing AmlCompute\n",
        "You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, you create `AmlCompute` as your training compute resource."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If we could not find the cluster with the given name, then we will create a new cluster here. We will create an `AmlCompute` cluster of `STANDARD_NC6` GPU VMs. This process is broken down into 3 steps:\n",
        "1. create the configuration (this step is local and only takes a second)\n",
        "2. create the cluster (this step will take about **20 seconds**)\n",
        "3. provision the VMs to bring the cluster to the initial size (of 1 in this case). This step will take about **3-5 minutes** and is providing only sparse output in the process. Please make sure to wait until the call returns before moving to the next cell"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.compute import ComputeTarget, AmlCompute\n",
        "from azureml.core.compute_target import ComputeTargetException\n",
        "\n",
        "# choose a name for your cluster\n",
        "cluster_name = \"gpu-cluster\"\n",
        "\n",
        "try:\n",
        "    compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
        "    print('Found existing compute target')\n",
        "except ComputeTargetException:\n",
        "    print('Creating a new compute target...')\n",
        "    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', \n",
        "                                                           max_nodes=4)\n",
        "\n",
        "    # create the cluster\n",
        "    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
        "\n",
        "    # can poll for a minimum number of nodes and for a specific timeout. \n",
        "    # if no min node count is provided it uses the scale settings for the cluster\n",
        "    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
        "\n",
        "# use get_status() to get a detailed status for the current cluster. \n",
        "print(compute_target.get_status().serialize())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now that you have created the compute target, let's see what the workspace's `compute_targets` property returns. You should now see one entry named \"gpu-cluster\" of type `AmlCompute`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "compute_targets = ws.compute_targets\n",
        "for name, ct in compute_targets.items():\n",
        "    print(name, ct.type, ct.provisioning_state)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Copy the training files into the script folder\n",
        "The Keras training script is already created for you. You can simply copy it into the script folder, together with the utility library used to load compressed data file into numpy array."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import shutil\n",
        "\n",
        "# the training logic is in the keras_mnist.py file.\n",
        "shutil.copy('./keras_mnist.py', script_folder)\n",
        "\n",
        "# the utils.py just helps loading data from the downloaded MNIST dataset into numpy arrays.\n",
        "shutil.copy('./utils.py', script_folder)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nbpresent": {
          "id": "2039d2d5-aca6-4f25-a12f-df9ae6529cae"
        }
      },
      "source": [
        "## Construct neural network in Keras\n",
        "In the training script `keras_mnist.py`, it creates a very simple DNN (deep neural network), with just 2 hidden layers. The input layer has 28 * 28 = 784 neurons, each representing a pixel in an image. The first hidden layer has 300 neurons, and the second hidden layer has 100 neurons. The output layer has 10 neurons, each representing a targeted label from 0 to 9.\n",
        "\n",
        "![DNN](nn.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Azure ML concepts  \n",
        "Please note the following three things in the code below:\n",
        "1. The script accepts arguments using the argparse package. In this case there is one argument `--data_folder` which specifies the FileDataset in which the script can find the MNIST data\n",
        "```\n",
        "    parser = argparse.ArgumentParser()\n",
        "    parser.add_argument('--data_folder')\n",
        "```\n",
        "2. The script is accessing the Azure ML `Run` object by executing `run = Run.get_context()`. Further down the script is using the `run` to report the loss and accuracy at the end of each epoch via callback.\n",
        "```\n",
        "    run.log('Loss', log['loss'])\n",
        "    run.log('Accuracy', log['acc'])\n",
        "```\n",
        "3. When running the script on Azure ML, you can write files out to a folder `./outputs` that is relative to the root directory. This folder is specially tracked by Azure ML in the sense that any files written to that folder during script execution on the remote target will be picked up by Run History; these files (known as artifacts) will be available as part of the run history record."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The next cell will print out the training code for you to inspect."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "with open(os.path.join(script_folder, './keras_mnist.py'), 'r') as f:\n",
        "    print(f.read())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Create TensorFlow estimator & add Keras\n",
        "Next, we construct an `azureml.train.dnn.TensorFlow` estimator object, use the `gpu-cluster` as compute target, and pass the mount-point of the datastore to the training code as a parameter.\n",
        "The TensorFlow estimator is providing a simple way of launching a TensorFlow training job on a compute target. It will automatically provide a docker image that has TensorFlow installed. In this case, we add `keras` package (for the Keras framework obviously), and `matplotlib` package for plotting a \"Loss vs. Accuracy\" chart and record it in run history."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dataset = Dataset.get_by_name(ws, 'mnist dataset')\n",
        "\n",
        "# list the files referenced by mnist dataset\n",
        "dataset.to_path()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.train.dnn import TensorFlow\n",
        "\n",
        "script_params = {\n",
        "    '--data-folder': dataset.as_named_input('mnist').as_mount(),\n",
        "    '--batch-size': 50,\n",
        "    '--first-layer-neurons': 300,\n",
        "    '--second-layer-neurons': 100,\n",
        "    '--learning-rate': 0.001\n",
        "}\n",
        "\n",
        "est = TensorFlow(source_directory=script_folder,\n",
        "                 script_params=script_params,\n",
        "                 compute_target=compute_target, \n",
        "                 entry_script='keras_mnist.py', \n",
        "                 pip_packages=['keras==2.2.5','azureml-dataprep[pandas,fuse]','matplotlib'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Submit job to run\n",
        "Submit the estimator to the Azure ML experiment to kick off the execution."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run = exp.submit(est)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Monitor the Run\n",
        "As the Run is executed, it will go through the following stages:\n",
        "1. Preparing: A docker image is created matching the Python environment specified by the TensorFlow estimator and it will be uploaded to the workspace's Azure Container Registry. This step will only happen once for each Python environment -- the container will then be cached for subsequent runs. Creating and uploading the image takes about **5 minutes**. While the job is preparing, logs are streamed to the run history and can be viewed to monitor the progress of the image creation.\n",
        "\n",
        "2. Scaling: If the compute needs to be scaled up (i.e. the AmlCompute cluster requires more nodes to execute the run than currently available), the cluster will attempt to scale up in order to make the required amount of nodes available. Scaling typically takes about **5 minutes**.\n",
        "\n",
        "3. Running: All scripts in the script folder are uploaded to the compute target, data stores are mounted/copied and the `entry_script` is executed. While the job is running, stdout and the `./logs` folder are streamed to the run history and can be viewed to monitor the progress of the run.\n",
        "\n",
        "4. Post-Processing: The `./outputs` folder of the run is copied over to the run history\n",
        "\n",
        "There are multiple ways to check the progress of a running job. We can use a Jupyter notebook widget. \n",
        "\n",
        "**Note: The widget will automatically update ever 10-15 seconds, always showing you the most up-to-date information about the run**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.widgets import RunDetails\n",
        "RunDetails(run).show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can also periodically check the status of the run object, and navigate to Azure portal to monitor the run."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.wait_for_completion(show_output=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the outputs of the training script, it prints out the Keras version number. Please make a note of it."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### The Run object\n",
        "The Run object provides the interface to the run history -- both to the job and to the control plane (this notebook), and both while the job is running and after it has completed. It provides a number of interesting features for instance:\n",
        "* `run.get_details()`: Provides a rich set of properties of the run\n",
        "* `run.get_metrics()`: Provides a dictionary with all the metrics that were reported for the Run\n",
        "* `run.get_file_names()`: List all the files that were uploaded to the run history for this Run. This will include the `outputs` and `logs` folder, azureml-logs and other logs, as well as files that were explicitly uploaded to the run using `run.upload_file()`\n",
        "\n",
        "Below are some examples -- please run through them and inspect their output. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.get_details()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.get_metrics()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.get_file_names()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Download the saved model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the training script, the Keras model is saved into two files, `model.json` and `model.h5`, in the `outputs/models` folder on the gpu-cluster AmlCompute node. Azure ML automatically uploaded anything written in the `./outputs` folder into run history file store. Subsequently, we can use the `run` object to download the model files. They are under the the `outputs/model` folder in the run history file store, and are downloaded into a local folder named `model`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# create a model folder in the current directory\n",
        "os.makedirs('./model', exist_ok=True)\n",
        "\n",
        "for f in run.get_file_names():\n",
        "    if f.startswith('outputs/model'):\n",
        "        output_file_path = os.path.join('./model', f.split('/')[-1])\n",
        "        print('Downloading from {} to {} ...'.format(f, output_file_path))\n",
        "        run.download_file(name=f, output_file_path=output_file_path)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Predict on the test set\n",
        "Let's check the version of the local Keras. Make sure it matches with the version number printed out in the training script. Otherwise you might not be able to load the model properly."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import keras\n",
        "import tensorflow as tf\n",
        "\n",
        "print(\"Keras version:\", keras.__version__)\n",
        "print(\"Tensorflow version:\", tf.__version__)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now let's load the downloaded model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from keras.models import model_from_json\n",
        "\n",
        "# load json and create model\n",
        "json_file = open('model/model.json', 'r')\n",
        "loaded_model_json = json_file.read()\n",
        "json_file.close()\n",
        "loaded_model = model_from_json(loaded_model_json)\n",
        "# load weights into new model\n",
        "loaded_model.load_weights(\"model/model.h5\")\n",
        "print(\"Model loaded from disk.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Feed test dataset to the persisted model to get predictions."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# evaluate loaded model on test data\n",
        "loaded_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])\n",
        "y_test_ohe = one_hot_encode(y_test, 10)\n",
        "y_hat = np.argmax(loaded_model.predict(X_test), axis=1)\n",
        "\n",
        "# print the first 30 labels and predictions\n",
        "print('labels:  \\t', y_test[:30])\n",
        "print('predictions:\\t', y_hat[:30])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Calculate the overall accuracy by comparing the predicted value against the test set."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print(\"Accuracy on the test set:\", np.average(y_hat == y_test))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Intelligent hyperparameter tuning\n",
        "We have trained the model with one set of hyperparameters, now let's how we can do hyperparameter tuning by launching multiple runs on the cluster. First let's define the parameter space using random sampling."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal\n",
        "from azureml.train.hyperdrive import choice, loguniform\n",
        "\n",
        "ps = RandomParameterSampling(\n",
        "    {\n",
        "        '--batch-size': choice(25, 50, 100),\n",
        "        '--first-layer-neurons': choice(10, 50, 200, 300, 500),\n",
        "        '--second-layer-neurons': choice(10, 50, 200, 500),\n",
        "        '--learning-rate': loguniform(-6, -1)\n",
        "    }\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we will create a new estimator without the above parameters since they will be passed in later by Hyperdrive configuration. Note we still need to keep the `data-folder` parameter since that's not a hyperparamter we will sweep."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "est = TensorFlow(source_directory=script_folder,\n",
        "                 script_params={'--data-folder': dataset.as_named_input('mnist').as_mount()},\n",
        "                 compute_target=compute_target,\n",
        "                 entry_script='keras_mnist.py', \n",
        "                 pip_packages=['keras==2.2.5','azureml-dataprep[pandas,fuse]','matplotlib'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we will define an early termnination policy. The `BanditPolicy` basically states to check the job every 2 iterations. If the primary metric (defined later) falls outside of the top 10% range, Azure ML terminate the job. This saves us from continuing to explore hyperparameters that don't show promise of helping reach our target metric."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we are ready to configure a run configuration object, and specify the primary metric `Accuracy` that's recorded in your training runs. If you go back to visit the training script, you will notice that this value is being logged after every epoch (a full batch set). We also want to tell the service that we are looking to maximizing this value. We also set the number of samples to 20, and maximal concurrent job to 4, which is the same as the number of nodes in our computer cluster."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "hdc = HyperDriveConfig(estimator=est, \n",
        "                       hyperparameter_sampling=ps, \n",
        "                       policy=policy, \n",
        "                       primary_metric_name='Accuracy', \n",
        "                       primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, \n",
        "                       max_total_runs=20,\n",
        "                       max_concurrent_runs=4)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, let's launch the hyperparameter tuning job."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "hdr = exp.submit(config=hdc)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can use a run history widget to show the progress. Be patient as this might take a while to complete."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "RunDetails(hdr).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "hdr.wait_for_completion(show_output=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Warm start a Hyperparameter Tuning experiment and resuming child runs\n",
        "Often times, finding the best hyperparameter values for your model can be an iterative process, needing multiple tuning runs that learn from previous hyperparameter tuning runs. Reusing knowledge from these previous runs will accelerate the hyperparameter tuning process, thereby reducing the cost of tuning the model and will potentially improve the primary metric of the resulting model. When warm starting a hyperparameter tuning experiment with Bayesian sampling, trials from the previous run will be used as prior knowledge to intelligently pick new samples, so as to improve the primary metric. Additionally, when using Random or Grid sampling, any early termination decisions will leverage metrics from the previous runs to determine poorly performing training runs. \n",
        "\n",
        "Azure Machine Learning allows you to warm start your hyperparameter tuning run by leveraging knowledge from up to 5 previously completed hyperparameter tuning parent runs. \n",
        "\n",
        "Additionally, there might be occasions when individual training runs of a hyperparameter tuning experiment are cancelled due to budget constraints or fail due to other reasons. It is now possible to resume such individual training runs from the last checkpoint (assuming your training script handles checkpoints). Resuming an individual training run will use the same hyperparameter configuration and mount the storage used for that run. The training script should accept the \"--resume-from\" argument, which contains the checkpoint or model files from which to resume the training run. You can also resume individual runs as part of an experiment that spends additional budget on hyperparameter tuning. Any additional budget, after resuming the specified training runs is used for exploring additional configurations.\n",
        "\n",
        "For more information on warm starting and resuming hyperparameter tuning runs, please refer to the [Hyperparameter Tuning for Azure Machine Learning documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters) \n",
        "\n",
        "## Find and register best model\n",
        "When all the jobs finish, we can find out the one that has the highest accuracy."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "best_run = hdr.get_best_run_by_primary_metric()\n",
        "print(best_run.get_details()['runDefinition']['arguments'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now let's list the model files uploaded during the run."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print(best_run.get_file_names())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can then register the folder (and all files in it) as a model named `keras-dnn-mnist` under the workspace for deployment."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "model = best_run.register_model(model_name='keras-mlp-mnist', model_path='outputs/model')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Deploy the model in ACI\n",
        "Now we are ready to deploy the model as a web service running in Azure Container Instance [ACI](https://azure.microsoft.com/en-us/services/container-instances/). Azure Machine Learning accomplishes this by constructing a Docker image with the scoring logic and model baked in.\n",
        "### Create score.py\n",
        "First, we will create a scoring script that will be invoked by the web service call. \n",
        "\n",
        "* Note that the scoring script must have two required functions, `init()` and `run(input_data)`. \n",
        "  * In `init()` function, you typically load the model into a global object. This function is executed only once when the Docker container is started. \n",
        "  * In `run(input_data)` function, the model is used to predict a value based on the input data. The input and output to `run` typically use JSON as serialization and de-serialization format but you are not limited to that."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "%%writefile score.py\n",
        "import json\n",
        "import numpy as np\n",
        "import os\n",
        "from keras.models import model_from_json\n",
        "\n",
        "from azureml.core.model import Model\n",
        "\n",
        "def init():\n",
        "    global model\n",
        "    \n",
        "    model_root = Model.get_model_path('keras-mlp-mnist')\n",
        "    # load json and create model\n",
        "    json_file = open(os.path.join(model_root, 'model.json'), 'r')\n",
        "    model_json = json_file.read()\n",
        "    json_file.close()\n",
        "    model = model_from_json(model_json)\n",
        "    # load weights into new model\n",
        "    model.load_weights(os.path.join(model_root, \"model.h5\"))   \n",
        "    model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])\n",
        "    \n",
        "def run(raw_data):\n",
        "    data = np.array(json.loads(raw_data)['data'])\n",
        "    # make prediction\n",
        "    y_hat = np.argmax(model.predict(data), axis=1)\n",
        "    return y_hat.tolist()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create myenv.yml\n",
        "We also need to create an environment file so that Azure Machine Learning can install the necessary packages in the Docker image which are required by your scoring script. In this case, we need to specify conda packages `tensorflow` and `keras`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "cd = CondaDependencies.create()\n",
        "cd.add_tensorflow_conda_package()\n",
        "cd.add_conda_package('keras==2.2.5')\n",
        "cd.save_to_file(base_directory='./', conda_file_path='myenv.yml')\n",
        "\n",
        "print(cd.serialize_to_string())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Deploy to ACI\n",
        "We are almost ready to deploy. Create the inference configuration and deployment configuration and deploy to ACI. This cell will run for about 7-8 minutes."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.webservice import AciWebservice\n",
        "from azureml.core.model import InferenceConfig\n",
        "from azureml.core.model import Model\n",
        "\n",
        "inference_config = InferenceConfig(runtime= \"python\", \n",
        "                                   entry_script=\"score.py\",\n",
        "                                   conda_file=\"myenv.yml\")\n",
        "\n",
        "aciconfig = AciWebservice.deploy_configuration(cpu_cores=1,\n",
        "                                               auth_enabled=True, # this flag generates API keys to secure access\n",
        "                                               memory_gb=1,\n",
        "                                               tags={'name': 'mnist', 'framework': 'Keras'},\n",
        "                                               description='Keras MLP on MNIST')\n",
        "\n",
        "service = Model.deploy(workspace=ws, \n",
        "                           name='keras-mnist-svc', \n",
        "                           models=[model], \n",
        "                           inference_config=inference_config, \n",
        "                           deployment_config=aciconfig)\n",
        "\n",
        "service.wait_for_deployment(True)\n",
        "print(service.state)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Tip: If something goes wrong with the deployment, the first thing to look at is the logs from the service by running the following command:** `print(service.get_logs())`"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This is the scoring web service endpoint:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print(service.scoring_uri)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Test the deployed model\n",
        "Let's test the deployed model. Pick 30 random samples from the test set, and send it to the web service hosted in ACI. Note here we are using the `run` API in the SDK to invoke the service. You can also make raw HTTP calls using any HTTP tool such as curl.\n",
        "\n",
        "After the invocation, we print the returned predictions and plot them along with the input images. Use red font color and inversed image (white on black) to highlight the misclassified samples. Note since the model accuracy is pretty high, you might have to run the below cell a few times before you can see a misclassified sample."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import json\n",
        "\n",
        "# find 30 random samples from test set\n",
        "n = 30\n",
        "sample_indices = np.random.permutation(X_test.shape[0])[0:n]\n",
        "\n",
        "test_samples = json.dumps({\"data\": X_test[sample_indices].tolist()})\n",
        "test_samples = bytes(test_samples, encoding='utf8')\n",
        "\n",
        "# predict using the deployed model\n",
        "result = service.run(input_data=test_samples)\n",
        "\n",
        "# compare actual value vs. the predicted values:\n",
        "i = 0\n",
        "plt.figure(figsize = (20, 1))\n",
        "\n",
        "for s in sample_indices:\n",
        "    plt.subplot(1, n, i + 1)\n",
        "    plt.axhline('')\n",
        "    plt.axvline('')\n",
        "    \n",
        "    # use different color for misclassified sample\n",
        "    font_color = 'red' if y_test[s] != result[i] else 'black'\n",
        "    clr_map = plt.cm.gray if y_test[s] != result[i] else plt.cm.Greys\n",
        "    \n",
        "    plt.text(x=10, y=-10, s=y_test[s], fontsize=18, color=font_color)\n",
        "    plt.imshow(X_test[s].reshape(28, 28), cmap=clr_map)\n",
        "    \n",
        "    i = i + 1\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can retreive the API keys used for accessing the HTTP endpoint."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# retreive the API keys. two keys were generated.\n",
        "key1, Key2 = service.get_keys()\n",
        "print(key1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now send construct raw HTTP request and send to the service. Don't forget to add key to the HTTP header."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import requests\n",
        "\n",
        "# send a random row from the test set to score\n",
        "random_index = np.random.randint(0, len(X_test)-1)\n",
        "input_data = \"{\\\"data\\\": [\" + str(list(X_test[random_index])) + \"]}\"\n",
        "\n",
        "headers = {'Content-Type':'application/json', 'Authorization': 'Bearer ' + key1}\n",
        "\n",
        "resp = requests.post(service.scoring_uri, input_data, headers=headers)\n",
        "\n",
        "print(\"POST to url\", service.scoring_uri)\n",
        "#print(\"input data:\", input_data)\n",
        "print(\"label:\", y_test[random_index])\n",
        "print(\"prediction:\", resp.text)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's look at the workspace after the web service was deployed. You should see \n",
        "* a registered model named 'keras-mlp-mnist' and with the id 'model:1'  \n",
        "* a webservice called 'keras-mnist-svc' with some scoring URL"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "models = ws.models\n",
        "for name, model in models.items():\n",
        "    print(\"Model: {}, ID: {}\".format(name, model.id))\n",
        "    \n",
        "webservices = ws.webservices\n",
        "for name, webservice in webservices.items():\n",
        "    print(\"Webservice: {}, scoring URI: {}\".format(name, webservice.scoring_uri))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Clean up\n",
        "You can delete the ACI deployment with a simple delete API call."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "service.delete()"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "ninhu"
      }
    ],
    "category": "training",
    "compute": [
      "AML Compute"
    ],
    "datasets": [
      "MNIST"
    ],
    "deployment": [
      "Azure Container Instance"
    ],
    "exclude_from_index": false,
    "framework": [
      "TensorFlow"
    ],
    "friendly_name": "Train a DNN using hyperparameter tuning and deploying with Keras",
    "index_order": 1,
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    },
    "tags": [
      "None"
    ],
    "task": "Create a multi-class classifier"
  },
  "nbformat": 4,
  "nbformat_minor": 2
}