MachineLearningNotebooks/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-keras/train-hyperparameter-tune-deploy-with-keras.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
        "\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-keras/train-hyperparameter-tune-deploy-with-keras.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nbpresent": {
          "id": "bf74d2e9-2708-49b1-934b-e0ede342f475"
        }
      },
      "source": [
        "# Training, hyperparameter tune, and deploy with Keras\n",
        "\n",
        "## Introduction\n",
        "This tutorial shows how to train a simple deep neural network using the MNIST dataset and Keras on Azure Machine Learning. MNIST is a popular dataset consisting of 70,000 grayscale images. Each image is a handwritten digit of `28x28` pixels, representing number from 0 to 9. The goal is to create a multi-class classifier to identify the digit each image represents, and deploy it as a web service in Azure.\n",
        "\n",
        "For more information about the MNIST dataset, please visit [Yan LeCun's website](http://yann.lecun.com/exdb/mnist/).\n",
        "\n",
        "## Prerequisite:\n",
        "* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning\n",
        "* If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) to:\n",
        "    * install the AML SDK\n",
        "    * create a workspace and its configuration file (`config.json`)\n",
        "* For local scoring test, you will also need to have `tensorflow` and `keras` installed in the current Jupyter kernel."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's get started. First let's import some Python libraries."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "nbpresent": {
          "id": "c377ea0c-0cd9-4345-9be2-e20fb29c94c3"
        }
      },
      "outputs": [],
      "source": [
        "%matplotlib inline\n",
        "import numpy as np\n",
        "import os\n",
        "import matplotlib.pyplot as plt"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "nbpresent": {
          "id": "edaa7f2f-2439-4148-b57a-8c794c0945ec"
        }
      },
      "outputs": [],
      "source": [
        "import azureml\n",
        "from azureml.core import Workspace\n",
        "\n",
        "# check core SDK version number\n",
        "print(\"Azure ML SDK Version: \", azureml.core.VERSION)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Initialize workspace\n",
        "Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "print('Workspace name: ' + ws.name, \n",
        "      'Azure region: ' + ws.location, \n",
        "      'Subscription id: ' + ws.subscription_id, \n",
        "      'Resource group: ' + ws.resource_group, sep='\\n')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nbpresent": {
          "id": "59f52294-4a25-4c92-bab8-3b07f0f44d15"
        }
      },
      "source": [
        "## Create an Azure ML experiment\n",
        "Let's create an experiment named \"keras-mnist\" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "nbpresent": {
          "id": "bc70f780-c240-4779-96f3-bc5ef9a37d59"
        }
      },
      "outputs": [],
      "source": [
        "from azureml.core import Experiment\n",
        "\n",
        "script_folder = './keras-mnist'\n",
        "os.makedirs(script_folder, exist_ok=True)\n",
        "\n",
        "exp = Experiment(workspace=ws, name='keras-mnist')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nbpresent": {
          "id": "defe921f-8097-44c3-8336-8af6700804a7"
        }
      },
      "source": [
        "## Download MNIST dataset\n",
        "In order to train on the MNIST dataset we will first need to download it from Yan LeCun's web site directly and save them in a `data` folder locally."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import urllib\n",
        "\n",
        "os.makedirs('./data/mnist', exist_ok=True)\n",
        "\n",
        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz', filename='./data/mnist/train-images.gz')\n",
        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz', filename='./data/mnist/train-labels.gz')\n",
        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', filename='./data/mnist/test-images.gz')\n",
        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', filename='./data/mnist/test-labels.gz')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nbpresent": {
          "id": "c3f2f57c-7454-4d3e-b38d-b0946cf066ea"
        }
      },
      "source": [
        "## Show some sample images\n",
        "Let's load the downloaded compressed file into numpy arrays using some utility functions included in the `utils.py` library file from the current folder. Then we use `matplotlib` to plot 30 random images from the dataset along with their labels."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "nbpresent": {
          "id": "396d478b-34aa-4afa-9898-cdce8222a516"
        }
      },
      "outputs": [],
      "source": [
        "from utils import load_data, one_hot_encode\n",
        "\n",
        "# note we also shrink the intensity values (X) from 0-255 to 0-1. This helps the neural network converge faster.\n",
        "X_train = load_data('./data/mnist/train-images.gz', False) / 255.0\n",
        "y_train = load_data('./data/mnist/train-labels.gz', True).reshape(-1)\n",
        "\n",
        "X_test = load_data('./data/mnist/test-images.gz', False) / 255.0\n",
        "y_test = load_data('./data/mnist/test-labels.gz', True).reshape(-1)\n",
        "\n",
        "count = 0\n",
        "sample_size = 30\n",
        "plt.figure(figsize = (16, 6))\n",
        "for i in np.random.permutation(X_train.shape[0])[:sample_size]:\n",
        "    count = count + 1\n",
        "    plt.subplot(1, sample_size, count)\n",
        "    plt.axhline('')\n",
        "    plt.axvline('')\n",
        "    plt.text(x = 10, y = -10, s = y_train[i], fontsize = 18)\n",
        "    plt.imshow(X_train[i].reshape(28, 28), cmap = plt.cm.Greys)\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Upload MNIST dataset to default datastore \n",
        "A [datastore](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data) is a place where data can be stored that is then made accessible to a Run either by means of mounting or copying the data to the compute target. A datastore can either be backed by an Azure Blob Storage or and Azure File Share (ADLS will be supported in the future). For simple data handling, each workspace provides a default datastore that can be used, in case the data is not already in Blob Storage or File Share."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ds = ws.get_default_datastore()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In this next step, we will upload the training and test set into the workspace's default datastore, which we will then later be mount on an `AmlCompute` cluster for training."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ds.upload(src_dir='./data/mnist', target_path='mnist', overwrite=True, show_progress=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Create or Attach existing AmlCompute\n",
        "You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, you create `AmlCompute` as your training compute resource."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If we could not find the cluster with the given name, then we will create a new cluster here. We will create an `AmlCompute` cluster of `STANDARD_NC6` GPU VMs. This process is broken down into 3 steps:\n",
        "1. create the configuration (this step is local and only takes a second)\n",
        "2. create the cluster (this step will take about **20 seconds**)\n",
        "3. provision the VMs to bring the cluster to the initial size (of 1 in this case). This step will take about **3-5 minutes** and is providing only sparse output in the process. Please make sure to wait until the call returns before moving to the next cell"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.compute import ComputeTarget, AmlCompute\n",
        "from azureml.core.compute_target import ComputeTargetException\n",
        "\n",
        "# choose a name for your cluster\n",
        "cluster_name = \"gpu-cluster\"\n",
        "\n",
        "try:\n",
        "    compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
        "    print('Found existing compute target')\n",
        "except ComputeTargetException:\n",
        "    print('Creating a new compute target...')\n",
        "    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', \n",
        "                                                           max_nodes=4)\n",
        "\n",
        "    # create the cluster\n",
        "    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
        "\n",
        "    # can poll for a minimum number of nodes and for a specific timeout. \n",
        "    # if no min node count is provided it uses the scale settings for the cluster\n",
        "    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
        "\n",
        "# use get_status() to get a detailed status for the current cluster. \n",
        "print(compute_target.get_status().serialize())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now that you have created the compute target, let's see what the workspace's `compute_targets` property returns. You should now see one entry named \"gpu-cluster\" of type `AmlCompute`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "compute_targets = ws.compute_targets\n",
        "for name, ct in compute_targets.items():\n",
        "    print(name, ct.type, ct.provisioning_state)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Copy the training files into the script folder\n",
        "The Keras training script is already created for you. You can simply copy it into the script folder, together with the utility library used to load compressed data file into numpy array."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import shutil\n",
        "\n",
        "# the training logic is in the keras_mnist.py file.\n",
        "shutil.copy('./keras_mnist.py', script_folder)\n",
        "\n",
        "# the utils.py just helps loading data from the downloaded MNIST dataset into numpy arrays.\n",
        "shutil.copy('./utils.py', script_folder)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nbpresent": {
          "id": "2039d2d5-aca6-4f25-a12f-df9ae6529cae"
        }
      },
      "source": [
        "## Construct neural network in Keras\n",
        "In the training script `keras_mnist.py`, it creates a very simple DNN (deep neural network), with just 2 hidden layers. The input layer has 28 * 28 = 784 neurons, each representing a pixel in an image. The first hidden layer has 300 neurons, and the second hidden layer has 100 neurons. The output layer has 10 neurons, each representing a targeted label from 0 to 9.\n",
        "\n",
        "![DNN](nn.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Azure ML concepts  \n",
        "Please note the following three things in the code below:\n",
        "1. The script accepts arguments using the argparse package. In this case there is one argument `--data_folder` which specifies the file system folder in which the script can find the MNIST data\n",
        "```\n",
        "    parser = argparse.ArgumentParser()\n",
        "    parser.add_argument('--data_folder')\n",
        "```\n",
        "2. The script is accessing the Azure ML `Run` object by executing `run = Run.get_context()`. Further down the script is using the `run` to report the loss and accuracy at the end of each epoch via callback.\n",
        "```\n",
        "    run.log('Loss', log['loss'])\n",
        "    run.log('Accuracy', log['acc'])\n",
        "```\n",
        "3. When running the script on Azure ML, you can write files out to a folder `./outputs` that is relative to the root directory. This folder is specially tracked by Azure ML in the sense that any files written to that folder during script execution on the remote target will be picked up by Run History; these files (known as artifacts) will be available as part of the run history record."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The next cell will print out the training code for you to inspect."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "with open(os.path.join(script_folder, './keras_mnist.py'), 'r') as f:\n",
        "    print(f.read())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Create TensorFlow estimator & add Keras\n",
        "Next, we construct an `azureml.train.dnn.TensorFlow` estimator object, use the `gpu-cluster` as compute target, and pass the mount-point of the datastore to the training code as a parameter.\n",
        "The TensorFlow estimator is providing a simple way of launching a TensorFlow training job on a compute target. It will automatically provide a docker image that has TensorFlow installed. In this case, we add `keras` package (for the Keras framework obviously), and `matplotlib` package for plotting a \"Loss vs. Accuracy\" chart and record it in run history."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.train.dnn import TensorFlow\n",
        "\n",
        "script_params = {\n",
        "    '--data-folder': ds.path('mnist').as_mount(),\n",
        "    '--batch-size': 50,\n",
        "    '--first-layer-neurons': 300,\n",
        "    '--second-layer-neurons': 100,\n",
        "    '--learning-rate': 0.001\n",
        "}\n",
        "\n",
        "est = TensorFlow(source_directory=script_folder,\n",
        "                 script_params=script_params,\n",
        "                 compute_target=compute_target, \n",
        "                 pip_packages=['keras', 'matplotlib'],\n",
        "                 entry_script='keras_mnist.py', \n",
        "                 use_gpu=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "And if you are curious, this is what the mounting point looks like:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print(ds.path('mnist').as_mount())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Submit job to run\n",
        "Submit the estimator to the Azure ML experiment to kick off the execution."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run = exp.submit(est)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Monitor the Run\n",
        "As the Run is executed, it will go through the following stages:\n",
        "1. Preparing: A docker image is created matching the Python environment specified by the TensorFlow estimator and it will be uploaded to the workspace's Azure Container Registry. This step will only happen once for each Python environment -- the container will then be cached for subsequent runs. Creating and uploading the image takes about **5 minutes**. While the job is preparing, logs are streamed to the run history and can be viewed to monitor the progress of the image creation.\n",
        "\n",
        "2. Scaling: If the compute needs to be scaled up (i.e. the AmlCompute cluster requires more nodes to execute the run than currently available), the cluster will attempt to scale up in order to make the required amount of nodes available. Scaling typically takes about **5 minutes**.\n",
        "\n",
        "3. Running: All scripts in the script folder are uploaded to the compute target, data stores are mounted/copied and the `entry_script` is executed. While the job is running, stdout and the `./logs` folder are streamed to the run history and can be viewed to monitor the progress of the run.\n",
        "\n",
        "4. Post-Processing: The `./outputs` folder of the run is copied over to the run history\n",
        "\n",
        "There are multiple ways to check the progress of a running job. We can use a Jupyter notebook widget. \n",
        "\n",
        "**Note: The widget will automatically update ever 10-15 seconds, always showing you the most up-to-date information about the run**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.widgets import RunDetails\n",
        "RunDetails(run).show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can also periodically check the status of the run object, and navigate to Azure portal to monitor the run."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.wait_for_completion(show_output=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the outputs of the training script, it prints out the Keras version number. Please make a note of it."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### The Run object\n",
        "The Run object provides the interface to the run history -- both to the job and to the control plane (this notebook), and both while the job is running and after it has completed. It provides a number of interesting features for instance:\n",
        "* `run.get_details()`: Provides a rich set of properties of the run\n",
        "* `run.get_metrics()`: Provides a dictionary with all the metrics that were reported for the Run\n",
        "* `run.get_file_names()`: List all the files that were uploaded to the run history for this Run. This will include the `outputs` and `logs` folder, azureml-logs and other logs, as well as files that were explicitly uploaded to the run using `run.upload_file()`\n",
        "\n",
        "Below are some examples -- please run through them and inspect their output. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.get_details()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.get_metrics()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.get_file_names()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Download the saved model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the training script, the Keras model is saved into two files, `model.json` and `model.h5`, in the `outputs/models` folder on the gpu-cluster AmlCompute node. Azure ML automatically uploaded anything written in the `./outputs` folder into run history file store. Subsequently, we can use the `run` object to download the model files. They are under the the `outputs/model` folder in the run history file store, and are downloaded into a local folder named `model`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# create a model folder in the current directory\n",
        "os.makedirs('./model', exist_ok=True)\n",
        "\n",
        "for f in run.get_file_names():\n",
        "    if f.startswith('outputs/model'):\n",
        "        output_file_path = os.path.join('./model', f.split('/')[-1])\n",
        "        print('Downloading from {} to {} ...'.format(f, output_file_path))\n",
        "        run.download_file(name=f, output_file_path=output_file_path)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Predict on the test set\n",
        "Let's check the version of the local Keras. Make sure it matches with the version number printed out in the training script. Otherwise you might not be able to load the model properly."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import keras\n",
        "import tensorflow as tf\n",
        "\n",
        "print(\"Keras version:\", keras.__version__)\n",
        "print(\"Tensorflow version:\", tf.__version__)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now let's load the downloaded model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from keras.models import model_from_json\n",
        "\n",
        "# load json and create model\n",
        "json_file = open('model/model.json', 'r')\n",
        "loaded_model_json = json_file.read()\n",
        "json_file.close()\n",
        "loaded_model = model_from_json(loaded_model_json)\n",
        "# load weights into new model\n",
        "loaded_model.load_weights(\"model/model.h5\")\n",
        "print(\"Model loaded from disk.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Feed test dataset to the persisted model to get predictions."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# evaluate loaded model on test data\n",
        "loaded_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])\n",
        "y_test_ohe = one_hot_encode(y_test, 10)\n",
        "y_hat = np.argmax(loaded_model.predict(X_test), axis=1)\n",
        "\n",
        "# print the first 30 labels and predictions\n",
        "print('labels:  \\t', y_test[:30])\n",
        "print('predictions:\\t', y_hat[:30])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Calculate the overall accuracy by comparing the predicted value against the test set."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print(\"Accuracy on the test set:\", np.average(y_hat == y_test))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Intelligent hyperparameter tuning\n",
        "We have trained the model with one set of hyperparameters, now let's how we can do hyperparameter tuning by launching multiple runs on the cluster. First let's define the parameter space using random sampling."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal\n",
        "from azureml.train.hyperdrive import choice, loguniform\n",
        "\n",
        "ps = RandomParameterSampling(\n",
        "    {\n",
        "        '--batch-size': choice(25, 50, 100),\n",
        "        '--first-layer-neurons': choice(10, 50, 200, 300, 500),\n",
        "        '--second-layer-neurons': choice(10, 50, 200, 500),\n",
        "        '--learning-rate': loguniform(-6, -1)\n",
        "    }\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we will create a new estimator without the above parameters since they will be passed in later by Hyperdrive configuration. Note we still need to keep the `data-folder` parameter since that's not a hyperparamter we will sweep."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "est = TensorFlow(source_directory=script_folder,\n",
        "                 script_params={'--data-folder': ds.path('mnist').as_mount()},\n",
        "                 compute_target=compute_target,\n",
        "                 pip_packages=['keras', 'matplotlib'],\n",
        "                 entry_script='keras_mnist.py', \n",
        "                 use_gpu=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we will define an early termnination policy. The `BanditPolicy` basically states to check the job every 2 iterations. If the primary metric (defined later) falls outside of the top 10% range, Azure ML terminate the job. This saves us from continuing to explore hyperparameters that don't show promise of helping reach our target metric."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we are ready to configure a run configuration object, and specify the primary metric `Accuracy` that's recorded in your training runs. If you go back to visit the training script, you will notice that this value is being logged after every epoch (a full batch set). We also want to tell the service that we are looking to maximizing this value. We also set the number of samples to 20, and maximal concurrent job to 4, which is the same as the number of nodes in our computer cluster."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "hdc = HyperDriveConfig(estimator=est, \n",
        "                       hyperparameter_sampling=ps, \n",
        "                       policy=policy, \n",
        "                       primary_metric_name='Accuracy', \n",
        "                       primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, \n",
        "                       max_total_runs=20,\n",
        "                       max_concurrent_runs=4)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, let's launch the hyperparameter tuning job."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "hdr = exp.submit(config=hdc)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can use a run history widget to show the progress. Be patient as this might take a while to complete."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "RunDetails(hdr).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "hdr.wait_for_completion(show_output=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Find and register best model\n",
        "When all the jobs finish, we can find out the one that has the highest accuracy."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "best_run = hdr.get_best_run_by_primary_metric()\n",
        "print(best_run.get_details()['runDefinition']['arguments'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now let's list the model files uploaded during the run."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print(best_run.get_file_names())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can then register the folder (and all files in it) as a model named `keras-dnn-mnist` under the workspace for deployment."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "model = best_run.register_model(model_name='keras-mlp-mnist', model_path='outputs/model')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Deploy the model in ACI\n",
        "Now we are ready to deploy the model as a web service running in Azure Container Instance [ACI](https://azure.microsoft.com/en-us/services/container-instances/). Azure Machine Learning accomplishes this by constructing a Docker image with the scoring logic and model baked in.\n",
        "### Create score.py\n",
        "First, we will create a scoring script that will be invoked by the web service call. \n",
        "\n",
        "* Note that the scoring script must have two required functions, `init()` and `run(input_data)`. \n",
        "  * In `init()` function, you typically load the model into a global object. This function is executed only once when the Docker container is started. \n",
        "  * In `run(input_data)` function, the model is used to predict a value based on the input data. The input and output to `run` typically use JSON as serialization and de-serialization format but you are not limited to that."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "%%writefile score.py\n",
        "import json\n",
        "import numpy as np\n",
        "import os\n",
        "from keras.models import model_from_json\n",
        "\n",
        "from azureml.core.model import Model\n",
        "\n",
        "def init():\n",
        "    global model\n",
        "    \n",
        "    model_root = Model.get_model_path('keras-mlp-mnist')\n",
        "    # load json and create model\n",
        "    json_file = open(os.path.join(model_root, 'model.json'), 'r')\n",
        "    model_json = json_file.read()\n",
        "    json_file.close()\n",
        "    model = model_from_json(model_json)\n",
        "    # load weights into new model\n",
        "    model.load_weights(os.path.join(model_root, \"model.h5\"))   \n",
        "    model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])\n",
        "    \n",
        "def run(raw_data):\n",
        "    data = np.array(json.loads(raw_data)['data'])\n",
        "    # make prediction\n",
        "    y_hat = np.argmax(model.predict(data), axis=1)\n",
        "    return y_hat.tolist()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create myenv.yml\n",
        "We also need to create an environment file so that Azure Machine Learning can install the necessary packages in the Docker image which are required by your scoring script. In this case, we need to specify conda packages `tensorflow` and `keras`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.runconfig import CondaDependencies\n",
        "\n",
        "cd = CondaDependencies.create()\n",
        "cd.add_conda_package('tensorflow')\n",
        "cd.add_conda_package('keras')\n",
        "cd.save_to_file(base_directory='./', conda_file_path='myenv.yml')\n",
        "\n",
        "print(cd.serialize_to_string())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Deploy to ACI\n",
        "We are almost ready to deploy. Create a deployment configuration and specify the number of CPUs and gigbyte of RAM needed for your ACI container. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.webservice import AciWebservice\n",
        "\n",
        "aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, \n",
        "                                               auth_enabled=True, # this flag generates API keys to secure access\n",
        "                                               memory_gb=1, \n",
        "                                               tags={'name':'mnist', 'framework': 'Keras'},\n",
        "                                               description='Keras MLP on MNIST')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Deployment Process\n",
        "Now we can deploy. **This cell will run for about 7-8 minutes**. Behind the scene, it will do the following:\n",
        "1. **Build Docker image**  \n",
        "Build a Docker image using the scoring file (`score.py`), the environment file (`myenv.yml`), and the `model` object. \n",
        "2. **Register image**    \n",
        "Register that image under the workspace. \n",
        "3. **Ship to ACI**    \n",
        "And finally ship the image to the ACI infrastructure, start up a container in ACI using that image, and expose an HTTP endpoint to accept REST client calls."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.image import ContainerImage\n",
        "\n",
        "imgconfig = ContainerImage.image_configuration(execution_script=\"score.py\", \n",
        "                                               runtime=\"python\", \n",
        "                                               conda_file=\"myenv.yml\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "%%time\n",
        "from azureml.core.webservice import Webservice\n",
        "\n",
        "service = Webservice.deploy_from_model(workspace=ws,\n",
        "                                       name='keras-mnist-svc',\n",
        "                                       deployment_config=aciconfig,\n",
        "                                       models=[model],\n",
        "                                       image_config=imgconfig)\n",
        "\n",
        "service.wait_for_deployment(show_output=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Tip: If something goes wrong with the deployment, the first thing to look at is the logs from the service by running the following command:**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print(service.get_logs())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This is the scoring web service endpoint:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print(service.scoring_uri)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Test the deployed model\n",
        "Let's test the deployed model. Pick 30 random samples from the test set, and send it to the web service hosted in ACI. Note here we are using the `run` API in the SDK to invoke the service. You can also make raw HTTP calls using any HTTP tool such as curl.\n",
        "\n",
        "After the invocation, we print the returned predictions and plot them along with the input images. Use red font color and inversed image (white on black) to highlight the misclassified samples. Note since the model accuracy is pretty high, you might have to run the below cell a few times before you can see a misclassified sample."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import json\n",
        "\n",
        "# find 30 random samples from test set\n",
        "n = 30\n",
        "sample_indices = np.random.permutation(X_test.shape[0])[0:n]\n",
        "\n",
        "test_samples = json.dumps({\"data\": X_test[sample_indices].tolist()})\n",
        "test_samples = bytes(test_samples, encoding='utf8')\n",
        "\n",
        "# predict using the deployed model\n",
        "result = service.run(input_data=test_samples)\n",
        "\n",
        "# compare actual value vs. the predicted values:\n",
        "i = 0\n",
        "plt.figure(figsize = (20, 1))\n",
        "\n",
        "for s in sample_indices:\n",
        "    plt.subplot(1, n, i + 1)\n",
        "    plt.axhline('')\n",
        "    plt.axvline('')\n",
        "    \n",
        "    # use different color for misclassified sample\n",
        "    font_color = 'red' if y_test[s] != result[i] else 'black'\n",
        "    clr_map = plt.cm.gray if y_test[s] != result[i] else plt.cm.Greys\n",
        "    \n",
        "    plt.text(x=10, y=-10, s=y_hat[s], fontsize=18, color=font_color)\n",
        "    plt.imshow(X_test[s].reshape(28, 28), cmap=clr_map)\n",
        "    \n",
        "    i = i + 1\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can retreive the API keys used for accessing the HTTP endpoint."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# retreive the API keys. two keys were generated.\n",
        "key1, Key2 = service.get_keys()\n",
        "print(key1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now send construct raw HTTP request and send to the service. Don't forget to add key to the HTTP header."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import requests\n",
        "\n",
        "# send a random row from the test set to score\n",
        "random_index = np.random.randint(0, len(X_test)-1)\n",
        "input_data = \"{\\\"data\\\": [\" + str(list(X_test[random_index])) + \"]}\"\n",
        "\n",
        "headers = {'Content-Type':'application/json', 'Authorization': 'Bearer ' + key1}\n",
        "\n",
        "resp = requests.post(service.scoring_uri, input_data, headers=headers)\n",
        "\n",
        "print(\"POST to url\", service.scoring_uri)\n",
        "#print(\"input data:\", input_data)\n",
        "print(\"label:\", y_test[random_index])\n",
        "print(\"prediction:\", resp.text)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's look at the workspace after the web service was deployed. You should see \n",
        "* a registered model named 'keras-mlp-mnist' and with the id 'model:1'\n",
        "* an image called 'keras-mnist-svc' and with a docker image location pointing to your workspace's Azure Container Registry (ACR)  \n",
        "* a webservice called 'keras-mnist-svc' with some scoring URL"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "models = ws.models\n",
        "for name, model in models.items():\n",
        "    print(\"Model: {}, ID: {}\".format(name, model.id))\n",
        "    \n",
        "images = ws.images\n",
        "for name, image in images.items():\n",
        "    print(\"Image: {}, location: {}\".format(name, image.image_location))\n",
        "    \n",
        "webservices = ws.webservices\n",
        "for name, webservice in webservices.items():\n",
        "    print(\"Webservice: {}, scoring URI: {}\".format(name, webservice.scoring_uri))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Clean up\n",
        "You can delete the ACI deployment with a simple delete API call."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "service.delete()"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "ninhu"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.7"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}