{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using Azure Machine Learning Pipelines for batch prediction\n",
"\n",
"In this notebook we will demonstrate how to run a batch scoring job using Azure Machine Learning pipelines. Our example job will be to take an already-trained image classification model, and run that model on some unlabeled images. The image classification model that we'll use is the __[Inception-V3 model](https://arxiv.org/abs/1512.00567)__ and we'll run this model on unlabeled images from the __[ImageNet](http://image-net.org/)__ dataset. \n",
"\n",
"The outline of this notebook is as follows:\n",
"\n",
"- Register the pretrained inception model into the model registry. \n",
"- Store the dataset images in a blob container.\n",
"- Use the registered model to do batch scoring on the images in the data blob container."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"If you haven't already, go through the configuration notebook at https://github.com/Azure/MachineLearningNotebooks first. It sets you up with a working config file containing your workspace, subscription id, and other information."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Datastore\n",
"from azureml.core import Experiment\n",
"from azureml.core.compute import AmlCompute, ComputeTarget\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"from azureml.core.runconfig import RunConfiguration\n",
"from azureml.data.data_reference import DataReference\n",
"from azureml.pipeline.core import Pipeline, PipelineData\n",
"from azureml.pipeline.steps import PythonScriptStep"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from azureml.core import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print('Workspace name: ' + ws.name, \n",
"      'Azure region: ' + ws.location, \n",
"      'Subscription id: ' + ws.subscription_id, \n",
"      'Resource group: ' + ws.resource_group, sep = '\\n')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up machine learning resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set up datastores\n",
"First, let's access the datastore that has the model, labels, and images. \n",
"\n",
"### Create a datastore that points to a blob container containing sample images\n",
"\n",
"We have created a public blob container `sampledata` on an account named `pipelinedata`, containing images from the ImageNet evaluation set. In the next step, we create a datastore with the name `images_datastore`, which points to this container. In the call to `register_azure_blob_container` below, setting the `overwrite` flag to `True` overwrites any datastore that was created previously with that name. \n",
"\n",
"This step can be changed to point to your blob container by providing your own `datastore_name`, `container_name`, and `account_name`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"account_name = \"pipelinedata\"\n",
"datastore_name = \"images_datastore\"\n",
"container_name = \"sampledata\"\n",
"\n",
"batchscore_blob = Datastore.register_azure_blob_container(\n",
"    ws,\n",
"    datastore_name=datastore_name,\n",
"    container_name=container_name,\n",
"    account_name=account_name,\n",
"    overwrite=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's specify the default datastore for the outputs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def_data_store = ws.get_default_datastore()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure data references\n",
"Now we add references to the data as inputs to the appropriate pipeline steps. A data source in a pipeline is represented by a DataReference object, which points to data that lives in, or is accessible from, a datastore. We need DataReference objects for the directory containing the input images, the directory in which the pretrained model is stored, and the directory containing the labels. The output directory is produced by the pipeline itself, so it is represented by a PipelineData object instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input_images = DataReference(\n",
"    datastore=batchscore_blob,\n",
"    data_reference_name=\"input_images\",\n",
"    path_on_datastore=\"batchscoring/images\",\n",
"    mode=\"download\")\n",
"\n",
"model_dir = DataReference(\n",
"    datastore=batchscore_blob,\n",
"    data_reference_name=\"input_model\",\n",
"    path_on_datastore=\"batchscoring/models\",\n",
"    mode=\"download\")\n",
"\n",
"label_dir = DataReference(\n",
"    datastore=batchscore_blob,\n",
"    data_reference_name=\"input_labels\",\n",
"    path_on_datastore=\"batchscoring/labels\",\n",
"    mode=\"download\")\n",
"\n",
"output_dir = PipelineData(\n",
"    name=\"scores\",\n",
"    datastore=def_data_store,\n",
"    output_path_on_compute=\"batchscoring/results\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create and attach Compute targets\n",
"Use the code below to create a new AmlCompute target, or reuse one that is already attached to the workspace. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# choose a name for your cluster\n",
"aml_compute_name = os.environ.get(\"AML_COMPUTE_NAME\", \"gpu-cluster\")\n",
"# environment variables are strings, so cast the node counts to int\n",
"cluster_min_nodes = int(os.environ.get(\"AML_COMPUTE_MIN_NODES\", 0))\n",
"cluster_max_nodes = int(os.environ.get(\"AML_COMPUTE_MAX_NODES\", 1))\n",
"vm_size = os.environ.get(\"AML_COMPUTE_SKU\", \"STANDARD_NC6\")\n",
"\n",
"if aml_compute_name in ws.compute_targets:\n",
"    compute_target = ws.compute_targets[aml_compute_name]\n",
"    if compute_target and isinstance(compute_target, AmlCompute):\n",
"        print('Found existing compute target: ' + aml_compute_name)\n",
"else:\n",
"    print('Creating a new compute target...')\n",
"    provisioning_config = AmlCompute.provisioning_configuration(\n",
"        vm_size=vm_size,  # NC6 is GPU-enabled\n",
"        vm_priority='lowpriority',  # optional\n",
"        min_nodes=cluster_min_nodes,\n",
"        max_nodes=cluster_max_nodes)\n",
"\n",
"    # create the cluster\n",
"    compute_target = ComputeTarget.create(ws, aml_compute_name, provisioning_config)\n",
"\n",
"    # can poll for a minimum number of nodes and for a specific timeout;\n",
"    # if no min node count is provided it will use the scale settings for the cluster\n",
"    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
"\n",
"    # for a more detailed view of current Azure Machine Learning Compute status, use get_status()\n",
"    print(compute_target.get_status().serialize())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare the Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download the Model\n",
"\n",
"Download and extract the model from http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz to `\"models\"`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create a local directory for the downloaded model; use a name other than\n",
"# model_dir so we don't overwrite the model_dir DataReference defined above\n",
"local_model_dir = 'models'\n",
"if not os.path.isdir(local_model_dir):\n",
"    os.mkdir(local_model_dir)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tarfile\n",
"import urllib.request\n",
"\n",
"url = \"http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz\"\n",
"response = urllib.request.urlretrieve(url, \"model.tar.gz\")\n",
"tar = tarfile.open(\"model.tar.gz\", \"r:gz\")\n",
"tar.extractall(local_model_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Register the model with Workspace"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import shutil\n",
"from azureml.core.model import Model\n",
"\n",
"# register downloaded model \n",
"model = Model.register(model_path=os.path.join(local_model_dir, \"inception_v3.ckpt\"),\n",
"                       model_name=\"inception\",  # this is the name the model is registered as\n",
"                       tags={'pretrained': \"inception\"},\n",
"                       description=\"Imagenet trained tensorflow inception\",\n",
"                       workspace=ws)\n",
"# remove the downloaded dir after registration if you wish\n",
"shutil.rmtree(local_model_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Write your scoring script"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To do the scoring, we use a batch scoring script `batch_scoring.py`, which is located in the same directory that this notebook is in. You can take a look at this script to see how you might modify it for your custom batch scoring task.\n",
"\n",
"The Python script `batch_scoring.py` takes input images, applies the image classification model to them, and writes the classification results to an output file.\n",
"\n",
"The script `batch_scoring.py` takes the following parameters:\n",
"\n",
"- `--model_name`: the name of the model being used, which is expected to be in the `model_dir` directory\n",
"- `--label_dir` : the directory holding the `labels.txt` file \n",
"- `--dataset_path`: the directory containing the input images\n",
"- `--output_dir` : the script will run the model on the data and write `result-labels.txt` to this directory\n",
"- `--batch_size` : the batch size used in running the model.\n",
"\n",
"A minimal, illustrative sketch of this argument interface follows in the next two cells."
]
},
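{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is an illustrative sketch only, not the actual `batch_scoring.py` that ships with this notebook: it shows how such a script might parse the arguments listed above with `argparse`. The sample values and defaults here are assumptions for demonstration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import argparse\n",
"\n",
"# sketch of the argument interface described above; batch_scoring.py itself may differ\n",
"parser = argparse.ArgumentParser(description=\"batch scoring (illustrative sketch)\")\n",
"parser.add_argument(\"--model_name\", required=True, help=\"name of the registered model\")\n",
"parser.add_argument(\"--label_dir\", required=True, help=\"directory holding labels.txt\")\n",
"parser.add_argument(\"--dataset_path\", required=True, help=\"directory of input images\")\n",
"parser.add_argument(\"--output_dir\", required=True, help=\"where result-labels.txt is written\")\n",
"parser.add_argument(\"--batch_size\", type=int, default=20, help=\"images scored per batch\")\n",
"\n",
"# on the compute target the pipeline passes these as real process arguments;\n",
"# here we parse a sample command line just to show the resulting namespace\n",
"sample_args = parser.parse_args([\"--model_name\", \"inception\",\n",
"                                 \"--label_dir\", \"labels\",\n",
"                                 \"--dataset_path\", \"images\",\n",
"                                 \"--output_dir\", \"outputs\",\n",
"                                 \"--batch_size\", \"20\"])\n",
"print(sample_args)"
]
},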
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build and run the batch scoring pipeline\n",
"You have everything you need to build the pipeline. Let's put it all together."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Specify the environment to run the script\n",
"Specify the conda dependencies for your script. You will need this object when you create the pipeline step later on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import DEFAULT_GPU_IMAGE\n",
"\n",
"cd = CondaDependencies.create(pip_packages=[\"tensorflow-gpu==1.10.0\", \"azureml-defaults\"])\n",
"\n",
"# Runconfig\n",
"amlcompute_run_config = RunConfiguration(conda_dependencies=cd)\n",
"amlcompute_run_config.environment.docker.enabled = True\n",
"amlcompute_run_config.environment.docker.gpu_support = True\n",
"amlcompute_run_config.environment.docker.base_image = DEFAULT_GPU_IMAGE\n",
"amlcompute_run_config.environment.spark.precache_packages = False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Specify the parameters for your pipeline\n",
"A subset of the parameters to the Python script can be supplied as input when we later re-run a `PublishedPipeline`. In this example, we expose the script's `batch_size` argument as such a parameter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.core.graph import PipelineParameter\n",
"batch_size_param = PipelineParameter(name=\"param_batch_size\", default_value=20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create the pipeline step\n",
"Create the pipeline step using the script, environment configuration, and parameters, and specify the compute target you created above as the execution target. We use `PythonScriptStep` to create the pipeline step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"inception_model_name = \"inception_v3.ckpt\"\n",
"\n",
"batch_score_step = PythonScriptStep(\n",
"    name=\"batch_scoring\",\n",
"    script_name=\"batch_scoring.py\",\n",
"    arguments=[\"--dataset_path\", input_images,\n",
"               \"--model_name\", \"inception\",\n",
"               \"--label_dir\", label_dir,\n",
"               \"--output_dir\", output_dir,\n",
"               \"--batch_size\", batch_size_param],\n",
"    compute_target=compute_target,\n",
"    inputs=[input_images, label_dir],\n",
"    outputs=[output_dir],\n",
"    runconfig=amlcompute_run_config\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run the pipeline\n",
"At this point you can run the pipeline and examine the output it produced. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline = Pipeline(workspace=ws, steps=[batch_score_step])\n",
"pipeline_run = Experiment(ws, 'batch_scoring').submit(pipeline, pipeline_params={\"param_batch_size\": 20})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Monitor the run"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(pipeline_run).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline_run.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download and review output"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"step_run = list(pipeline_run.get_children())[0]\n",
"step_run.download_file(\"./outputs/result-labels.txt\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\"result-labels.txt\", delimiter=\":\", header=None)\n",
"df.columns = [\"Filename\", \"Prediction\"]\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Publish a pipeline and rerun using a REST call"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a published pipeline\n",
"Once you are satisfied with the outcome of the run, you can publish the pipeline so you can run it with different input values later. When you publish a pipeline, you get a REST endpoint that accepts requests to invoke the pipeline with the set of parameters you incorporated above using `PipelineParameter`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"published_pipeline = pipeline_run.publish_pipeline(\n",
"    name=\"Inception_v3_scoring\", description=\"Batch scoring using Inception v3 model\", version=\"1.0\")\n",
"\n",
"published_id = published_pipeline.id"
]
},
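{
"cell_type": "markdown",
"metadata": {},
"source": [
"Saving `published_id` lets you fetch the published pipeline again in a later session. As a small illustrative aside (an addition to the original flow), `PublishedPipeline.get` retrieves it by id:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.core import PublishedPipeline\n",
"\n",
"# retrieve the pipeline we just published, using the id saved above\n",
"fetched_pipeline = PublishedPipeline.get(ws, published_id)\n",
"print(fetched_pipeline.name, fetched_pipeline.endpoint)"
]
},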
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Rerun the pipeline using the REST endpoint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get AAD token"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.authentication import AzureCliAuthentication\n",
"import requests\n",
"\n",
"cli_auth = AzureCliAuthentication()\n",
"aad_token = cli_auth.get_authentication_header()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run published pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rest_endpoint = published_pipeline.endpoint\n",
"\n",
"# specify batch size when running the pipeline\n",
"response = requests.post(rest_endpoint,\n",
"                         headers=aad_token,\n",
"                         json={\"ExperimentName\": \"batch_scoring\",\n",
"                               \"ParameterAssignments\": {\"param_batch_size\": 50}})\n",
"\n",
"# surface any HTTP error before trying to read the run id from the response\n",
"response.raise_for_status()\n",
"run_id = response.json()[\"Id\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Monitor the new run"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.core.run import PipelineRun\n",
"published_pipeline_run = PipelineRun(ws.experiments[\"batch_scoring\"], run_id)\n",
"\n",
"RunDetails(published_pipeline_run).show()"
]
}
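,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional cleanup step (an addition to this walkthrough, not part of the original flow), you can disable the published pipeline once you no longer want its REST endpoint to accept new runs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# disable the published pipeline so its REST endpoint stops accepting new runs;\n",
"# uncomment to actually perform the cleanup\n",
"# published_pipeline.disable()"
]
}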
],
"metadata": {
"authors": [
{
"name": "hichando"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}