{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/tutorials/machine-learning-pipelines-advanced/tutorial-pipeline-batch-scoring-classification.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Use Azure Machine Learning Pipelines for batch prediction\n",
"In this tutorial, you use Azure Machine Learning service pipelines to run a batch scoring image classification job. The example job uses the pre-trained [Inception-V3](https://arxiv.org/abs/1512.00567) CNN (convolutional neural network) Tensorflow model to classify unlabeled images. Machine learning pipelines optimize your workflow with speed, portability, and reuse so you can focus on your expertise, machine learning, rather than on infrastructure and automation. After building and publishing a pipeline, you can configure a REST endpoint to enable triggering the pipeline from any HTTP library on any platform.\n",
"\n",
"In this tutorial, you learn the following tasks:\n",
"\n",
"> * Configure workspace and download sample data\n",
"> * Create data objects to fetch and output data\n",
"> * Download, prepare, and register the model to your workspace\n",
"> * Provision compute targets and create a scoring script\n",
"> * Use ParallelRunStep to do batch scoring\n",
"> * Build, run, and publish a pipeline\n",
"> * Enable a REST endpoint for the pipeline\n",
"\n",
"If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning service](https://aka.ms/AMLFree) today."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"* Complete the [setup tutorial](https://docs.microsoft.com/azure/machine-learning/service/tutorial-1st-experiment-sdk-setup) if you don't already have an Azure Machine Learning service workspace or notebook virtual machine.\n",
"* After you complete the setup tutorial, open the **tutorials/tutorial-pipeline-batch-scoring-classification.ipynb** notebook using the same notebook server.\n",
"\n",
"This tutorial is also available on [GitHub](https://github.com/Azure/MachineLearningNotebooks/tree/master/tutorials) if you wish to run it in your own [local environment](how-to-configure-environment.md#local). Run `pip install azureml-sdk[notebooks] azureml-pipeline-core azureml-pipeline-steps pandas requests` to get the required packages."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure workspace and create datastore"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a workspace object from the existing workspace. A [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. `Workspace.from_config()` reads the file **config.json** and loads the authentication details into an object named `ws`. `ws` is used throughout the rest of the code in this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Check core SDK version number\n",
"import azureml.core\n",
"\n",
"print(\"SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from azureml.core import Workspace\n",
"\n",
"ws = Workspace.from_config()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a datastore for sample images\n",
"\n",
"Get the ImageNet evaluation public data sample from the public blob container `sampledata` on the account `pipelinedata`. Calling `register_azure_blob_container()` makes the data available to the workspace under the name `images_datastore`. Then specify the workspace default datastore as the output datastore, which you use for scoring output in the pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.datastore import Datastore\n",
"\n",
"batchscore_blob = Datastore.register_azure_blob_container(ws, \n",
" datastore_name=\"images_datastore\", \n",
" container_name=\"sampledata\", \n",
" account_name=\"pipelinedata\", \n",
" overwrite=True)\n",
"\n",
"def_data_store = ws.get_default_datastore()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create data objects\n",
"\n",
"When building pipelines, `Dataset` objects are used for reading data from workspace datastores, and `PipelineData` objects are used for transferring intermediate data between pipeline steps.\n",
"\n",
"This batch scoring example only uses one pipeline step, but in use-cases with multiple steps, the typical flow will include:\n",
"\n",
"1. Using `Dataset` objects as **inputs** to fetch raw data, performing some transformations, then **outputting** a `PipelineData` object.\n",
"1. Use the previous step's `PipelineData` **output object** as an *input object*, repeated for subsequent steps.\n",
"\n",
"For this scenario you create `Dataset` objects corresponding to the datastore directories for both the input images and the classification labels (y-test values). You also create a `PipelineData` object for the batch scoring output data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.dataset import Dataset\n",
"from azureml.pipeline.core import PipelineData\n",
"\n",
"input_images = Dataset.File.from_files((batchscore_blob, \"batchscoring/images/\"))\n",
"label_ds = Dataset.File.from_files((batchscore_blob, \"batchscoring/labels/\"))\n",
"output_dir = PipelineData(name=\"scores\", datastore=def_data_store)"
]
},
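{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reference for the multi-step flow described above, here is a minimal sketch of two chained steps. The `prepare.py` and `train.py` scripts and the `pipeline_scripts` folder are hypothetical placeholders, and `compute_target` is the compute target that you attach later in this tutorial:\n",
"\n",
"```python\n",
"from azureml.pipeline.core import PipelineData\n",
"from azureml.pipeline.steps import PythonScriptStep\n",
"\n",
"# intermediate location that connects the two steps\n",
"processed_data = PipelineData(name=\"processed_data\", datastore=def_data_store)\n",
"\n",
"# step 1: read the registered dataset and write transformed data to processed_data\n",
"prep_step = PythonScriptStep(\n",
"    name=\"prepare\",\n",
"    script_name=\"prepare.py\",            # hypothetical script\n",
"    source_directory=\"pipeline_scripts\", # hypothetical folder\n",
"    arguments=[\"--output_dir\", processed_data],\n",
"    inputs=[input_images.as_named_input(\"raw_images\")],\n",
"    outputs=[processed_data],\n",
"    compute_target=compute_target)\n",
"\n",
"# step 2: consume the output of step 1 as an input\n",
"train_step = PythonScriptStep(\n",
"    name=\"train\",\n",
"    script_name=\"train.py\",              # hypothetical script\n",
"    source_directory=\"pipeline_scripts\",\n",
"    arguments=[\"--input_dir\", processed_data],\n",
"    inputs=[processed_data],\n",
"    compute_target=compute_target)\n",
"```"
]
},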
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we need to register the datasets with the workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input_images = input_images.register(workspace=ws, name=\"input_images\")\n",
"label_ds = label_ds.register(workspace=ws, name=\"label_ds\", create_new_version=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download and register the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Download the pre-trained Tensorflow model to use it for batch scoring in the pipeline. First create a local directory where you store the model, then download and extract it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import tarfile\n",
"import urllib.request\n",
"\n",
"if not os.path.isdir(\"models\"):\n",
" os.mkdir(\"models\")\n",
" \n",
"response = urllib.request.urlretrieve(\"http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz\", \"model.tar.gz\")\n",
"tar = tarfile.open(\"model.tar.gz\", \"r:gz\")\n",
"tar.extractall(\"models\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you register the model to your workspace, which allows you to easily retrieve it in the pipeline process. In the `register()` static function, the `model_name` parameter is the key you use to locate your model throughout the SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import shutil\n",
"from azureml.core.model import Model\n",
"\n",
"# register downloaded model \n",
"model = Model.register(model_path=\"models/inception_v3.ckpt\",\n",
" model_name=\"inception\",\n",
" tags={\"pretrained\": \"inception\"},\n",
" description=\"Imagenet trained tensorflow inception\",\n",
" workspace=ws)\n",
"# remove the downloaded dir after registration if you wish\n",
"shutil.rmtree(\"models\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create and attach remote compute target\n",
"\n",
"Azure Machine Learning service pipelines cannot be run locally, and only run on cloud resources. Remote compute targets are reusable virtual compute environments where you run experiments and work-flows. Run the following code to create a GPU-enabled [`AmlCompute`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.compute.amlcompute.amlcompute?view=azure-ml-py) target, and attach it to your workspace. See the [conceptual article](https://docs.microsoft.com/azure/machine-learning/service/concept-compute-target) for more information on compute targets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import AmlCompute, ComputeTarget\n",
"from azureml.exceptions import ComputeTargetException\n",
"compute_name = \"gpu-cluster\"\n",
"\n",
"# checks to see if compute target already exists in workspace, else create it\n",
"try:\n",
" compute_target = ComputeTarget(workspace=ws, name=compute_name)\n",
"except ComputeTargetException:\n",
" config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_NC6\",\n",
" vm_priority=\"lowpriority\", \n",
" min_nodes=0, \n",
" max_nodes=1)\n",
"\n",
" compute_target = ComputeTarget.create(workspace=ws, name=compute_name, provisioning_configuration=config)\n",
" compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Write a scoring script"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To do the scoring, you create a batch scoring script `batch_scoring.py`, and write it to the current directory. The script takes a minibatch of input images, applies the classification model, and outputs the predictions to a results file.\n",
"\n",
"The script `batch_scoring.py` takes the following parameters, which get passed from the `ParallelRunStep` that you create later:\n",
"\n",
"- `--model_name`: the name of the model being used\n",
"- `--labels_dir` : the directory path having the `labels.txt` file \n",
"\n",
"The pipelines infrastructure uses the `ArgumentParser` class to pass parameters into pipeline steps. For example, in the code below the first argument `--model_name` is given the property identifier `model_name`. In the `main()` function, this property is accessed using `Model.get_model_path(args.model_name)`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pipeline in this tutorial only has one step and writes the output to a file, but for multi-step pipelines, you also use `ArgumentParser` to define a directory to write output data for input to subsequent steps. See the [notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/nyc-taxi-data-regression-model-building/nyc-taxi-data-regression-model-building.ipynb) for an example of passing data between multiple pipeline steps using the `ArgumentParser` design pattern."
]
},
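{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following cells create the `scripts` folder and write `batch_scoring.py` into it. The script below is a sketch of the contract that `ParallelRunStep` expects: an `init()` function that parses the arguments, loads the label list, builds the Inception-V3 graph, and restores the registered checkpoint, and a `run(mini_batch)` function that scores each image file in the mini-batch and returns one `filename: prediction` string per image. Treat the image preprocessing details as illustrative and adapt them to your own model if needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# folder used as source_directory by the parallel run configuration\n",
"os.makedirs(\"scripts\", exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile scripts/batch_scoring.py\n",
"import argparse\n",
"import os\n",
"\n",
"import tensorflow as tf\n",
"from azureml.core.model import Model\n",
"from tensorflow.contrib.slim.python.slim.nets import inception_v3\n",
"\n",
"slim = tf.contrib.slim\n",
"\n",
"image_size = 299\n",
"num_channel = 3\n",
"\n",
"\n",
"def get_class_labels(labels_dir):\n",
"    # labels.txt contains one class label per line\n",
"    labels_path = os.path.join(labels_dir, 'labels.txt')\n",
"    return [line.rstrip() for line in tf.gfile.GFile(labels_path).readlines()]\n",
"\n",
"\n",
"def init():\n",
"    global g_tf_sess, probabilities, label_dict, input_images\n",
"\n",
"    # arguments are passed in from the ParallelRunStep definition\n",
"    parser = argparse.ArgumentParser(description='batch scoring with Inception-V3')\n",
"    parser.add_argument('--model_name', dest='model_name', required=True)\n",
"    parser.add_argument('--labels_dir', dest='labels_dir', required=True)\n",
"    args, _ = parser.parse_known_args()\n",
"\n",
"    label_dict = get_class_labels(args.labels_dir)\n",
"    classes_num = len(label_dict)\n",
"\n",
"    # build the Inception-V3 graph\n",
"    with slim.arg_scope(inception_v3.inception_v3_arg_scope()):\n",
"        input_images = tf.placeholder(tf.float32, [1, image_size, image_size, num_channel])\n",
"        logits, _ = inception_v3.inception_v3(input_images, num_classes=classes_num, is_training=False)\n",
"        probabilities = tf.argmax(logits, 1)\n",
"\n",
"    config = tf.ConfigProto()\n",
"    config.gpu_options.allow_growth = True\n",
"    g_tf_sess = tf.Session(config=config)\n",
"    g_tf_sess.run(tf.global_variables_initializer())\n",
"    g_tf_sess.run(tf.local_variables_initializer())\n",
"\n",
"    # restore the checkpoint registered under the name passed in --model_name\n",
"    model_path = Model.get_model_path(args.model_name)\n",
"    saver = tf.train.Saver()\n",
"    saver.restore(g_tf_sess, model_path)\n",
"\n",
"\n",
"def file_to_tensor(file_path):\n",
"    # decode, resize, and scale a single image to the network input shape\n",
"    image_string = tf.read_file(file_path)\n",
"    image = tf.image.decode_image(image_string, channels=3)\n",
"    image.set_shape([None, None, None])\n",
"    image = tf.image.resize_images(image, [image_size, image_size])\n",
"    image = tf.divide(image, 255)\n",
"    image.set_shape([image_size, image_size, num_channel])\n",
"    return image\n",
"\n",
"\n",
"def run(mini_batch):\n",
"    # mini_batch is a list of image file paths\n",
"    result_list = []\n",
"    for file_path in mini_batch:\n",
"        test_image = file_to_tensor(file_path)\n",
"        out = g_tf_sess.run(test_image)\n",
"        result = g_tf_sess.run(probabilities, feed_dict={input_images: [out]})\n",
"        result_list.append('{}: {}'.format(os.path.basename(file_path), label_dict[result[0]]))\n",
"    return result_list"
]
},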
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build and run the pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running the pipeline, you create an object that defines the python environment and dependencies needed by your script `batch_scoring.py`. The main dependency required is Tensorflow, but you also install `azureml-core` and `azureml-dataset-runtime[fuse]` for background processes from the SDK. Create a `RunConfiguration` object using the dependencies, and also specify Docker and Docker-GPU support."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Environment\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"from azureml.core.runconfig import DEFAULT_GPU_IMAGE\n",
"\n",
"cd = CondaDependencies.create(pip_packages=[\"tensorflow-gpu==1.15.2\",\n",
" \"azureml-core\", \"azureml-dataset-runtime[fuse]\"])\n",
"\n",
"env = Environment(name=\"parallelenv\")\n",
"env.python.conda_dependencies=cd\n",
"env.docker.base_image = DEFAULT_GPU_IMAGE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create the configuration to wrap the inference script\n",
"Create the pipeline step using the script, environment configuration, and parameters. Specify the compute target you already attached to your workspace as the target of execution of the script. We will use PythonScriptStep to create the pipeline step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.steps import ParallelRunConfig\n",
"\n",
"parallel_run_config = ParallelRunConfig(\n",
" environment=env,\n",
" entry_script=\"batch_scoring.py\",\n",
" source_directory=\"scripts\",\n",
" output_action=\"append_row\",\n",
" append_row_file_name=\"parallel_run_step.txt\",\n",
" mini_batch_size=\"20\",\n",
" error_threshold=1,\n",
" compute_target=compute_target,\n",
" process_count_per_node=2,\n",
" node_count=1\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create the pipeline step\n",
"\n",
"A pipeline step is an object that encapsulates everything you need for running a pipeline including:\n",
"\n",
"* environment and dependency settings\n",
"* the compute resource to run the pipeline on\n",
"* input and output data, and any custom parameters\n",
"* reference to a script or SDK-logic to run during the step\n",
"\n",
"There are multiple classes that inherit from the parent class [`PipelineStep`](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.parallelrunstep?view=azure-ml-py) to assist with building a step using certain frameworks and stacks. In this example, you use the [`ParallelRunStep`](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunstep?view=azure-ml-py) class to define your step logic using a scoring script. \n",
"\n",
"An object reference in the `outputs` array becomes available as an **input** for a subsequent pipeline step, for scenarios where there is more than one step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.steps import ParallelRunStep\n",
"from datetime import datetime\n",
"\n",
"parallel_step_name = \"batchscoring-\" + datetime.now().strftime(\"%Y%m%d%H%M\")\n",
"\n",
"label_config = label_ds.as_named_input(\"labels_input\")\n",
"\n",
"batch_score_step = ParallelRunStep(\n",
" name=parallel_step_name,\n",
" inputs=[input_images.as_named_input(\"input_images\")],\n",
" output=output_dir,\n",
" arguments=[\"--model_name\", \"inception\",\n",
" \"--labels_dir\", label_config],\n",
" side_inputs=[label_config],\n",
" parallel_run_config=parallel_run_config,\n",
" allow_reuse=False\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a list of all classes for different step types, see the [steps package](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps?view=azure-ml-py)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run the pipeline\n",
"\n",
"Now you run the pipeline. First create a `Pipeline` object with your workspace reference and the pipeline step you created. The `steps` parameter is an array of steps, and in this case there is only one step for batch scoring. To build pipelines with multiple steps, you place the steps in order in this array.\n",
"\n",
"Next use the `Experiment.submit()` function to submit the pipeline for execution. You also specify the custom parameter `param_batch_size`. The `wait_for_completion` function will output logs during the pipeline build process, which allows you to see current progress.\n",
"\n",
"Note: The first pipeline run takes roughly **15 minutes**, as all dependencies must be downloaded, a Docker image is created, and the Python environment is provisioned/created. Running it again takes significantly less time as those resources are reused. However, total run time depends on the workload of your scripts and processes running in each pipeline step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Experiment\n",
"from azureml.pipeline.core import Pipeline\n",
"\n",
"pipeline = Pipeline(workspace=ws, steps=[batch_score_step])\n",
"pipeline_run = Experiment(ws, \"Tutorial-Batch-Scoring\").submit(pipeline)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This will output information of the pipeline run, including the link to the details page of portal.\n",
"pipeline_run"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Wait the run for completion and show output log to console\n",
"pipeline_run.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download and review output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following code to download the output file created from the `batch_scoring.py` script, then explore the scoring results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import tempfile\n",
"\n",
"batch_run = pipeline_run.find_step_run(batch_score_step.name)[0]\n",
"batch_output = batch_run.get_output_data(output_dir.name)\n",
"\n",
"target_dir = tempfile.mkdtemp()\n",
"batch_output.download(local_path=target_dir)\n",
"result_file = os.path.join(target_dir, batch_output.path_on_datastore, parallel_run_config.append_row_file_name)\n",
"\n",
"df = pd.read_csv(result_file, delimiter=\":\", header=None)\n",
"df.columns = [\"Filename\", \"Prediction\"]\n",
"print(\"Prediction has \", df.shape[0], \" rows\")\n",
"df.head(10) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Publish and run from REST endpoint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following code to publish the pipeline to your workspace. In your workspace in the portal, you can see metadata for the pipeline including run history and durations. You can also run the pipeline manually from the portal.\n",
"\n",
"Additionally, publishing the pipeline enables a REST endpoint to rerun the pipeline from any HTTP library on any platform."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"published_pipeline = pipeline_run.publish_pipeline(\n",
" name=\"Inception_v3_scoring\", description=\"Batch scoring using Inception v3 model\", version=\"1.0\")\n",
"\n",
"published_pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To run the pipeline from the REST endpoint, you first need an OAuth2 Bearer-type authentication header. This example uses interactive authentication for illustration purposes, but for most production scenarios requiring automated or headless authentication, use service principle authentication as [described in this notebook](https://aka.ms/pl-restep-auth).\n",
"\n",
"Service principle authentication involves creating an **App Registration** in **Azure Active Directory**, generating a client secret, and then granting your service principal **role access** to your machine learning workspace. You then use the [`ServicePrincipalAuthentication`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.authentication.serviceprincipalauthentication?view=azure-ml-py) class to manage your auth flow. \n",
"\n",
"Both `InteractiveLoginAuthentication` and `ServicePrincipalAuthentication` inherit from `AbstractAuthentication`, and in both cases you use the `get_authentication_header()` function in the same way to fetch the header."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.authentication import InteractiveLoginAuthentication\n",
"\n",
"interactive_auth = InteractiveLoginAuthentication()\n",
"auth_header = interactive_auth.get_authentication_header()"
]
},
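{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a minimal sketch of the service principal flow might look like the following. The environment variable names are placeholders for wherever you keep the tenant ID, client (service principal) ID, and client secret from your App Registration:\n",
"\n",
"```python\n",
"import os\n",
"\n",
"from azureml.core.authentication import ServicePrincipalAuthentication\n",
"\n",
"# placeholder environment variables -- supply values from your own App Registration\n",
"sp_auth = ServicePrincipalAuthentication(\n",
"    tenant_id=os.environ[\"AML_TENANT_ID\"],\n",
"    service_principal_id=os.environ[\"AML_PRINCIPAL_ID\"],\n",
"    service_principal_password=os.environ[\"AML_PRINCIPAL_PASS\"])\n",
"\n",
"# same call as with interactive authentication\n",
"auth_header = sp_auth.get_authentication_header()\n",
"```"
]
},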
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the REST url from the `endpoint` property of the published pipeline object. You can also find the REST url in your workspace in the portal. Build an HTTP POST request to the endpoint, specifying your authentication header. Additionally, add a JSON payload object with the experiment name and the batch size parameter. As a reminder, the `process_count_per_node` is passed through to `ParallelRunStep` because you defined it is defined as a `PipelineParameter` object in the step configuration.\n",
"\n",
"Make the request to trigger the run. Access the `Id` key from the response dict to get the value of the run id."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"rest_endpoint = published_pipeline.endpoint\n",
"response = requests.post(rest_endpoint, \n",
" headers=auth_header, \n",
" json={\"ExperimentName\": \"Tutorial-Batch-Scoring\",\n",
" \"ParameterAssignments\": {\"process_count_per_node\": 6}})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" response.raise_for_status()\n",
"except Exception: \n",
" raise Exception(\"Received bad response from the endpoint: {}\\n\"\n",
" \"Response Code: {}\\n\"\n",
" \"Headers: {}\\n\"\n",
" \"Content: {}\".format(rest_endpoint, response.status_code, response.headers, response.content))\n",
"\n",
"run_id = response.json().get('Id')\n",
"print('Submitted pipeline run: ', run_id)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the run id to monitor the status of the new run. This will take another 10-15 min to run and will look similar to the previous pipeline run, so if you don't need to see another pipeline run, you can skip watching the full output."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.core.run import PipelineRun\n",
"\n",
"published_pipeline_run = PipelineRun(ws.experiments[\"Tutorial-Batch-Scoring\"], run_id)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show detail information of the run\n",
"published_pipeline_run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean up resources\n",
"\n",
"Do not complete this section if you plan on running other Azure Machine Learning service tutorials.\n",
"\n",
"### Stop the notebook VM\n",
"\n",
"If you used a cloud notebook server, stop the VM when you are not using it to reduce cost.\n",
"\n",
"1. In your workspace, select **Compute**.\n",
"1. Select the **Notebook VMs** tab in the compute page.\n",
"1. From the list, select the VM.\n",
"1. Select **Stop**.\n",
"1. When you're ready to use the server again, select **Start**.\n",
"\n",
"### Delete everything\n",
"\n",
"If you don't plan to use the resources you created, delete them, so you don't incur any charges.\n",
"\n",
"1. In the Azure portal, select **Resource groups** on the far left.\n",
"1. From the list, select the resource group you created.\n",
"1. Select **Delete resource group**.\n",
"1. Enter the resource group name. Then select **Delete**.\n",
"\n",
"You can also keep the resource group but delete a single workspace. Display the workspace properties and select **Delete**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"In this machine learning pipelines tutorial, you did the following tasks:\n",
"\n",
"> * Built a pipeline with environment dependencies to run on a remote GPU compute resource\n",
"> * Created a scoring script to run batch predictions with a pre-trained Tensorflow model\n",
"> * Published a pipeline and enabled it to be run from a REST endpoint\n",
"\n",
"See the [how-to](https://docs.microsoft.com/azure/machine-learning/service/how-to-create-your-first-pipeline?view=azure-devops) for additional detail on building pipelines with the machine learning SDK."
]
}
],
"metadata": {
"authors": [
{
"name": [
"sanpil",
"trmccorm",
"pansav"
]
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
},
"msauthor": "trbye"
},
"nbformat": 4,
"nbformat_minor": 2
}