{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-pytorch/train-hyperparameter-tune-deploy-with-pytorch.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train, hyperparameter tune, and deploy with PyTorch\n",
"\n",
"In this tutorial, you will train, hyperparameter tune, and deploy a PyTorch model using the Azure Machine Learning (Azure ML) Python SDK.\n",
"\n",
"This tutorial will train an image classification model using transfer learning, based on PyTorch's [Transfer Learning tutorial](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html). The model is trained to classify ants and bees by first using a pretrained ResNet18 model that has been trained on the [ImageNet](http://image-net.org/index) dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"* If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [Configuration](../../../configuration.ipynb) notebook to install the Azure Machine Learning Python SDK and create an Azure ML `Workspace`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check core SDK version number\n",
"import azureml.core\n",
"\n",
"print(\"SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Diagnostics\n",
"Opt-in diagnostics for better experience, quality, and security of future releases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"Diagnostics"
]
},
"outputs": [],
"source": [
"from azureml.telemetry import set_diagnostics_collection\n",
"\n",
"set_diagnostics_collection(send_diagnostics=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize workspace\n",
"Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.workspace import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print('Workspace name: ' + ws.name, \n",
" 'Azure region: ' + ws.location, \n",
" 'Subscription id: ' + ws.subscription_id, \n",
" 'Resource group: ' + ws.resource_group, sep='\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create or Attach existing AmlCompute\n",
"You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, we use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training compute resource.\n",
"\n",
"**Creation of AmlCompute takes approximately 5 minutes.** If the AmlCompute with that name is already in your workspace, this code will skip the creation process.\n",
"\n",
"As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# choose a name for your cluster\n",
"cluster_name = \"gpu-cluster\"\n",
"\n",
"try:\n",
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
" print('Found existing compute target.')\n",
"except ComputeTargetException:\n",
" print('Creating a new compute target...')\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', \n",
" max_nodes=4)\n",
"\n",
" # create the cluster\n",
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
"\n",
" compute_target.wait_for_completion(show_output=True)\n",
"\n",
"# use get_status() to get a detailed status for the current cluster. \n",
"print(compute_target.get_status().serialize())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above code creates a GPU cluster. If you instead want to create a CPU cluster, provide a different VM size to the `vm_size` parameter, such as `STANDARD_D2_V2`."
]
},
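{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a CPU cluster could be provisioned with the same APIs as above (a sketch; the cluster name is illustrative):\n",
"```Python\n",
"cpu_cluster_name = 'cpu-cluster'\n",
"\n",
"cpu_compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',\n",
"                                                            max_nodes=4)\n",
"cpu_compute_target = ComputeTarget.create(ws, cpu_cluster_name, cpu_compute_config)\n",
"cpu_compute_target.wait_for_completion(show_output=True)\n",
"```"
]
},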
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train model on the remote compute\n",
"Now that you have your data and training script prepared, you are ready to train on your remote compute cluster. You can take advantage of Azure compute to leverage GPUs to cut down your training time. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a project directory\n",
"Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"project_folder = './pytorch-hymenoptera'\n",
"os.makedirs(project_folder, exist_ok=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download training data\n",
"The dataset we will use (located [here](https://download.pytorch.org/tutorial/hymenoptera_data.zip) as a zip file) consists of about 120 training images each for ants and bees, with 75 validation images for each class. [Hymenoptera](https://en.wikipedia.org/wiki/Hymenoptera) is the order of insects that includes ants and bees. We will download and extract the dataset as part of our training script `pytorch_train.py`"
]
},
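{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a minimal sketch of the download-and-extract step that the training script performs (the exact implementation in `pytorch_train.py` may differ):\n",
"```Python\n",
"import os\n",
"import urllib.request\n",
"import zipfile\n",
"\n",
"data_url = 'https://download.pytorch.org/tutorial/hymenoptera_data.zip'\n",
"zip_path = 'hymenoptera_data.zip'\n",
"\n",
"# download and extract the dataset if it is not already present\n",
"if not os.path.exists('hymenoptera_data'):\n",
"    urllib.request.urlretrieve(data_url, zip_path)\n",
"    with zipfile.ZipFile(zip_path, 'r') as zip_ref:\n",
"        zip_ref.extractall('.')\n",
"```"
]
},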
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare training script\n",
"Now you will need to create your training script. In this tutorial, the training script is already provided for you at `pytorch_train.py`. In practice, you should be able to take any custom training script as is and run it with Azure ML without having to modify your code.\n",
"\n",
"However, if you would like to use Azure ML's [tracking and metrics](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#metrics) capabilities, you will have to add a small amount of Azure ML code inside your training script. \n",
"\n",
"In `pytorch_train.py`, we will log some metrics to our Azure ML run. To do so, we will access the Azure ML `Run` object within the script:\n",
"```Python\n",
"from azureml.core.run import Run\n",
"run = Run.get_context()\n",
"```\n",
"Further within `pytorch_train.py`, we log the learning rate and momentum parameters, and the best validation accuracy the model achieves:\n",
"```Python\n",
"run.log('lr', np.float(learning_rate))\n",
"run.log('momentum', np.float(momentum))\n",
"\n",
"run.log('best_val_acc', np.float(best_acc))\n",
"```\n",
"These run metrics will become particularly important when we begin hyperparameter tuning our model in the \"Tune model hyperparameters\" section."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once your script is ready, copy the training script `pytorch_train.py` into your project directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import shutil\n",
"\n",
"shutil.copy('pytorch_train.py', project_folder)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create an experiment\n",
"Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace for this transfer learning PyTorch tutorial. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Experiment\n",
"\n",
"experiment_name = 'pytorch-hymenoptera'\n",
"experiment = Experiment(ws, name=experiment_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a PyTorch estimator\n",
"The Azure ML SDK's PyTorch estimator enables you to easily submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, refer [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch). The following code will define a single-node PyTorch job."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.train.dnn import PyTorch\n",
"\n",
"script_params = {\n",
" '--num_epochs': 30,\n",
" '--output_dir': './outputs'\n",
"}\n",
"\n",
"estimator = PyTorch(source_directory=project_folder, \n",
" script_params=script_params,\n",
" compute_target=compute_target,\n",
" entry_script='pytorch_train.py',\n",
" use_gpu=True,\n",
" pip_packages=['pillow==5.4.1'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `script_params` parameter is a dictionary containing the command-line arguments to your training script `entry_script`. Please note the following:\n",
"- We passed our training data reference `ds_data` to our script's `--data_dir` argument. This will 1) mount our datastore on the remote compute and 2) provide the path to the training data `hymenoptera_data` on our datastore.\n",
"- We specified the output directory as `./outputs`. The `outputs` directory is specially treated by Azure ML in that all the content in this directory gets uploaded to your workspace as part of your run history. The files written to this directory are therefore accessible even once your remote run is over. In this tutorial, we will save our trained model to this output directory.\n",
"\n",
"To leverage the Azure VM's GPU for training, we set `use_gpu=True`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit job\n",
"Run your experiment by submitting your estimator object. Note that this call is asynchronous."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run = experiment.submit(estimator)\n",
"print(run)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# to get more details of your run\n",
"print(run.get_details())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Monitor your run\n",
"You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"\n",
"RunDetails(run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, you can block until the script has completed training before running more code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tune model hyperparameters\n",
"Now that we've seen how to do a simple PyTorch training run using the SDK, let's see if we can further improve the accuracy of our model. We can optimize our model's hyperparameters using Azure Machine Learning's hyperparameter tuning capabilities."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Start a hyperparameter sweep\n",
"First, we will define the hyperparameter space to sweep over. Since our training script uses a learning rate schedule to decay the learning rate every several epochs, let's tune the initial learning rate and the momentum parameters. In this example we will use random sampling to try different configuration sets of hyperparameters to maximize our primary metric, the best validation accuracy (`best_val_acc`).\n",
"\n",
"Then, we specify the early termination policy to use to early terminate poorly performing runs. Here we use the `BanditPolicy`, which will terminate any run that doesn't fall within the slack factor of our primary evaluation metric. In this tutorial, we will apply this policy every epoch (since we report our `best_val_acc` metric every epoch and `evaluation_interval=1`). Notice we will delay the first policy evaluation until after the first `10` epochs (`delay_evaluation=10`).\n",
"Refer [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-tune-hyperparameters#specify-an-early-termination-policy) for more information on the BanditPolicy and other policies available."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, HyperDriveConfig, uniform, PrimaryMetricGoal\n",
"\n",
"param_sampling = RandomParameterSampling( {\n",
" 'learning_rate': uniform(0.0005, 0.005),\n",
" 'momentum': uniform(0.9, 0.99)\n",
" }\n",
")\n",
"\n",
"early_termination_policy = BanditPolicy(slack_factor=0.15, evaluation_interval=1, delay_evaluation=10)\n",
"\n",
"hyperdrive_config = HyperDriveConfig(estimator=estimator,\n",
" hyperparameter_sampling=param_sampling, \n",
" policy=early_termination_policy,\n",
" primary_metric_name='best_val_acc',\n",
" primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,\n",
" max_total_runs=8,\n",
" max_concurrent_runs=4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, lauch the hyperparameter tuning job."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# start the HyperDrive run\n",
"hyperdrive_run = experiment.submit(hyperdrive_config)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Monitor HyperDrive runs\n",
"You can monitor the progress of the runs with the following Jupyter widget. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RunDetails(hyperdrive_run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or block until the HyperDrive sweep has completed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hyperdrive_run.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Find and register the best model\n",
"Once all the runs complete, we can find the run that produced the model with the highest accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run = hyperdrive_run.get_best_run_by_primary_metric()\n",
"best_run_metrics = best_run.get_metrics()\n",
"print(best_run)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Best Run is:\\n Validation accuracy: {0:.5f} \\n Learning rate: {1:.5f} \\n Momentum: {2:.5f}'.format(\n",
" best_run_metrics['best_val_acc'][-1],\n",
" best_run_metrics['lr'],\n",
" best_run_metrics['momentum'])\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, register the model from your best-performing run to your workspace. The `model_path` parameter takes in the relative path on the remote VM to the model file in your `outputs` directory. In the next section, we will deploy this registered model as a web service."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = best_run.register_model(model_name = 'pytorch-hymenoptera', model_path = 'outputs/model.pt')\n",
"print(model.name, model.id, model.version, sep = '\\t')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy model as web service\n",
"Once you have your trained model, you can deploy the model on Azure. In this tutorial, we will deploy the model as a web service in [Azure Container Instances](https://docs.microsoft.com/en-us/azure/container-instances/) (ACI). For more information on deploying models using Azure ML, refer [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-deploy-and-where)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create scoring script\n",
"\n",
"First, we will create a scoring script that will be invoked by the web service call. Note that the scoring script must have two required functions:\n",
"* `init()`: In this function, you typically load the model into a `global` object. This function is executed only once when the Docker container is started. \n",
"* `run(input_data)`: In this function, the model is used to predict a value based on the input data. The input and output typically use JSON as serialization and deserialization format, but you are not limited to that.\n",
"\n",
"Refer to the scoring script `pytorch_score.py` for this tutorial. Our web service will use this file to predict whether an image is an ant or a bee. When writing your own scoring script, don't forget to test it locally first before you go and deploy the web service."
]
},
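{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration only, a minimal scoring script following this structure might look like the sketch below. This is not the actual `pytorch_score.py`; it assumes the model was saved with `torch.save(model, ...)` and registered under the name used later in this tutorial:\n",
"```Python\n",
"import json\n",
"import torch\n",
"from azureml.core.model import Model\n",
"\n",
"def init():\n",
"    global model\n",
"    # resolve the path of the registered model inside the service container\n",
"    model_path = Model.get_model_path('pytorch-hymenoptera')\n",
"    model = torch.load(model_path, map_location='cpu')\n",
"    model.eval()\n",
"\n",
"def run(input_data):\n",
"    # deserialize the JSON payload into a tensor and run a forward pass\n",
"    data = torch.tensor(json.loads(input_data)['data'])\n",
"    with torch.no_grad():\n",
"        output = model(data)\n",
"        _, pred = torch.max(output, 1)\n",
"    classes = ['ants', 'bees']\n",
"    return classes[int(pred)]\n",
"```"
]
},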
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create environment file\n",
"Then, we will need to create an environment file (`myenv.yml`) that specifies all of the scoring script's package dependencies. This file is used to ensure that all of those dependencies are installed in the Docker image by Azure ML. In this case, we need to specify `azureml-core`, `torch` and `torchvision`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.conda_dependencies import CondaDependencies \n",
"\n",
"myenv = CondaDependencies.create(pip_packages=['azureml-defaults', 'torch', 'torchvision'])\n",
"\n",
"with open(\"myenv.yml\",\"w\") as f:\n",
" f.write(myenv.serialize_to_string())\n",
" \n",
"print(myenv.serialize_to_string())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure the container image\n",
"Now configure the Docker image that you will use to build your ACI container."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.image import ContainerImage\n",
"\n",
"image_config = ContainerImage.image_configuration(execution_script='pytorch_score.py', \n",
" runtime='python', \n",
" conda_file='myenv.yml',\n",
" description='Image with hymenoptera model')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure the ACI container\n",
"We are almost ready to deploy. Create a deployment configuration file to specify the number of CPUs and gigabytes of RAM needed for your ACI container. While it depends on your model, the default of `1` core and `1` gigabyte of RAM is usually sufficient for many models."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.webservice import AciWebservice\n",
"\n",
"aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, \n",
" memory_gb=1, \n",
" tags={'data': 'hymenoptera', 'method':'transfer learning', 'framework':'pytorch'},\n",
" description='Classify ants/bees using transfer learning with PyTorch')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Deploy the registered model\n",
"Finally, let's deploy a web service from our registered model. Deploy the web service using the ACI config and image config files created in the previous steps. We pass the `model` object in a list to the `models` parameter. If you would like to deploy more than one registered model, append the additional models to this list."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"from azureml.core.webservice import Webservice\n",
"\n",
"service_name = 'aci-hymenoptera'\n",
"service = Webservice.deploy_from_model(workspace=ws,\n",
" name=service_name,\n",
" models=[model],\n",
" image_config=image_config,\n",
" deployment_config=aciconfig,)\n",
"\n",
"service.wait_for_deployment(show_output=True)\n",
"print(service.state)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If your deployment fails for any reason and you need to redeploy, make sure to delete the service before you do so: `service.delete()`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Tip: If something goes wrong with the deployment, the first thing to look at is the logs from the service by running the following command:**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"service.get_logs()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the web service's HTTP endpoint, which accepts REST client calls. This endpoint can be shared with anyone who wants to test the web service or integrate it into an application."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(service.scoring_uri)"
]
},
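{
"cell_type": "markdown",
"metadata": {},
"source": [
"For completeness, here is a hypothetical sketch of calling the endpoint directly over HTTP with the `requests` library. The `input_data` variable stands in for a preprocessed image array like the one produced in the next section; the test below uses the SDK's `run` API instead:\n",
"```Python\n",
"import json\n",
"import requests\n",
"\n",
"headers = {'Content-Type': 'application/json'}\n",
"# 'input_data' is assumed to be a preprocessed image array (see the next section)\n",
"payload = json.dumps({'data': input_data.tolist()})\n",
"response = requests.post(service.scoring_uri, data=payload, headers=headers)\n",
"print(response.status_code, response.text)\n",
"```"
]
},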
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test the web service\n",
"Finally, let's test our deployed web service. We will send the data as a JSON string to the web service hosted in ACI and use the SDK's `run` API to invoke the service. Here we will take an image from our validation data to predict on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from PIL import Image\n",
"import matplotlib.pyplot as plt\n",
"\n",
"plt.imshow(Image.open('test_img.jpg'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"from torchvision import transforms\n",
" \n",
"def preprocess(image_file):\n",
" \"\"\"Preprocess the input image.\"\"\"\n",
" data_transforms = transforms.Compose([\n",
" transforms.Resize(256),\n",
" transforms.CenterCrop(224),\n",
" transforms.ToTensor(),\n",
" transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])\n",
" ])\n",
"\n",
" image = Image.open(image_file)\n",
" image = data_transforms(image).float()\n",
" image = torch.tensor(image)\n",
" image = image.unsqueeze(0)\n",
" return image.numpy()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input_data = preprocess('test_img.jpg')\n",
"result = service.run(input_data=json.dumps({'data': input_data.tolist()}))\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean up\n",
"Once you no longer need the web service, you can delete it with a simple API call."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"service.delete()"
]
}
],
"metadata": {
"authors": [
{
"name": "ninhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}