Files
MachineLearningNotebooks/how-to-use-azureml/training-with-deep-learning/distributed-chainer/distributed-chainer.ipynb
Roope Astala 2d41c00488 version 1.0.39
2019-05-14 16:01:14 -04:00

303 lines
10 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/training-with-deep-learning/distributed-chainer/distributed-chainer.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Distributed Chainer\n",
"In this tutorial, you will run a Chainer training example on the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset using ChainerMN distributed training across a GPU cluster."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"* If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [Configuration](../../../configuration.ipynb) notebook to install the Azure Machine Learning Python SDK and create an Azure ML `Workspace`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check core SDK version number\n",
"import azureml.core\n",
"\n",
"print(\"SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Diagnostics\n",
"Opt-in diagnostics for better experience, quality, and security of future releases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"Diagnostics"
]
},
"outputs": [],
"source": [
"from azureml.telemetry import set_diagnostics_collection\n",
"\n",
"set_diagnostics_collection(send_diagnostics=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize workspace\n",
"\n",
"Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.workspace import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print('Workspace name: ' + ws.name, \n",
" 'Azure region: ' + ws.location, \n",
" 'Subscription id: ' + ws.subscription_id, \n",
" 'Resource group: ' + ws.resource_group, sep = '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get default AmlCompute\n",
"You can create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, we use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training compute resource. Specifically, the below code gets the default compute cluster.\n",
"\n",
"As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"compute_target = ws.get_default_compute_target(type=\"GPU\")\n",
"\n",
"# use get_status() to get a detailed status for the current AmlCompute. \n",
"print(compute_target.get_status().serialize())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above code retrieves the default GPU compute. If you instead want to use default CPU compute, provide type=\"CPU\"."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train model on the remote compute\n",
"Now that we have the AmlCompute ready to go, let's run our distributed training job."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a project directory\n",
"Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"project_folder = './chainer-distr'\n",
"os.makedirs(project_folder, exist_ok=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare training script\n",
"Now you will need to create your training script. In this tutorial, the script for distributed training of MNIST is already provided for you at `train_mnist.py`. In practice, you should be able to take any custom Chainer training script as is and run it with Azure ML without having to modify your code."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once your script is ready, copy the training script `train_mnist.py` into the project directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import shutil\n",
"\n",
"shutil.copy('train_mnist.py', project_folder)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create an experiment\n",
"Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace for this distributed Chainer tutorial. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Experiment\n",
"\n",
"experiment_name = 'chainer-distr'\n",
"experiment = Experiment(ws, name=experiment_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a Chainer estimator\n",
"The Azure ML SDK's Chainer estimator enables you to easily submit Chainer training jobs for both single-node and distributed runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import MpiConfiguration\n",
"from azureml.train.dnn import Chainer\n",
"\n",
"estimator = Chainer(source_directory=project_folder,\n",
" compute_target=compute_target,\n",
" entry_script='train_mnist.py',\n",
" node_count=2,\n",
" distributed_training=MpiConfiguration(),\n",
" use_gpu=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above code specifies that we will run our training script on `2` nodes, with one worker per node. In order to execute a distributed run using MPI, you must provide the argument `distributed_backend='mpi'`. Using this estimator with these settings, Chainer and its dependencies will be installed for you. However, if your script also uses other packages, make sure to install them via the `Chainer` constructor's `pip_packages` or `conda_packages` parameters."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit job\n",
"Run your experiment by submitting your estimator object. Note that this call is asynchronous."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run = experiment.submit(estimator)\n",
"print(run)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Monitor your run\n",
"You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes. You can see that the widget automatically plots and visualizes the loss metric that we logged to the Azure ML run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"\n",
"RunDetails(run).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run.wait_for_completion(show_output=True)"
]
}
],
"metadata": {
"authors": [
{
"name": "minxia"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
},
"msauthor": "minxia"
},
"nbformat": 4,
"nbformat_minor": 2
}