{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright (c) Microsoft Corporation. All rights reserved.\n", "\n", "Licensed under the MIT License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Distributed CNTK using custom docker images\n", "In this tutorial, you will train a CNTK model on the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset using a custom docker image and distributed training." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning\n", "* If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) to:\n", " * install the AML SDK\n", " * create a workspace and its configuration file (`config.json`)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check core SDK version number\n", "import azureml.core\n", "\n", "print(\"SDK version:\", azureml.core.VERSION)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Diagnostics\n", "Opt-in diagnostics for better experience, quality, and security of future releases." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "Diagnostics" ] }, "outputs": [], "source": [ "from azureml.telemetry import set_diagnostics_collection\n", "\n", "set_diagnostics_collection(send_diagnostics=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initialize workspace\n", "\n", "Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.core.workspace import Workspace\n", "\n", "ws = Workspace.from_config()\n", "print('Workspace name: ' + ws.name,\n", " 'Azure region: ' + ws.location,\n", " 'Subscription id: ' + ws.subscription_id,\n", " 'Resource group: ' + ws.resource_group, sep='\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create or Attach existing AmlCompute\n", "You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, you create `AmlCompute` as your training compute resource.\n", "\n", "**Creation of AmlCompute takes approximately 5 minutes.** If an AmlCompute cluster with that name already exists in your workspace, this code skips the creation process and reuses it.\n", "\n", "As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.core.compute import ComputeTarget, AmlCompute\n", "from azureml.core.compute_target import ComputeTargetException\n", "\n", "# choose a name for your cluster\n", "cluster_name = \"gpu-cluster\"\n", "\n", "try:\n", " compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n", " print('Found existing compute target.')\n", "except ComputeTargetException:\n", " print('Creating a new compute target...')\n", " compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',\n", " max_nodes=4)\n", "\n", " # create the cluster\n", " compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n", "\n", " compute_target.wait_for_completion(show_output=True)\n", "\n", "# use get_status() to get a detailed status for the current AmlCompute\n", "print(compute_target.get_status().serialize())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Upload training data\n", "For this tutorial, we will be using the MNIST dataset.\n", "\n", "First, let's download the dataset. We've included the `install_mnist.py` script to download the data and convert it to a CNTK-supported format. Our data files will get written to a directory named `'mnist'`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import install_mnist\n", "\n", "install_mnist.main('mnist')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the data accessible for remote training, you will need to upload the data from your local machine to the cloud. AML provides a convenient way to do so via a [Datastore](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data). The datastore provides a mechanism for you to upload/download data, and interact with it from your remote compute targets. \n", "\n", "Each workspace is associated with a default datastore. 
In this tutorial, we will upload the training data to this default datastore, which we will then mount on the remote compute for training in the next section." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds = ws.get_default_datastore()\n", "print(ds.datastore_type, ds.account_name, ds.container_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code will upload the training data to the path `./mnist` on the default datastore." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds.upload(src_dir='./mnist', target_path='./mnist')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's get a reference to the path on the datastore with the training data. We can do so using the `path` method. In the next section, we can then pass this reference to our training script's `--data_dir` argument. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path_on_datastore = 'mnist'\n", "ds_data = ds.path(path_on_datastore)\n", "print(ds_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train model on the remote compute\n", "Now that we have the cluster ready to go, let's run our distributed training job." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a project directory\n", "Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script, and any additional files your training script depends on." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "project_folder = './cntk-distr'\n", "os.makedirs(project_folder, exist_ok=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copy the training script `cntk_distr_mnist.py` into this project directory." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import shutil\n", "\n", "shutil.copy('cntk_distr_mnist.py', project_folder)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create an experiment\n", "Create an [experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace for this distributed CNTK tutorial. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.core import Experiment\n", "\n", "experiment_name = 'cntk-distr'\n", "experiment = Experiment(ws, name=experiment_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create an Estimator\n", "The AML SDK's base `Estimator` lets you submit custom scripts for both single-node and distributed runs. Use this generic estimator for training code that relies on frameworks such as sklearn or CNTK, which do not have a corresponding framework-specific estimator. For more information on using the generic estimator, see [this article](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-ml-models)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "estimator-remarks-sample" ] }, "outputs": [], "source": [ "from azureml.train.estimator import Estimator\n", "\n", "script_params = {\n", " '--num_epochs': 20,\n", " '--data_dir': ds_data.as_mount(),\n", " '--output_dir': './outputs'\n", "}\n", "\n", "estimator = Estimator(source_directory=project_folder,\n", " compute_target=compute_target,\n", " entry_script='cntk_distr_mnist.py',\n", " script_params=script_params,\n", " node_count=2,\n", " process_count_per_node=1,\n", " distributed_backend='mpi',\n", " pip_packages=['cntk-gpu==2.6'],\n", " custom_docker_image='microsoft/mmlspark:gpu-0.12',\n", " use_gpu=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We would like to train our model using a [pre-built Docker container](https://hub.docker.com/r/microsoft/mmlspark/). To do so, pass the name of the Docker image to the `custom_docker_image` argument. Finally, we pass `cntk-gpu==2.6` to `pip_packages` to install CNTK 2.6 on our custom image.\n", "\n", "The above code specifies that we will run our training script on `2` nodes (`node_count=2`), with one worker per node (`process_count_per_node=1`). Because distributed CNTK uses MPI, you must also provide the argument `distributed_backend='mpi'`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit job\n", "Run your experiment by submitting your estimator object. Note that this call is asynchronous." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "run = experiment.submit(estimator)\n", "print(run)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Monitor your run\n", "You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.widgets import RunDetails\n", "\n", "RunDetails(run).show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can block until the script has completed training before running more code." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "run.wait_for_completion(show_output=True)" ] } ], "metadata": { "authors": [ { "name": "ninhu" } ], "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }