MachineLearningNotebooks/training/05.distributed-tensorflow-with-parameter-server/05.distributed-tensorflow-with-parameter-server.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
    "\n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 05. Distributed TensorFlow with parameter server\n",
    "In this tutorial, you will train a TensorFlow model on the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset using native [distributed TensorFlow](https://www.tensorflow.org/deploy/distributed)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning (AML)\n",
    "* Go through the [00.configuration.ipynb](https://github.com/Azure/MachineLearningNotebooks/blob/master/00.configuration.ipynb) notebook to:\n",
    "    * install the AML SDK\n",
    "    * create a workspace and its configuration file (`config.json`)\n",
    "* Review the [tutorial](https://aka.ms/aml-notebook-hyperdrive) on single-node TensorFlow training using the SDK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check core SDK version number\n",
    "import azureml.core\n",
    "\n",
    "print(\"SDK version:\", azureml.core.VERSION)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialize workspace\n",
    "Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core.workspace import Workspace\n",
    "\n",
    "ws = Workspace.from_config()\n",
    "print('Workspace name: ' + ws.name, \n",
    "      'Azure region: ' + ws.location, \n",
    "      'Subscription id: ' + ws.subscription_id, \n",
    "      'Resource group: ' + ws.resource_group, sep = '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a remote compute target\n",
    "You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) to execute your training script on. In this tutorial, you create an [Azure Batch AI](https://docs.microsoft.com/azure/batch-ai/overview) cluster as your training compute resource. This code creates a cluster for you if it does not already exist in your workspace.\n",
    "\n",
    "**Creation of the cluster takes approximately 5 minutes.** If the cluster is already in your workspace this code will skip the cluster creation process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core.compute import ComputeTarget, BatchAiCompute\n",
    "from azureml.core.compute_target import ComputeTargetException\n",
    "\n",
    "# choose a name for your cluster\n",
    "cluster_name = \"gpucluster\"\n",
    "\n",
    "try:\n",
    "    compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
    "    print('Found existing compute target.')\n",
    "except ComputeTargetException:\n",
    "    print('Creating a new compute target...')\n",
    "    compute_config = BatchAiCompute.provisioning_configuration(vm_size='STANDARD_NC6', \n",
    "                                                                autoscale_enabled=True,\n",
    "                                                                cluster_min_nodes=0, \n",
    "                                                                cluster_max_nodes=4)\n",
    "\n",
    "    # create the cluster\n",
    "    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
    "\n",
    "    compute_target.wait_for_completion(show_output=True)\n",
    "\n",
    "    # Use the 'status' property to get a detailed status for the current cluster. \n",
    "    print(compute_target.status.serialize())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train model on the remote compute\n",
    "Now that we have the cluster ready to go, let's run our distributed training job."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a project directory\n",
    "Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script, and any additional files your training script depends on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "project_folder = './tf-distr-ps'\n",
    "os.makedirs(project_folder, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copy the training script `tf_mnist_replica.py` into this project directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import shutil\n",
    "shutil.copy('tf_mnist_replica.py', project_folder)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create an experiment\n",
    "Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace for this distributed TensorFlow tutorial. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core import Experiment\n",
    "\n",
    "experiment_name = 'tf-distr-ps'\n",
    "experiment = Experiment(ws, name=experiment_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a TensorFlow estimator\n",
    "The AML SDK's TensorFlow estimator enables you to easily submit TensorFlow training jobs for both single-node and distributed runs. For more information on the TensorFlow estimator, refer [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-tensorflow)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.train.dnn import TensorFlow\n",
    "\n",
    "script_params={\n",
    "    '--num_gpus': 1\n",
    "}\n",
    "\n",
    "estimator = TensorFlow(source_directory=project_folder,\n",
    "                       compute_target=compute_target,\n",
    "                       script_params=script_params,\n",
    "                       entry_script='tf_mnist_replica.py',\n",
    "                       node_count=2,\n",
    "                       worker_count=2,\n",
    "                       parameter_server_count=1,   \n",
    "                       distributed_backend='ps',\n",
    "                       use_gpu=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above code specifies that we will run our training script on `2` nodes, with two workers and one parameter server. In order to execute a native distributed TensorFlow run, you must provide the argument `distributed_backend='ps'`. Using this estimator with these settings, TensorFlow and its dependencies will be installed for you. However, if your script also uses other packages, make sure to install them via the `TensorFlow` constructor's `pip_packages` or `conda_packages` parameters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Submit job\n",
    "Run your experiment by submitting your estimator object. Note that this call is asynchronous."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "run = experiment.submit(estimator)\n",
    "print(run.get_details())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Monitor your run\n",
    "You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.train.widgets import RunDetails\n",
    "RunDetails(run).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, you can block until the script has completed training before running more code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "run.wait_for_completion(show_output=True) # this provides a verbose log"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  },
  "msauthor": "minxia"
 },
 "nbformat": 4,
 "nbformat_minor": 2
}