MachineLearningNotebooks/how-to-use-azureml/training-with-deep-learning/train-tensorflow-resume-training/train-tensorflow-resume-training.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
        "\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/training-with-deep-learning/train-tensorflow-resume-training/tensorflow-resume-training.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Resuming Tensorflow training from previous run\n",
        "In this tutorial, you will resume a mnist model in TensorFlow from a previously submitted run."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Prerequisites\n",
        "* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning (AML)\n",
        "* Go through the [configuration notebook](../../../configuration.ipynb) to:\n",
        "    * install the AML SDK\n",
        "    * create a workspace and its configuration file (`config.json`)\n",
        "* Review the [tutorial](../train-hyperparameter-tune-deploy-with-tensorflow/train-hyperparameter-tune-deploy-with-tensorflow.ipynb) on single-node TensorFlow training using the SDK"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Check core SDK version number\n",
        "import azureml.core\n",
        "\n",
        "print(\"SDK version:\", azureml.core.VERSION)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Diagnostics\n",
        "Opt-in diagnostics for better experience, quality, and security of future releases."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "tags": [
          "Diagnostics"
        ]
      },
      "outputs": [],
      "source": [
        "from azureml.telemetry import set_diagnostics_collection\n",
        "\n",
        "set_diagnostics_collection(send_diagnostics=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Initialize workspace\n",
        "Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.workspace import Workspace\n",
        "\n",
        "ws = Workspace.from_config()\n",
        "print('Workspace name: ' + ws.name, \n",
        "      'Azure region: ' + ws.location, \n",
        "      'Subscription id: ' + ws.subscription_id, \n",
        "      'Resource group: ' + ws.resource_group, sep='\\n')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Create or Attach existing AmlCompute\n",
        "You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, you create `AmlCompute` as your training compute resource.\n",
        "\n",
        "**Creation of AmlCompute takes approximately 5 minutes.** If the AmlCompute with that name is already in your workspace this code will skip the creation process.\n",
        "\n",
        "As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.compute import ComputeTarget, AmlCompute\n",
        "from azureml.core.compute_target import ComputeTargetException\n",
        "\n",
        "# choose a name for your cluster\n",
        "cluster_name = \"gpu-cluster\"\n",
        "\n",
        "try:\n",
        "    compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
        "    print('Found existing compute target.')\n",
        "except ComputeTargetException:\n",
        "    print('Creating a new compute target...')\n",
        "    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', \n",
        "                                                           max_nodes=4)\n",
        "\n",
        "    # create the cluster\n",
        "    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
        "\n",
        "    compute_target.wait_for_completion(show_output=True)\n",
        "\n",
        "# use get_status() to get a detailed status for the current cluster. \n",
        "print(compute_target.get_status().serialize())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The above code creates a GPU cluster. If you instead want to create a CPU cluster, provide a different VM size to the `vm_size` parameter, such as `STANDARD_D2_V2`."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Upload data to datastore\n",
        "To make data accessible for remote training, AML provides a convenient way to do so via a [Datastore](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data). The datastore provides a mechanism for you to upload/download data to Azure Storage, and interact with it from your remote compute targets. \n",
        "\n",
        "If your data is already stored in Azure, or you download the data as part of your training script, you will not need to do this step. For this tutorial, although you can download the data in your training script, we will demonstrate how to upload the training data to a datastore and access it during training to illustrate the datastore functionality."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First download the data from Yan LeCun's web site directly and save them in a data folder locally."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "import urllib\n",
        "\n",
        "os.makedirs('./data/mnist', exist_ok=True)\n",
        "\n",
        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz', filename = './data/mnist/train-images.gz')\n",
        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz', filename = './data/mnist/train-labels.gz')\n",
        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', filename = './data/mnist/test-images.gz')\n",
        "urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', filename = './data/mnist/test-labels.gz')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Each workspace is associated with a default datastore. In this tutorial, we will upload the training data to this default datastore."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ds = ws.get_default_datastore()\n",
        "print(ds.datastore_type, ds.account_name, ds.container_name)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Upload MNIST data to the default datastore."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ds.upload(src_dir='./data/mnist', target_path='mnist', overwrite=True, show_progress=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For convenience, let's get a reference to the datastore. In the next section, we can then pass this reference to our training script's `--data-folder` argument. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ds_data = ds.as_mount()\n",
        "print(ds_data)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Train model on the remote compute"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create a project directory\n",
        "Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script, and any additional files your training script depends on."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "script_folder = './tf-resume-training'\n",
        "os.makedirs(script_folder, exist_ok=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copy the training script `tf_mnist_with_checkpoint.py` into this project directory."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import shutil\n",
        "\n",
        "# the training logic is in the tf_mnist_with_checkpoint.py file.\n",
        "shutil.copy('./tf_mnist_with_checkpoint.py', script_folder)\n",
        "\n",
        "# the utils.py just helps loading data from the downloaded MNIST dataset into numpy arrays.\n",
        "shutil.copy('./utils.py', script_folder)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create an experiment\n",
        "Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace for this distributed TensorFlow tutorial. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core import Experiment\n",
        "\n",
        "experiment_name = 'tf-resume-training'\n",
        "experiment = Experiment(ws, name=experiment_name)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create a TensorFlow estimator\n",
        "The AML SDK's TensorFlow estimator enables you to easily submit TensorFlow training jobs for both single-node and distributed runs. For more information on the TensorFlow estimator, refer [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-tensorflow).\n",
        "\n",
        "The TensorFlow estimator also takes a `framework_version` parameter -- if no version is provided, the estimator will default to the latest version supported by AzureML. Use `TensorFlow.get_supported_versions()` to get a list of all versions supported by your current SDK version or see the [SDK documentation](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.dnn?view=azure-ml-py) for the versions supported in the most current release."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.train.dnn import TensorFlow\n",
        "\n",
        "script_params={\n",
        "    '--data-folder': ds_data\n",
        "}\n",
        "\n",
        "estimator= TensorFlow(source_directory=script_folder,\n",
        "                      compute_target=compute_target,\n",
        "                      script_params=script_params,\n",
        "                      entry_script='tf_mnist_with_checkpoint.py')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the above code, we passed our training data reference `ds_data` to our script's `--data-folder` argument. This will 1) mount our datastore on the remote compute and 2) provide the path to the data zip file on our datastore."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Submit job\n",
        "### Run your experiment by submitting your estimator object. Note that this call is asynchronous."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run = experiment.submit(estimator)\n",
        "print(run)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Monitor your run\n",
        "You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.widgets import RunDetails\n",
        "RunDetails(run).show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Alternatively, you can block until the script has completed training before running more code."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.wait_for_completion(show_output=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Now let's resume the training from the above run"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First, we will get the DataPath to the outputs directory of the above run which\n",
        "contains the checkpoint files and/or model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "model_location = run._get_outputs_datapath()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, we will create a new TensorFlow estimator and pass in the model location. On passing 'resume_from' parameter, a new entry in script_params is created with key as 'resume_from' and value as the model/checkpoint files location and the location gets automatically mounted on the compute target."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.train.dnn import TensorFlow\n",
        "\n",
        "script_params={\n",
        "    '--data-folder': ds_data\n",
        "}\n",
        "\n",
        "estimator2 = TensorFlow(source_directory=script_folder,\n",
        "                      compute_target=compute_target,\n",
        "                      script_params=script_params,\n",
        "                      entry_script='tf_mnist_with_checkpoint.py',\n",
        "                      resume_from=model_location)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now you can submit the experiment and it should resume from previous run's checkpoint files."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run2 = experiment.submit(estimator2)\n",
        "print(run2)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run2.wait_for_completion(show_output=True)"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "hesuri"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.6"
    },
    "msauthor": "hesuri"
  },
  "nbformat": 4,
  "nbformat_minor": 2
}