{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/tutorials/get-started-day1/day1-part4-data.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial: Bring your own data (Part 4 of 4)\n",
"\n",
"---\n",
"## Introduction\n",
"\n",
"In the previous [Tutorial: Train a model in the cloud](day1-part3-train-model.ipynb) article, the CIFAR10 data was downloaded using the inbuilt `torchvision.datasets.CIFAR10` method in the PyTorch API. However, in many cases you are going to want to use your own data in a remote training run. This article focuses on the workflow you can leverage such that you can work with your own data in Azure Machine Learning. \n",
"\n",
"By the end of this tutorial you would have a better understanding of:\n",
"\n",
"- How to upload your data to Azure\n",
"- Best practices for working with cloud data in Azure Machine Learning\n",
"- Working with command-line arguments\n",
"\n",
"This notebook follows the steps provided on the [Python (day 1) - bring your own data documentation page](https://aka.ms/day1aml).\n",
"\n",
"## Prerequisites\n",
"\n",
"- You have completed:\n",
" - Setup on your [Azure Machine Learning Compute Cluster](day1-part1-setup.ipynb)\n",
" - [Tutorial: Hello World](day1-part2-hello-world.ipynb)\n",
" - [Tutorial: Train a model in the cloud](day1-part3-train-model.ipynb)\n",
"- Familiarity with Python and Machine Learning concepts\n",
"- If you are using a compute instance in Azure Machine Learning to run this notebook series, you are all set. Otherwise, please follow the [Configure a development environment for Azure Machine Learning](https://docs.microsoft.com/azure/machine-learning/how-to-configure-environment)\n",
"\n",
"---\n",
"\n",
"## Your machine learning code\n",
"\n",
"By now you have your training script running in Azure Machine Learning, and can monitor the model performance. Let's _parametrize_ the training script by introducing\n",
"arguments. Using arguments will allow you to easily compare different hyperparmeters.\n",
"\n",
"Presently our training script is set to download the CIFAR10 dataset on each run. The python code in [code/pytorch-cifar10-your-data/train.py](code/pytorch-cifar10-your-data/train.py) now uses **`argparse` to parametize the script.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Understanding your machine learning code changes\n",
"\n",
"The code used in `train.py` has leveraged the `argparse` library to set up the `data_path`, `learning_rate`, and `momentum`.\n",
"\n",
"```python\n",
"# .... other code\n",
"parser = argparse.ArgumentParser()\n",
"parser.add_argument('--data_path', type=str, help='Path to the training data')\n",
"parser.add_argument('--learning_rate', type=float, default=0.001, help='Learning rate for SGD')\n",
"parser.add_argument('--momentum', type=float, default=0.9, help='Momentum for SGD')\n",
"args = parser.parse_args()\n",
"# ... other code\n",
"```\n",
"\n",
"Also the `train.py` script was adapted to update the optimizer to use the user-defined parameters:\n",
"\n",
"```python\n",
"optimizer = optim.SGD(\n",
" net.parameters(),\n",
" lr=args.learning_rate, # get learning rate from command-line argument\n",
" momentum=args.momentum, # get momentum from command-line argument\n",
")\n",
"```\n",
"\n",
"## Test your machine learning code locally\n",
"\n",
"To run the modified training script locally, run the python command below.\n",
"\n",
"You avoid having to download the CIFAR10 dataset by passing in a local path to the\n",
"data. Also you can experiment with different values for _learning rate_ and\n",
"_momentum_ hyperparameters without having to hard-code them in the training script.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python code/pytorch-cifar10-your-data/train.py --data_path ./data --learning_rate 0.003 --momentum 0.92"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Upload your data to Azure\n",
"\n",
"In order to run this script in Azure Machine Learning, you need to make your training data available in Azure. Your Azure Machine Learning workspace comes equipped with a _default_ **Datastore** - an Azure Blob storage account - that you can use to store your training data.\n",
"\n",
"> <span style=\"color:purple; font-weight:bold\">! NOTE <br>\n",
"> Azure Machine Learning allows you to connect other cloud-based datastores that store your data. For more details, see [datastores documentation](./concept-data.md).</span>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Workspace\n",
"ws = Workspace.from_config()\n",
"datastore = ws.get_default_datastore()\n",
"datastore.upload(src_dir='./data', target_path='datasets/cifar10', overwrite=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `target_path` specifies the path on the datastore where the CIFAR10 data will be uploaded.\n",
"\n",
"## Submit your machine learning code to Azure Machine Learning\n",
"\n",
"As you have done previously, create a new Python control script:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remote run",
"batchai",
"configure run",
"use notebook widget",
"get metrics",
"use datastore"
]
},
"outputs": [],
"source": [
"from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig, Dataset\n",
"from azureml.widgets import RunDetails\n",
"\n",
"ws = Workspace.from_config()\n",
"\n",
"datastore = ws.get_default_datastore()\n",
"dataset = Dataset.File.from_files(path=(datastore, 'datasets/cifar10'))\n",
"\n",
"experiment = Experiment(workspace=ws, name='day1-experiment-data')\n",
"\n",
"config = ScriptRunConfig(source_directory='./code/pytorch-cifar10-your-data',\n",
" script='train.py',\n",
" compute_target='cpu-cluster',\n",
" arguments=[\n",
" '--data_path', dataset.as_named_input('input').as_mount(),\n",
" '--learning_rate', 0.003,\n",
" '--momentum', 0.92])\n",
"\n",
"# set up pytorch environment\n",
"env = Environment.from_conda_specification(name='pytorch-aml-env',file_path='configuration/pytorch-aml-env.yml')\n",
"config.run_config.environment = env\n",
"\n",
"run = experiment.submit(config)\n",
"RunDetails(run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Understand the control code\n",
"\n",
"The above control code has the following additional code compared to the control code written in [previous tutorial](03-train-model.ipynb)\n",
"\n",
"**`dataset = Dataset.File.from_files(path=(datastore, 'datasets/cifar10'))`**: A Dataset is used to reference the data you uploaded to the Azure Blob Store. Datasets are an abstraction layer on top of your data that are designed to improve reliability and trustworthiness.\n",
"\n",
"\n",
"**`config = ScriptRunConfig(...)`**: We modified the `ScriptRunConfig` to include a list of arguments that will be passed into `train.py`. We also specified `dataset.as_named_input('input').as_mount()`, which means the directory specified will be _mounted_ to the compute target.\n",
"\n",
"## Inspect the 70_driver_log log file\n",
"\n",
"In the navigate to the 70_driver_log.txt file - you should see the following output:\n",
"\n",
"```\n",
"Processing 'input'.\n",
"Processing dataset FileDataset\n",
"{\n",
" \"source\": [\n",
" \"('workspaceblobstore', 'datasets/cifar10')\"\n",
" ],\n",
" \"definition\": [\n",
" \"GetDatastoreFiles\"\n",
" ],\n",
" \"registration\": {\n",
" \"id\": \"XXXXX\",\n",
" \"name\": null,\n",
" \"version\": null,\n",
" \"workspace\": \"Workspace.create(name='XXXX', subscription_id='XXXX', resource_group='X')\"\n",
" }\n",
"}\n",
"Mounting input to /tmp/tmp9kituvp3.\n",
"Mounted input to /tmp/tmp9kituvp3 as folder.\n",
"Exit __enter__ of DatasetContextManager\n",
"Entering Run History Context Manager.\n",
"Current directory: /mnt/batch/tasks/shared/LS_root/jobs/dsvm-aml/azureml/tutorial-session-3_1600171983_763c5381/mounts/workspaceblobstore/azureml/tutorial-session-3_1600171983_763c5381\n",
"Preparing to call script [ train.py ] with arguments: ['--data_path', '$input', '--learning_rate', '0.003', '--momentum', '0.92']\n",
"After variable expansion, calling script [ train.py ] with arguments: ['--data_path', '/tmp/tmp9kituvp3', '--learning_rate', '0.003', '--momentum', '0.92']\n",
"\n",
"Script type = None\n",
"===== DATA =====\n",
"DATA PATH: /tmp/tmp9kituvp3\n",
"LIST FILES IN DATA PATH...\n",
"['cifar-10-batches-py', 'cifar-10-python.tar.gz']\n",
"```\n",
"\n",
"Notice:\n",
"\n",
"1. Azure Machine Learning has mounted the blob store to the compute cluster automatically for you.\n",
"2. The ``dataset.as_named_input('input').as_mount()`` used in the control script resolves to the mount point\n",
"3. In the machine learning code we include a line to list the directorys under the data directory - you can see the list above.\n",
"\n",
"## Clean up resources\n",
"\n",
"The compute cluster will scale down to zero after 40minutes of idle time. When the compute is idle you will not be charged. If you want to delete the cluster use:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"ct = ws.compute_targets['cpu-cluster']\n",
"# ct.delete()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you're not going to use what you've created here, delete the resources you just created with this quickstart so you don't incur any charges for storage. In the Azure portal, select and delete your resource group.\n",
"\n",
"## Next Steps\n",
"\n",
"To learn more about the capabilities of Azure Machine Learning please refer to the following documentation:\n",
"\n",
"* [Azure Machine Learning Pipelines](https://docs.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines#building-pipelines-with-the-python-sdk)\n",
"* [Deploy models for real-time scoring](https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-deploy-models-with-aml)\n",
"* [Hyper parameter tuning with Azure Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters)\n",
"* [Prep your code for production](https://docs.microsoft.com/azure/machine-learning/tutorial-convert-ml-experiment-to-production)"
]
}
],
"metadata": {
"authors": [
{
"name": "samkemp"
}
],
"celltoolbar": "Edit Metadata",
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 4
}