MachineLearningNotebooks/how-to-use-azureml/work-with-data/datasets-tutorial/labeled-datasets/labeled-datasets.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
        "\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/work-with-data/datasets-tutorial/labeled-datasets/labeled-datasets.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Introduction to labeled datasets\n",
        "\n",
        "Labeled datasets are output from Azure Machine Learning [labeling projects](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-create-labeling-projects). It captures the reference to the data (e.g. image files) and its labels. \n",
        "\n",
        "This tutorial introduces the capabilities of labeled datasets and how to use it in training.\n",
        "\n",
        "Learn how-to:\n",
        "\n",
        "> * Set up your development environment\n",
        "> * Explore labeled datasets\n",
        "> * Train a simple deep learning neural network on a remote cluster\n",
        "\n",
        "## Prerequisite:\n",
        "* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning\n",
        "* Go through Azure Machine Learning [labeling projects](https://docs.microsoft.com/azure/machine-learning/service/how-to-create-labeling-projects) and export the labels as an Azure Machine Learning dataset\n",
        "* Go through the [configuration notebook](../../../configuration.ipynb) to:\n",
        "    * install the latest version of azureml-sdk\n",
        "    * install the latest version of azureml-contrib-dataset\n",
        "    * install [PyTorch](https://pytorch.org/)\n",
        "    * create a workspace and its configuration file (`config.json`)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Set up your development environment\n",
        "\n",
        "All the setup for your development work can be accomplished in a Python notebook.  Setup includes:\n",
        "\n",
        "* Importing Python packages\n",
        "* Connecting to a workspace to enable communication between your local computer and remote resources\n",
        "* Creating an experiment to track all your runs\n",
        "* Creating a remote compute target to use for training\n",
        "\n",
        "### Import packages\n",
        "\n",
        "Import Python packages you need in this session. Also display the Azure Machine Learning SDK version."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "import azureml.core\n",
        "import azureml.contrib.dataset\n",
        "from azureml.core import Dataset, Workspace, Experiment\n",
        "from azureml.contrib.dataset import FileHandlingOption\n",
        "\n",
        "# check core SDK version number\n",
        "print(\"Azure ML SDK Version: \", azureml.core.VERSION)\n",
        "print(\"Azure ML Contrib Version\", azureml.contrib.dataset.VERSION)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Connect to workspace\n",
        "\n",
        "Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **config.json** and loads the details into an object named `workspace`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# load workspace\n",
        "workspace = Workspace.from_config()\n",
        "print('Workspace name: ' + workspace.name, \n",
        "      'Azure region: ' + workspace.location, \n",
        "      'Subscription id: ' + workspace.subscription_id, \n",
        "      'Resource group: ' + workspace.resource_group, sep='\\n')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create experiment and a directory\n",
        "\n",
        "Create an experiment to track the runs in your workspace and a directory to deliver the necessary code from your computer to the remote resource."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# create an ML experiment\n",
        "exp = Experiment(workspace=workspace, name='labeled-datasets')\n",
        "\n",
        "# create a directory\n",
        "script_folder = './labeled-datasets'\n",
        "os.makedirs(script_folder, exist_ok=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create or Attach existing compute resource\n",
        "By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you will create Azure Machine Learning Compute as your training environment. The code below creates the compute clusters for you if they don't already exist in your workspace.\n",
        "\n",
        "**Creation of compute takes approximately 5 minutes.** If the AmlCompute with that name is already in your workspace the code will skip the creation process."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.compute import ComputeTarget, AmlCompute\n",
        "from azureml.core.compute_target import ComputeTargetException\n",
        "\n",
        "# choose a name for your cluster\n",
        "cluster_name = \"openhack\"\n",
        "\n",
        "try:\n",
        "    compute_target = ComputeTarget(workspace=workspace, name=cluster_name)\n",
        "    print('Found existing compute target')\n",
        "except ComputeTargetException:\n",
        "    print('Creating a new compute target...')\n",
        "    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', \n",
        "                                                           max_nodes=4)\n",
        "\n",
        "    # create the cluster\n",
        "    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)\n",
        "\n",
        "    # can poll for a minimum number of nodes and for a specific timeout. \n",
        "    # if no min node count is provided it uses the scale settings for the cluster\n",
        "    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
        "\n",
        "# use get_status() to get a detailed status for the current cluster. \n",
        "print(compute_target.get_status().serialize())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Explore labeled datasets\n",
        "\n",
        "**Note**: How to create labeled datasets is not covered in this tutorial. To create labeled datasets, you can go through [labeling projects](https://docs.microsoft.com/azure/machine-learning/service/how-to-create-labeling-projects) and export the output labels as Azure Machine Lerning datasets. \n",
        "\n",
        "`animal_labels` used in this tutorial section is the output from a labeling project, with the task type of \"Object Identification\"."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# get animal_labels dataset from the workspace\n",
        "animal_labels = Dataset.get_by_name(workspace, 'animal_labels')\n",
        "animal_labels"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You can load labeled datasets into pandas DataFrame. There are 3 file handling option that you can choose to load the data files referenced by the labeled datasets:\n",
        "* Streaming: The default option to load data files.\n",
        "* Download: Download your data files to a local path.\n",
        "* Mount: Mount your data files to a mount point. Mount only works for Linux-based compute, including Azure Machine Learning notebook VM and Azure Machine Learning Compute."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "animal_pd = animal_labels.to_pandas_dataframe(file_handling_option=FileHandlingOption.DOWNLOAD, target_path='./download/', overwrite_download=True)\n",
        "animal_pd"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n",
        "import matplotlib.image as mpimg\n",
        "\n",
        "# read images from downloaded path\n",
        "img = mpimg.imread(animal_pd.loc[0,'image_url'])\n",
        "imgplot = plt.imshow(img)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You can also load labeled datasets into [torchvision datasets](https://pytorch.org/docs/stable/torchvision/datasets.html), so that you can leverage on the open source libraries provided by PyTorch for image transformation and training."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from torchvision.transforms import functional as F\n",
        "\n",
        "# load animal_labels dataset into torchvision dataset\n",
        "pytorch_dataset = animal_labels.to_torchvision()\n",
        "img = pytorch_dataset[0][0]\n",
        "print(type(img))\n",
        "\n",
        "# use methods from torchvision to transform the img into grayscale\n",
        "pil_image = F.to_pil_image(img)\n",
        "gray_image = F.to_grayscale(pil_image, num_output_channels=3)\n",
        "\n",
        "imgplot = plt.imshow(gray_image)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Train an image classification model\n",
        "\n",
        " `crack_labels` dataset used in this tutorial section is the output from a labeling project, with the task type of \"Image Classification Multi-class\". We will use this dataset to train an image classification model that classify whether an image has cracks or not."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# get crack_labels dataset from the workspace\n",
        "crack_labels = Dataset.get_by_name(workspace, 'crack_labels')\n",
        "crack_labels"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Configure Estimator for training\n",
        "\n",
        "You can ask the system to build a conda environment based on your dependency specification. Once the environment is built, and if you don't change your dependencies, it will be reused in subsequent runs."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core import Environment\n",
        "from azureml.core.conda_dependencies import CondaDependencies\n",
        "\n",
        "conda_env = Environment('conda-env')\n",
        "conda_env.python.conda_dependencies = CondaDependencies.create(pip_packages=['azureml-sdk',\n",
        "                                                                             'azureml-contrib-dataset',\n",
        "                                                                             'torch','torchvision',\n",
        "                                                                             'azureml-dataset-runtime[pandas]'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "An estimator object is used to submit the run. Azure Machine Learning has pre-configured estimators for common machine learning frameworks, as well as generic Estimator. Create a generic estimator for by specifying\n",
        "\n",
        "* The name of the estimator object, `est`\n",
        "* The directory that contains your scripts. All the files in this directory are uploaded into the cluster nodes for execution. \n",
        "* The training script name, train.py\n",
        "* The input dataset for training\n",
        "* The compute target. In this case you will use the AmlCompute you created\n",
        "* The environment definition for the experiment"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.train.estimator import Estimator\n",
        "\n",
        "est = Estimator(source_directory=script_folder, \n",
        "                entry_script='train.py',\n",
        "                inputs=[crack_labels.as_named_input('crack_labels')],\n",
        "                compute_target=compute_target,\n",
        "                environment_definition= conda_env)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Submit job to run\n",
        "\n",
        "Submit the estimator to the Azure ML experiment to kick off the execution."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run = exp.submit(est)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.wait_for_completion(show_output=True)"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "sihhu"
      }
    ],
    "category": "tutorial",
    "compute": [
      "Remote"
    ],
    "deployment": [
      "None"
    ],
    "exclude_from_index": false,
    "framework": [
      "Azure ML"
    ],
    "friendly_name": "Introduction to labeled datasets",
    "index_order": 1,
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    },
    "nteract": {
      "version": "nteract-front-end@1.0.0"
    },
    "star_tag": [
      "featured"
    ],
    "tags": [
      "Dataset",
      "label",
      "Estimator"
    ],
    "task": "Train"
  },
  "nbformat": 4,
  "nbformat_minor": 2
}