Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Introduction to labeled datasets

Labeled datasets are the output of Azure Machine Learning [labeling projects](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-create-labeling-projects). A labeled dataset captures references to the data (e.g. image files) together with their labels.

This tutorial introduces the capabilities of labeled datasets and shows how to use them in training.

Learn how to:

> * Set up your development environment
> * Explore labeled datasets
> * Train a simple deep learning neural network on a remote cluster

## Prerequisites
* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning
* Go through Azure Machine Learning [labeling projects](https://docs.microsoft.com/azure/machine-learning/service/how-to-create-labeling-projects) and export the labels as an Azure Machine Learning dataset
* Go through the [configuration notebook](../../../configuration.ipynb) to:
    * install the latest version of azureml-sdk
    * install the latest version of azureml-contrib-dataset
    * install [PyTorch](https://pytorch.org/)
    * create a workspace and its configuration file (`config.json`)

## Set up your development environment

All the setup for your development work can be accomplished in a Python notebook.  Setup includes:

* Importing Python packages
* Connecting to a workspace to enable communication between your local computer and remote resources
* Creating an experiment to track all your runs
* Creating a remote compute target to use for training

### Import packages

Import Python packages you need in this session. Also display the Azure Machine Learning SDK version.

In [None]:
import os
import azureml.core
import azureml.contrib.dataset
from azureml.core import Dataset, Workspace, Experiment
from azureml.contrib.dataset import FileHandlingOption

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)
print("Azure ML Contrib Version", azureml.contrib.dataset.VERSION)

### Connect to workspace

Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **config.json** and loads the details into an object named `workspace`.

In [None]:
# load workspace
workspace = Workspace.from_config()
print('Workspace name: ' + workspace.name, 
      'Azure region: ' + workspace.location, 
      'Subscription id: ' + workspace.subscription_id, 
      'Resource group: ' + workspace.resource_group, sep='\n')
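
`Workspace.from_config()` relies on a `config.json` file that identifies the workspace. As a rough sketch, the file is a small JSON object with the keys `subscription_id`, `resource_group`, and `workspace_name`. The placeholder values and the helper `read_workspace_config` below are hypothetical and only illustrate the shape of the file; they are not part of the SDK.

```python
import json
import os
import tempfile

# Keys that config.json is expected to contain (values below are placeholders).
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def read_workspace_config(path):
    """Hypothetical helper: load config.json and check the expected keys."""
    with open(path) as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError("config.json is missing keys: " + ", ".join(missing))
    return config


# Write a placeholder config to a temporary folder and read it back.
example_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(example_config, f, indent=4)

loaded_config = read_workspace_config(config_path)
print(sorted(loaded_config))
```

A check like this can help surface a missing or incomplete `config.json` before `Workspace.from_config()` is called.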

### Create experiment and a directory

Create an experiment to track the runs in your workspace and a directory to deliver the necessary code from your computer to the remote resource.

In [None]:
# create an ML experiment
exp = Experiment(workspace=workspace, name='labeled-datasets')

# create a directory
script_folder = './labeled-datasets'
os.makedirs(script_folder, exist_ok=True)

### Create or attach an existing compute resource

Azure Machine Learning Compute is a managed service that lets data scientists train machine learning models on clusters of Azure virtual machines, including VMs with GPU support. In this tutorial, you will create Azure Machine Learning Compute as your training environment. The code below creates the compute cluster for you if it doesn't already exist in your workspace.

**Creation of compute takes approximately 5 minutes.** If an AmlCompute with that name is already in your workspace, the code skips the creation process.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "openhack"

try:
    compute_target = ComputeTarget(workspace=workspace, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', 
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)

    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

## Explore labeled datasets

**Note**: Creating labeled datasets is not covered in this tutorial. To create labeled datasets, go through [labeling projects](https://docs.microsoft.com/azure/machine-learning/service/how-to-create-labeling-projects) and export the output labels as Azure Machine Learning datasets.

`animal_labels` used in this tutorial section is the output from a labeling project, with the task type of "Object Identification".

In [None]:
# get animal_labels dataset from the workspace
animal_labels = Dataset.get_by_name(workspace, 'animal_labels')
animal_labels

You can load labeled datasets into a pandas DataFrame. There are three file handling options for loading the data files referenced by a labeled dataset:
* Streaming: The default option to load data files.
* Download: Download your data files to a local path.
* Mount: Mount your data files to a mount point. Mount only works for Linux-based compute, including Azure Machine Learning notebook VM and Azure Machine Learning Compute.

In [None]:
animal_pd = animal_labels.to_pandas_dataframe(file_handling_option=FileHandlingOption.DOWNLOAD, target_path='./download/', overwrite_download=True)
animal_pd
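
Because mount only works on Linux-based compute, one option is to pick the file handling option based on the platform the notebook runs on. The sketch below is a minimal illustration of that idea. The helper `choose_file_handling` is hypothetical, and the member name `MOUNT` is an assumption (only `FileHandlingOption.DOWNLOAD` appears in this tutorial).

```python
import platform


def choose_file_handling(system_name=None):
    """Hypothetical helper: return the name of a file handling option.

    Mount is only available on Linux-based compute, so fall back to
    download everywhere else.
    """
    system_name = system_name or platform.system()
    return "MOUNT" if system_name == "Linux" else "DOWNLOAD"


option_name = choose_file_handling()
print(option_name)

# With the SDK installed, the name could be resolved to the enum member
# (assumes a member with that name exists):
# option = getattr(FileHandlingOption, option_name)
```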

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# read images from downloaded path
img = mpimg.imread(animal_pd.loc[0,'image_url'])
imgplot = plt.imshow(img)

You can also load labeled datasets into [torchvision datasets](https://pytorch.org/docs/stable/torchvision/datasets.html), so that you can use the open source libraries provided by PyTorch for image transformation and training.

In [None]:
from torchvision.transforms import functional as F

# load animal_labels dataset into torchvision dataset
pytorch_dataset = animal_labels.to_torchvision()
img = pytorch_dataset[0][0]
print(type(img))

# use methods from torchvision to transform the img into grayscale
pil_image = F.to_pil_image(img)
gray_image = F.to_grayscale(pil_image, num_output_channels=3)

imgplot = plt.imshow(gray_image)

## Train an image classification model

The `crack_labels` dataset used in this tutorial section is the output from a labeling project, with the task type of "Image Classification Multi-class". We will use this dataset to train an image classification model that classifies whether an image has cracks or not.

In [None]:
# get crack_labels dataset from the workspace
crack_labels = Dataset.get_by_name(workspace, 'crack_labels')
crack_labels

### Configure training job

You can ask the system to build a conda environment based on your dependency specification. Once the environment is built, and if you don't change your dependencies, it will be reused in subsequent runs.

In [None]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

conda_env = Environment('conda-env')
conda_env.python.conda_dependencies = CondaDependencies.create(pip_packages=['azureml-sdk',
                                                                             'azureml-contrib-dataset',
                                                                             'torch','torchvision',
                                                                             'azureml-dataset-runtime[pandas]'])

A ScriptRunConfig object is used to submit the run. Create a ScriptRunConfig by specifying:

* The directory that contains your scripts. All the files in this directory are uploaded to the cluster nodes for execution.
* The training script name, train.py
* The input dataset for training
* The compute target. In this case you will use the AmlCompute you created
* The environment for the experiment

In [None]:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=script_folder,
                      script='train.py',
                      arguments=[crack_labels.as_named_input('crack_labels')],
                      compute_target=compute_target,
                      environment=conda_env)
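
The `train.py` script referenced in the ScriptRunConfig must exist in `script_folder` before the run is submitted, and its contents are not shown in this section. The sketch below writes a minimal placeholder script. The script body is an assumption, not the tutorial's original training code: it only looks up the named input, converts it with `to_torchvision()`, and wraps it in a `DataLoader`, leaving the model definition and training loop to be filled in.

```python
import os
import textwrap

script_folder = './labeled-datasets'
os.makedirs(script_folder, exist_ok=True)

# Sketch of a training script; the body is an assumption, not the original train.py.
train_script = textwrap.dedent('''\
    from azureml.core import Run
    import azureml.contrib.dataset  # assumed to be needed for labeled dataset support
    from torch.utils.data import DataLoader

    run = Run.get_context()

    # look up the dataset by the name given to as_named_input()
    labeled_dataset = run.input_datasets['crack_labels']
    pytorch_dataset = labeled_dataset.to_torchvision()
    run.log('num_samples', len(pytorch_dataset))

    # batch the samples; model definition and training loop would follow here
    train_loader = DataLoader(pytorch_dataset, batch_size=4, shuffle=True)
    ''')

# fail early on syntax errors before anything is uploaded to the cluster
compile(train_script, 'train.py', 'exec')

script_path = os.path.join(script_folder, 'train.py')
with open(script_path, 'w') as f:
    f.write(train_script)
print('Wrote', script_path)
```

Compiling the script text locally only checks its syntax. Import errors or dataset problems would still surface on the remote cluster when the run executes.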

### Submit job to run

Submit the ScriptRunConfig to the Azure ML experiment to kick off the execution.

In [None]:
run = exp.submit(src)

In [None]:
run.wait_for_completion(show_output=True)