# Tutorial: Learn how to use Datasets in Azure ML

In this tutorial, you learn how to use Azure ML Datasets to train a regression model with the Azure Machine Learning SDK for Python. You will:

* Explore and prepare data for training the model
* Register the Dataset in your workspace to share it with others
* Take snapshots of data to ensure models can be trained with the same data every time
* Create and use multiple Dataset definitions to ensure that updates to the definition don't break existing pipelines/scripts


In this tutorial, you:

&#x2611; Set up a Python environment and import packages

&#x2611; Load the Titanic data from your Azure Blob Storage. (The [original data](https://www.kaggle.com/c/titanic/data) can be found on Kaggle.)

&#x2611; Explore and cleanse the data to remove anomalies

&#x2611; Register the Dataset in your workspace, allowing you to use it in model training 

&#x2611; Take a Dataset snapshot for repeatability and train a model with the snapshot

&#x2611; Make changes to the dataset's definition without breaking the production model or the daily data pipeline

## Prerequisites
Skip to Set up your development environment to read through the notebook steps, or use the instructions below to get the notebook and run it on Azure Notebooks or your own notebook server. To run the notebook, you will need:

A Python 3.6 notebook server with the following installed:
* The Azure Machine Learning SDK for Python
* The Azure Machine Learning Data Prep SDK for Python
* The tutorial notebook

The data file and the train.py script, to store in your Azure Blob Storage account:
 * [Titanic data](./train-dataset/Titanic.csv)
 * [train.py](./train-dataset/train.py)

To create and register Datasets you need:

   * An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning service](https://aka.ms/AMLFree) today.

   * An Azure Machine Learning service workspace. See [Create an Azure Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/setup-create-workspace?branch=release-build-amls).

   * The Azure Machine Learning SDK for Python (version 1.0.21 or later). To install or update to the latest version of the SDK, see [Install or update the SDK](https://docs.microsoft.com/python/api/overview/azure/ml/install?view=azure-ml-py).


For more information on how to set up your workspace, see [Create an Azure Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/setup-create-workspace?branch=release-build-amls).

The first step is to set up your Python environment. Import the Python packages you need, including `azureml.dataprep` and `azureml.core.dataset`. Then access your workspace through your Azure subscription and set up your compute target.

In [None]:
import azureml.dataprep as dprep
import azureml.core
import pandas as pd
import logging
import os
import shutil
from azureml.core import Workspace, Datastore, Dataset

# Get existing workspace from config.json file in the same folder as the tutorial notebook
# You can download the config file from your workspace
workspace = Workspace.from_config()
print("Workspace")
print(workspace)
print("Compute targets")
print(workspace.compute_targets)

# Get compute target that has already been attached to the workspace
# Pick the right compute target from the list of computes attached to your workspace

compute_target_name = 'dataset-test'
remote_compute_target = workspace.compute_targets[compute_target_name]
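
The name 'dataset-test' above is only an example, so the lookup fails if your workspace has no compute target with that name. The sketch below adds a small guard that lists the available names in the error message. It assumes `workspace.compute_targets` behaves like a dictionary keyed by name, as the lookup above suggests; the helper `pick_compute_target` is hypothetical and not part of the SDK.

```python
def pick_compute_target(compute_targets, name):
    """Return the compute target called `name`, or fail with a readable message.

    `compute_targets` is assumed to be a dict-like mapping of name -> target,
    such as the value printed from workspace.compute_targets above.
    """
    if name not in compute_targets:
        available = ", ".join(sorted(compute_targets)) or "(none attached)"
        raise KeyError(
            "Compute target '{}' not found. Available: {}".format(name, available)
        )
    return compute_targets[name]


# Stand-in mapping so the sketch runs without a workspace.
example_targets = {"dataset-test": "<compute object>", "cpu-cluster": "<compute object>"}
print(pick_compute_target(example_targets, "dataset-test"))
```

In the notebook you would pass `workspace.compute_targets` instead of the stand-in mapping.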

To load data to your dataset, you will access the data through your datastore. After you create your dataset, you can use `get_profile()` to see your data's statistics.

Next, upload the [original data](https://www.kaggle.com/c/titanic/data) to the default datastore (blob) within your workspace.

In [None]:
datastore = workspace.get_default_datastore()
datastore.upload_files(files=['./train-dataset/Titanic.csv'],
                       target_path='train-dataset/',
                       overwrite=True,
                       show_progress=True)

dataset = Dataset.auto_read_files(path=datastore.path('train-dataset/Titanic.csv'))

# Display the profile of the Titanic Dataset
dataset.get_profile()

To predict whether a person survived the sinking of the Titanic, the relevant columns for training the model are 'Survived', 'Pclass', 'Sex', 'SibSp', and 'Parch'. You can update your dataset's definition to keep only the columns you need. The cell below also keeps 'Fare', which is dropped in a later definition update. You also need to convert the values ("male", "female") in the "Sex" column to 0 or 1, because the algorithm in the train.py file uses numeric values instead of strings.

For more examples of preparing data with Datasets, see [Explore and prepare data with the Dataset class](https://aka.ms/azureml/howto/exploreandpreparedata).

In [None]:
ds_def = dataset.get_definition()
ds_def = ds_def.keep_columns(['Survived','Pclass', 'Sex','SibSp', 'Parch', 'Fare'])
ds_def = ds_def.replace('Sex','male', 0)
ds_def = ds_def.replace('Sex','female', 1)
ds_def.head(5)
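
To make the effect of `keep_columns` and `replace` concrete, here is a plain-Python sketch of the same two transformations applied to in-memory rows. It does not use the Dataset API, the helper `clean_row` is hypothetical, and the two sample rows are invented for illustration rather than taken from the Titanic file.

```python
KEEP = ['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Fare']
SEX_CODES = {'male': 0, 'female': 1}


def clean_row(row):
    """Keep only the training columns and encode 'Sex' as 0 or 1."""
    kept = {column: row[column] for column in KEEP}
    # Values other than 'male'/'female' are left unchanged.
    kept['Sex'] = SEX_CODES.get(kept['Sex'], kept['Sex'])
    return kept


# Hypothetical rows, not real passenger records.
rows = [
    {'PassengerId': 1, 'Survived': 0, 'Pclass': 3, 'Name': 'A', 'Sex': 'male',
     'SibSp': 1, 'Parch': 0, 'Fare': 7.25},
    {'PassengerId': 2, 'Survived': 1, 'Pclass': 1, 'Name': 'B', 'Sex': 'female',
     'SibSp': 0, 'Parch': 0, 'Fare': 71.28},
]
cleaned = [clean_row(row) for row in rows]
for row in cleaned:
    print(row)
```

The Dataset definition applies these steps lazily to the stored data, whereas this sketch applies them eagerly to a list, so treat it only as a picture of the resulting shape.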

Once you have cleaned your data, you can register your dataset in your workspace.

Registering your dataset gives you easy access to your processed data and lets you share it with other people in your organization who use the same workspace. The dataset can be accessed from any notebook or script that is connected to your workspace.

In [None]:
dataset = dataset.update_definition(ds_def, 'Cleaned Data')

dataset.generate_profile(compute_target='local').get_result()

dataset_name = 'clean_Titanic_tutorial'
dataset = dataset.register(workspace=workspace,
                           name=dataset_name,
                           description='training dataset',
                           tags = {'year':'2019', 'month':'Apr'},
                           exist_ok=True)
workspace.datasets

You can also take a snapshot of your dataset. A snapshot lets you reproduce your data exactly as it was at that moment. Even if you change the definition of your dataset, or have data that refreshes regularly, you can always go back to your snapshot to compare. Because this snapshot is created on a compute target in your workspace, provisioning the compute may take a significant amount of time before the snapshot action itself runs.

In [None]:
print(dataset.get_all_snapshots())
snapshot_name = 'train_snapshot'

print("Compute target status")
print(remote_compute_target.get_status().provisioning_state)

snapshot = dataset.create_snapshot(snapshot_name=snapshot_name, 
                                   compute_target=remote_compute_target, 
                                   create_data_snapshot=True)
snapshot.wait_for_completion()

Now that you have registered your dataset and created a snapshot, you can retrieve the dataset and its snapshot in your train.py script.

The following code snippet trains your model locally using the train.py script.

In [None]:
from azureml.core import Experiment, RunConfiguration

experiment_name = 'training-datasets'
experiment = Experiment(workspace = workspace, name = experiment_name)
project_folder = './train-dataset/'

# create a new RunConfig object
run_config = RunConfiguration()

run_config.environment.python.user_managed_dependencies = True

from azureml.core import Run
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=project_folder, 
                      script='train.py', 
                      run_config=run_config) 
run = experiment.submit(config=src)
run.wait_for_completion(show_output=True)

You can also use the same script with your dataset snapshot for your Pipeline's Python Script Step.


In [None]:
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.data.data_reference import DataReference

trainStep = PythonScriptStep(script_name="train.py",
                             compute_target=remote_compute_target,
                             source_directory=project_folder)

# Pass the steps as a list of pipeline steps
pipeline = Pipeline(workspace=workspace,
                    steps=[trainStep])

pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion()

At any point in your workflow, you can get a previous snapshot of your dataset and use that version in your pipeline to quickly see how different versions of your data affect your model.

In [None]:
snapshot = dataset.get_snapshot(snapshot_name=snapshot_name)
snapshot.to_pandas_dataframe().head(5)

You can make changes to the dataset's definition without breaking the production model or the daily data pipeline. 

You can call `get_definitions` to see that there are several versions. Each time you update the dataset's definition, a new version is added.

In [None]:
dataset.get_definitions()

In [None]:
dataset = Dataset.get(workspace=workspace, name=dataset_name)
ds_def = dataset.get_definition()
ds_def = ds_def.drop_columns(['Fare'])
dataset = dataset.update_definition(ds_def, 'Dropping Fare as PClass and Fare are strongly correlated')
dataset.generate_profile(compute_target='local').get_result()

Dataset definitions can be deprecated when usage is no longer recommended and a replacement is available. When a deprecated dataset definition is used in an AML Experimentation/Pipeline scenario, a warning message is returned, but execution is not blocked.

In [None]:
# Deprecate dataset definition 1 in favor of definition 2
ds_def = dataset.get_definition('1')
ds_def.deprecate(deprecate_by_dataset_id=dataset._id, deprecated_by_definition_version='2')
dataset.get_definitions()

Dataset definitions can be archived when they should no longer be used for any reason (for example, when the underlying data is no longer available). When an archived dataset definition is used in an AML Experimentation/Pipeline scenario, execution is blocked with an error. No further actions can be performed on archived Dataset definitions, but the references are kept intact.

In [None]:
# Archive the deprecated dataset definition #1
ds_def = dataset.get_definition('1')
ds_def.archive()
dataset.get_definitions()

You can also reactivate any definition that you archived, so that it can be used again later.

In [None]:
ds_def = dataset.get_definition('1')
ds_def.reactivate()
dataset.get_definitions()
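
The rules described above (a deprecated definition warns but still runs, an archived definition blocks execution) can be pictured with a toy model. This is an illustration only, not the SDK's implementation: `DefinitionState` and `check_usable` are hypothetical names, and the sketch assumes that reactivating a definition returns it to normal use.

```python
import warnings
from enum import Enum


class DefinitionState(Enum):
    ACTIVE = 'active'
    DEPRECATED = 'deprecated'
    ARCHIVED = 'archived'


def check_usable(state):
    """Mirror the rules above: deprecated warns, archived blocks."""
    if state is DefinitionState.ARCHIVED:
        raise RuntimeError('Definition is archived; execution is blocked.')
    if state is DefinitionState.DEPRECATED:
        warnings.warn('Definition is deprecated; consider its replacement.')
    return True


# Walk definition 1 through the same steps as the cells above.
state = DefinitionState.DEPRECATED      # after deprecate()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    check_usable(state)                 # allowed, but a warning is recorded
print(len(caught))

state = DefinitionState.ARCHIVED        # after archive()
try:
    check_usable(state)
except RuntimeError as error:
    print(error)

state = DefinitionState.ACTIVE          # after reactivate() (assumed behavior)
print(check_usable(state))
```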

Now delete the snapshot by name to free up the space it uses in your resources.

In [None]:
dataset.delete_snapshot(snapshot_name)

You have now used a dataset from the start to the finish of your experiment!