{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train with Azure Machine Learning datasets\n",
"Datasets are categorized into TabularDataset and FileDataset based on how users consume them in training.\n",
"* A TabularDataset represents data in a tabular format by parsing the provided file or list of files. A TabularDataset can be created from .csv, .tsv, and .parquet files, SQL query results, and more. For the complete list, please visit our [documentation](https://aka.ms/tabulardataset-api-reference). It provides you with the ability to materialize the data into a pandas DataFrame.\n",
"* A FileDataset references single or multiple files in your datastores or public URLs. This provides you with the ability to download or mount the files to your compute. The files can be of any format, which enables a wider range of machine learning scenarios, including deep learning.\n",
"\n",
"In this tutorial, you will learn how to train with Azure Machine Learning datasets:\n",
"\n",
"☑ Use datasets directly in your training script\n",
"\n",
"☑ Use datasets to mount files to a remote compute"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) first if you haven't already established your connection to the AzureML Workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check core SDK version number\n",
"import azureml.core\n",
"\n",
"print('SDK version:', azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Workspace\n",
"\n",
"Initialize a workspace object from persisted configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Experiment\n",
"\n",
"**Experiment** is a logical container in an Azure ML Workspace. It hosts run records which can include run metrics and output artifacts from your experiments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"experiment_name = 'train-with-datasets'\n",
"\n",
"from azureml.core import Experiment\n",
"exp = Experiment(workspace=ws, name=experiment_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create or Attach existing compute resource\n",
"By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as your training environment. The code below creates the compute clusters for you if they don't already exist in your workspace.\n",
"\n",
"> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.\n",
"\n",
"**Creation of compute takes approximately 5 minutes.** If the AmlCompute with that name is already in your workspace the code will skip the creation process."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import AmlCompute\n",
"from azureml.core.compute import ComputeTarget\n",
"import os\n",
"\n",
"# choose a name for your cluster\n",
"compute_name = os.environ.get('AML_COMPUTE_CLUSTER_NAME', 'cpu-cluster')\n",
"compute_min_nodes = os.environ.get('AML_COMPUTE_CLUSTER_MIN_NODES', 0)\n",
"compute_max_nodes = os.environ.get('AML_COMPUTE_CLUSTER_MAX_NODES', 4)\n",
"\n",
"# This example uses a CPU VM. To use a GPU VM, set the SKU to STANDARD_NC6\n",
"vm_size = os.environ.get('AML_COMPUTE_CLUSTER_SKU', 'STANDARD_D2_V2')\n",
"\n",
"\n",
"if compute_name in ws.compute_targets:\n",
"    compute_target = ws.compute_targets[compute_name]\n",
"    if compute_target and isinstance(compute_target, AmlCompute):\n",
"        print('Found existing compute target, using it: ' + compute_name)\n",
"else:\n",
"    print('creating a new compute target...')\n",
"    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,\n",
"                                                                min_nodes=compute_min_nodes,\n",
"                                                                max_nodes=compute_max_nodes)\n",
"\n",
"    # create the cluster\n",
"    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)\n",
"\n",
"    # can poll for a minimum number of nodes and for a specific timeout.\n",
"    # if no min node count is provided it will use the scale settings for the cluster\n",
"    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
"\n",
"    # For a more detailed view of current AmlCompute status, use get_status()\n",
"    print(compute_target.get_status().serialize())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You now have the necessary packages and compute resources to train a model in the cloud.\n",
"## Use datasets directly in training\n",
"\n",
"### Create a TabularDataset\n",
"By creating a dataset, you create a reference to the data source location. If you apply any subsetting transformations to the dataset, they are stored as part of the dataset definition as well; a short optional sketch of this follows once the dataset is created below. The data remains in its existing location, so no extra storage cost is incurred.\n",
"\n",
"Every workspace comes with a default [datastore](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data) (and you can register more) which is backed by the Azure blob storage account associated with the workspace. We can use it to transfer data from local to the cloud, and create a dataset from it. We will now upload the [Iris data](./train-dataset/Iris.csv) to the default datastore (blob) within your workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datastore = ws.get_default_datastore()\n",
"datastore.upload_files(files = ['./train-dataset/iris.csv'],\n",
"                       target_path = 'train-dataset/tabular/',\n",
"                       overwrite = True,\n",
"                       show_progress = True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we will create an unregistered TabularDataset pointing to the path in the datastore. You can also create a dataset from multiple paths. [Learn more](https://aka.ms/azureml/howto/createdatasets)\n",
"\n",
"[TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a pandas or Spark DataFrame. You can create a TabularDataset object from .csv, .tsv, and .parquet files, and from SQL query results. For a complete list, see the [TabularDatasetFactory](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py) class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"dataset-remarks-tabular-sample"
]
},
"outputs": [],
"source": [
"from azureml.core import Dataset\n",
"dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'train-dataset/tabular/iris.csv')])\n",
"\n",
"# preview the first 3 rows of the dataset\n",
"dataset.take(3).to_pandas_dataframe()"
]
},
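{
"cell_type": "markdown",
"metadata": {},
"source": [
"Subsetting transformations such as `keep_columns()` and `take()` are recorded as part of the dataset definition; the underlying file in the datastore is not modified. The next optional cell is a minimal sketch of this, assuming the iris column names used later in the training script."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: transformations are stored with the dataset definition, not applied to the source file\n",
"feature_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']\n",
"features_only = dataset.keep_columns(feature_columns)\n",
"\n",
"# preview the transformed view; iris.csv in the datastore is unchanged\n",
"features_only.take(3).to_pandas_dataframe()"
]
},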
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a training script\n",
"\n",
"To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train_iris.py` in the script_folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"script_folder = os.path.join(os.getcwd(), 'train-dataset')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile $script_folder/train_iris.py\n",
"\n",
"import os\n",
"\n",
"from azureml.core import Dataset, Run\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"# sklearn.externals.joblib is removed in 0.23\n",
"from sklearn import __version__ as sklearnver\n",
"from packaging.version import Version\n",
"if Version(sklearnver) < Version(\"0.23.0\"):\n",
"    from sklearn.externals import joblib\n",
"else:\n",
"    import joblib\n",
"\n",
"run = Run.get_context()\n",
"# get the input dataset by the name assigned via as_named_input()\n",
"dataset = run.input_datasets['iris']\n",
"\n",
"# materialize the TabularDataset into a pandas DataFrame\n",
"df = dataset.to_pandas_dataframe()\n",
"\n",
"x_col = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']\n",
"y_col = ['species']\n",
"x_df = df.loc[:, x_col]\n",
"y_df = df.loc[:, y_col]\n",
"\n",
"# dividing X, y into train and test data\n",
"x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=223)\n",
"\n",
"data = {'train': {'X': x_train, 'y': y_train},\n",
"        'test': {'X': x_test, 'y': y_test}}\n",
"\n",
"clf = DecisionTreeClassifier().fit(data['train']['X'], data['train']['y'])\n",
"model_file_name = 'decision_tree.pkl'\n",
"\n",
"print('Accuracy of Decision Tree classifier on training set: {:.2f}'.format(clf.score(x_train, y_train)))\n",
"print('Accuracy of Decision Tree classifier on test set: {:.2f}'.format(clf.score(x_test, y_test)))\n",
"\n",
"# files written to the 'outputs' folder are automatically uploaded to the run record\n",
"os.makedirs('./outputs', exist_ok=True)\n",
"joblib.dump(value=clf, filename=os.path.join('outputs', model_file_name))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create an environment\n",
"\n",
"Define a conda environment YAML file with your training script dependencies and create an Azure ML environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile conda_dependencies.yml\n",
"\n",
"dependencies:\n",
"- python=3.6.2\n",
"- scikit-learn\n",
"- pip:\n",
"  - azureml-defaults\n",
"  - packaging"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Environment\n",
"\n",
"sklearn_env = Environment.from_conda_specification(name = 'sklearn-env', file_path = './conda_dependencies.yml')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure training run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A ScriptRunConfig object specifies the configuration details of your training job, including your training script, the environment to use, and the compute target to run on. Specify the following in your script run configuration:\n",
"* The directory that contains your scripts. All the files in this directory are uploaded to the cluster nodes for execution.\n",
"* The training script name, train_iris.py\n",
"* The input dataset for training, passed as an argument to your training script. `as_named_input()` is required so that the input dataset can be referenced by the assigned name in your training script.\n",
"* The compute target. In this case you will use the AmlCompute you created.\n",
"* The environment definition for the experiment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import ScriptRunConfig\n",
"\n",
"src = ScriptRunConfig(source_directory=script_folder,\n",
"                      script='train_iris.py',\n",
"                      arguments=[dataset.as_named_input('iris')],\n",
"                      compute_target=compute_target,\n",
"                      environment=sklearn_env)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit job to run\n",
"Submit the ScriptRunConfig to the Azure ML experiment to kick off the execution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run = exp.submit(src)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"\n",
"# monitor the run\n",
"RunDetails(run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use datasets to mount files to a remote compute\n",
"\n",
"You can use the `Dataset` object to mount or download the files it references. When you mount a file system, you attach that file system to a directory (mount point) and make it available to the system. Because mounting loads files only at the time of processing, it is usually faster than downloading. A short optional download/mount sketch follows after the FileDataset is created below.<br>\n",
"Note: mounting is only available for Linux-based compute (DSVM/VM, AMLCompute, HDInsight)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Upload data files into datastore\n",
"We will first load the diabetes data from `scikit-learn` and save it to the train-dataset folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_diabetes\n",
"import numpy as np\n",
"\n",
"training_data = load_diabetes()\n",
"np.save(file='train-dataset/features.npy', arr=training_data['data'])\n",
"np.save(file='train-dataset/labels.npy', arr=training_data['target'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's upload the 2 files into the default datastore under a path named `diabetes`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datastore.upload_files(['train-dataset/features.npy', 'train-dataset/labels.npy'], target_path='diabetes', overwrite=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a FileDataset\n",
"\n",
"[FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) references single or multiple files in your datastores or public URLs. Using this method, you can download or mount the files to your compute as a FileDataset object. The files can be in any format, which enables a wider range of machine learning scenarios, including deep learning."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Dataset\n",
"\n",
"dataset = Dataset.File.from_files(path = [(datastore, 'diabetes/')])\n",
"\n",
"# see a list of files referenced by dataset\n",
"dataset.to_path()"
]
},
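{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before using the dataset in a remote run, you can also download or mount it locally. The next optional cell is a minimal sketch: `download()` copies the referenced files to a local path, while `mount()` (Linux-based compute only) exposes them through a mount point without copying. The temporary local folder used here is just an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tempfile\n",
"\n",
"# optional: download the files referenced by the FileDataset to a local folder\n",
"local_folder = tempfile.mkdtemp()\n",
"downloaded_paths = dataset.download(target_path=local_folder, overwrite=True)\n",
"print(downloaded_paths)\n",
"\n",
"# on Linux-based compute you could mount instead of downloading, for example:\n",
"# with dataset.mount() as mount_context:\n",
"#     print(os.listdir(mount_context.mount_point))"
]
},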
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a training script\n",
"\n",
"To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train_diabetes.py` in the script_folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile $script_folder/train_diabetes.py\n",
"\n",
"import os\n",
"import glob\n",
"import argparse\n",
"\n",
"from azureml.core.run import Run\n",
"from sklearn.linear_model import Ridge\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"# sklearn.externals.joblib is removed in 0.23\n",
"from sklearn import __version__ as sklearnver\n",
"from packaging.version import Version\n",
"if Version(sklearnver) < Version(\"0.23.0\"):\n",
"    from sklearn.externals import joblib\n",
"else:\n",
"    import joblib\n",
"\n",
"import numpy as np\n",
"\n",
"parser = argparse.ArgumentParser()\n",
"parser.add_argument('--data-folder', type=str, help='training dataset')\n",
"args = parser.parse_args()\n",
"\n",
"# files written to the 'outputs' folder are automatically uploaded to the run record\n",
"os.makedirs('./outputs', exist_ok=True)\n",
"\n",
"# mount point of the FileDataset on the compute target\n",
"base_path = args.data_folder\n",
"\n",
"run = Run.get_context()\n",
"\n",
"X = np.load(glob.glob(os.path.join(base_path, '**/features.npy'), recursive=True)[0])\n",
"y = np.load(glob.glob(os.path.join(base_path, '**/labels.npy'), recursive=True)[0])\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    X, y, test_size=0.2, random_state=0)\n",
"data = {'train': {'X': X_train, 'y': y_train},\n",
"        'test': {'X': X_test, 'y': y_test}}\n",
"\n",
"# list of numbers from 0.0 to 0.95 with a 0.05 interval\n",
"alphas = np.arange(0.0, 1.0, 0.05)\n",
"\n",
"for alpha in alphas:\n",
"    # use Ridge algorithm to create a regression model\n",
"    reg = Ridge(alpha=alpha)\n",
"    reg.fit(data['train']['X'], data['train']['y'])\n",
"\n",
"    preds = reg.predict(data['test']['X'])\n",
"    mse = mean_squared_error(preds, data['test']['y'])\n",
"    run.log('alpha', alpha)\n",
"    run.log('mse', mse)\n",
"\n",
"    model_file_name = 'ridge_{0:.2f}.pkl'.format(alpha)\n",
"    joblib.dump(value=reg, filename=os.path.join('outputs', model_file_name))\n",
"\n",
"    print('alpha is {0:.2f}, and mse is {1:0.2f}'.format(alpha, mse))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure & Run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now configure your run. We will reuse the same `sklearn_env` environment from the previous run. Once the environment is built, and if you don't change your dependencies, it will be reused in subsequent runs.\n",
"\n",
"We will pass the DatasetConsumptionConfig of our FileDataset to the `'--data-folder'` argument of the script. Azure ML will resolve this to the mount point of the data on the compute target, which we parse in the training script."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import ScriptRunConfig\n",
"\n",
"src = ScriptRunConfig(source_directory=script_folder,\n",
"                      script='train_diabetes.py',\n",
"                      # to mount the dataset on the remote compute and pass the mounted path as an argument to the training script\n",
"                      arguments=['--data-folder', dataset.as_mount()],\n",
"                      compute_target=compute_target,\n",
"                      environment=sklearn_env)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run = exp.submit(config=src)\n",
"\n",
"# monitor the run\n",
"RunDetails(run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Display run results\n",
"You now have a model trained on a remote cluster. Retrieve all the metrics logged during the run, including the alpha values and the corresponding MSE of each model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run.wait_for_completion()\n",
"metrics = run.get_metrics()\n",
"print(metrics)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Register datasets\n",
"Use the register() method to register datasets to your workspace so they can be shared with others, reused across various experiments, and referred to by name in your training script."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset = dataset.register(workspace = ws,\n",
"                           name = 'diabetes dataset',\n",
"                           description='training dataset',\n",
"                           create_new_version=True)"
]
},
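{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once registered, the dataset can be retrieved by name from any notebook or experiment that has access to the workspace. The next optional cell is a small sketch of that lookup."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: retrieve the registered dataset by name (latest version by default)\n",
"registered_dataset = Dataset.get_by_name(workspace=ws, name='diabetes dataset')\n",
"print(registered_dataset.name, registered_dataset.version)"
]
},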
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Register models with datasets\n",
"The last step in the training script wrote the model files to a directory named `outputs` on the VM of the cluster where the job is executed. `outputs` is a special directory in that all content in this directory is automatically uploaded to your workspace. This content appears in the run record in the experiment under your workspace. Hence, the model files are now also available in your workspace.\n",
"\n",
"You can register models with datasets for reproducibility and auditing purposes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# find the index where MSE is the smallest\n",
"indices = list(range(0, len(metrics['mse'])))\n",
"min_mse_index = min(indices, key=lambda x: metrics['mse'][x])\n",
"\n",
"print('When alpha is {1:0.2f}, we have min MSE {0:0.2f}.'.format(\n",
"    metrics['mse'][min_mse_index],\n",
"    metrics['alpha'][min_mse_index]\n",
"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# find the best model\n",
"best_alpha = metrics['alpha'][min_mse_index]\n",
"model_file_name = 'ridge_{0:.2f}.pkl'.format(best_alpha)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# register the best model with the input dataset\n",
"model = run.register_model(model_name='sklearn_diabetes', model_path=os.path.join('outputs', model_file_name),\n",
"                           datasets=[('training data', dataset)])"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"category": "tutorial",
"compute": [
"Remote"
],
"datasets": [
"Iris",
"Diabetes"
],
"deployment": [
"None"
],
"exclude_from_index": false,
"framework": [
"Azure ML"
],
"friendly_name": "Train with Datasets (Tabular and File)",
"index_order": 1,
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
},
"star_tag": [
"featured"
],
"tags": [
"Dataset",
"Estimator",
"ScriptRun"
],
"task": "Train"
},
"nbformat": 4,
"nbformat_minor": 2
}