MachineLearningNotebooks/how-to-use-azureml/work-with-data/datasets/datasets-tutorial/tabular-dataset-tutorial.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Tutorial: Learn how to use TabularDatasets in Azure Machine Learning"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In this tutorial, you will learn how to use Azure Machine Learning Datasets to train a classification model with the Azure Machine Learning SDK for Python. You will:\n",
        "\n",
        "&#x2611; Setup a Python environment and import packages\n",
        "\n",
        "&#x2611; Load the Titanic data from your Azure Blob Storage. (The [original data](https://www.kaggle.com/c/titanic/data) can be found on Kaggle)\n",
        "\n",
        "&#x2611; Create and register a TabularDataset in your workspace\n",
        "\n",
        "&#x2611; Train a classification model using the TabularDataset"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Pre-requisites:\n",
        "To create and work with datasets, you need:\n",
        "* An Azure subscription. If you don\u00e2\u20ac\u2122t have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning service](https://aka.ms/AMLFree) today.\n",
        "* An [Azure Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-workspace)\n",
        "* The [Azure Machine Learning SDK for Python installed](https://docs.microsoft.com/python/api/overview/azure/ml/install?view=azure-ml-py), which includes the azureml-datasets package.\n",
        "\n",
        "Data and train.py script to store in your Azure Blob Storage Account.\n",
        " * [Titanic data](./train-dataset/Titanic.csv)\n",
        " * [train.py](./train-dataset/train.py)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Initialize a Workspace\n",
        "\n",
        "Initialize a workspace object from persisted configuration"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import azureml.core\n",
        "from azureml.core import Workspace, Datastore, Dataset\n",
        "\n",
        "# Get existing workspace from config.json file in the same folder as the tutorial notebook\n",
        "# You can download the config file from your workspace\n",
        "workspace = Workspace.from_config()\n",
        "print(workspace)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Create a TabularDataset\n",
        "\n",
        "Datasets are categorized into various types based on how users consume them in training. In this tutorial, you will create and use a TabularDataset in training. A TabularDataset represents data in a tabular format by parsing the provided file or list of files. TabularDataset can be created from csv, tsv, parquet files, SQL query results etc. For the complete list, please visit our [documentation](https://aka.ms/tabulardataset-api-reference). It provides you with the ability to materialize the data into a pandas DataFrame."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "By creating a dataset, you create a reference to the data source location, along with a copy of its metadata. The data remains in  its existing location, so no extra storage cost is incurred.\n",
        "\n",
        "We will now upload the [Titanic data](./train-dataset/Titanic.csv) to the default datastore(blob) within your workspace.."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "datastore = workspace.get_default_datastore()\n",
        "datastore.upload_files(files = ['./train-dataset/Titanic.csv'],\n",
        "                       target_path = 'train-dataset/',\n",
        "                       overwrite = True,\n",
        "                       show_progress = True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Then we will create an unregistered TabularDataset pointing to the path in the datastore. We also support create a Dataset from multiple paths. [learn more](https://aka.ms/azureml/howto/createdatasets) "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'train-dataset/Titanic.csv')])\n",
        "\n",
        "#preview the first 3 rows of the dataset\n",
        "dataset.take(3).to_pandas_dataframe()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Use the `register()` method to register datasets to your workspace so they can be shared with others, reused across various experiments, and refered to by name in your training script."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dataset = dataset.register(workspace = workspace,\n",
        "                           name = 'titanic dataset',\n",
        "                           description='training dataset')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Create or Attach existing AmlCompute\n",
        "You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your training. In this tutorial, you create `AmlCompute` as your training compute resource.\n",
        "\n",
        "**Creation of AmlCompute takes approximately 5 minutes.** If the AmlCompute with that name is already in your workspace this code will skip the creation process.\n",
        "\n",
        "As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.compute import AmlCompute\n",
        "from azureml.core.compute import ComputeTarget\n",
        "\n",
        "# Choose a name for your cluster.\n",
        "amlcompute_cluster_name = \"your cluster name\"\n",
        "\n",
        "found = False\n",
        "# Check if this compute target already exists in the workspace.\n",
        "cts = workspace.compute_targets\n",
        "if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':\n",
        "    found = True\n",
        "    print('Found existing compute target.')\n",
        "    compute_target = cts[amlcompute_cluster_name]\n",
        "\n",
        "if not found:\n",
        "    print('Creating a new compute target...')\n",
        "    provisioning_config = AmlCompute.provisioning_configuration(vm_size = \"STANDARD_D2_V2\", # for GPU, use \"STANDARD_NC6\"\n",
        "                                                                #vm_priority = 'lowpriority', # optional\n",
        "                                                                max_nodes = 6)\n",
        "\n",
        "    # Create the cluster.\\n\",\n",
        "    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)\n",
        "\n",
        "print('Checking cluster status...')\n",
        "# Can poll for a minimum number of nodes and for a specific timeout.\n",
        "# If no min_node_count is provided, it will use the scale settings for the cluster.\n",
        "compute_target.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)\n",
        "\n",
        "# For a more detailed view of current AmlCompute status, use get_status()."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Create an Experiment\n",
        "**Experiment** is a logical container in an Azure ML Workspace. It hosts run records which can include run metrics and output artifacts from your experiments."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core import Experiment\n",
        "\n",
        "experiment_name = 'training-datasets'\n",
        "experiment = Experiment(workspace = workspace, name = experiment_name)\n",
        "project_folder = './train-dataset/'"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Configure & Run"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.runconfig import RunConfiguration\n",
        "from azureml.core.conda_dependencies import CondaDependencies\n",
        "import pkg_resources\n",
        "\n",
        "# create a new RunConfig object\n",
        "conda_run_config = RunConfiguration(framework=\"python\")\n",
        "\n",
        "# Set compute target to AmlCompute\n",
        "conda_run_config.target = compute_target\n",
        "conda_run_config.environment.docker.enabled = True\n",
        "conda_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE\n",
        "\n",
        "dprep_dependency = 'azureml-dataprep==' + pkg_resources.get_distribution(\"azureml-dataprep\").version\n",
        "\n",
        "cd = CondaDependencies.create(pip_packages=['azureml-sdk', 'scikit-learn', 'pandas', dprep_dependency])\n",
        "conda_run_config.environment.python.conda_dependencies = cd"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# create a new RunConfig object\n",
        "run_config = RunConfiguration()\n",
        "\n",
        "run_config.environment.python.user_managed_dependencies = True\n",
        "\n",
        "from azureml.core import Run\n",
        "from azureml.core import ScriptRunConfig\n",
        "\n",
        "src = ScriptRunConfig(source_directory=project_folder, \n",
        "                      script='train.py', \n",
        "                      run_config=conda_run_config) \n",
        "run = experiment.submit(config=src)\n",
        "run.wait_for_completion(show_output=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## View run history details"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You have now finished using a dataset from start to finish of your experiment!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/datasets/datasets-tutorial/datasets-tutorial.png)"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "cforbe"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.8"
    },
    "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
  },
  "nbformat": 4,
  "nbformat_minor": 2
}