MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/dataset-remote-execution/auto-ml-dataset-remote-execution.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
        "\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/dataprep-remote-execution/auto-ml-dataprep-remote-execution.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Automated Machine Learning\n",
        "_**Load Data using `TabularDataset` for Remote Execution (AmlCompute)**_\n",
        "\n",
        "## Contents\n",
        "1. [Introduction](#Introduction)\n",
        "1. [Setup](#Setup)\n",
        "1. [Data](#Data)\n",
        "1. [Train](#Train)\n",
        "1. [Results](#Results)\n",
        "1. [Test](#Test)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Introduction\n",
        "In this example we showcase how you can use AzureML Dataset to load data for AutoML.\n",
        "\n",
        "Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n",
        "\n",
        "In this notebook you will learn how to:\n",
        "1. Create a `TabularDataset` pointing to the training data.\n",
        "2. Pass the `TabularDataset` to AutoML for a remote run."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Setup"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import logging\n",
        "\n",
        "import pandas as pd\n",
        "\n",
        "import azureml.core\n",
        "from azureml.core.experiment import Experiment\n",
        "from azureml.core.workspace import Workspace\n",
        "from azureml.core.dataset import Dataset\n",
        "from azureml.train.automl import AutoMLConfig"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "\n",
        "# choose a name for experiment\n",
        "experiment_name = 'automl-dataset-remote-bai'\n",
        " \n",
        "experiment = Experiment(ws, experiment_name)\n",
        " \n",
        "output = {}\n",
        "output['SDK version'] = azureml.core.VERSION\n",
        "output['Subscription ID'] = ws.subscription_id\n",
        "output['Workspace Name'] = ws.name\n",
        "output['Resource Group'] = ws.resource_group\n",
        "output['Location'] = ws.location\n",
        "output['Experiment Name'] = experiment.name\n",
        "pd.set_option('display.max_colwidth', -1)\n",
        "outputDf = pd.DataFrame(data = output, index = [''])\n",
        "outputDf.T"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# The data referenced here was a 1MB simple random sample of the Chicago Crime data into a local temporary directory.\n",
        "example_data = 'https://dprepdata.blob.core.windows.net/demo/crime0-random.csv'\n",
        "dataset = Dataset.Tabular.from_delimited_files(example_data)\n",
        "dataset.take(5).to_pandas_dataframe()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Review the data\n",
        "\n",
        "You can peek the result of a `TabularDataset` at any range using `skip(i)` and `take(j).to_pandas_dataframe()`. Doing so evaluates only `j` records, which makes it fast even against large datasets.\n",
        "\n",
        "`TabularDataset` objects are immutable and are composed of a list of subsetting transformations (optional)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "X = dataset.drop_columns(columns=['Primary Type', 'FBI Code'])\n",
        "y = dataset.keep_columns(columns=['Primary Type'], validate=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Train\n",
        "\n",
        "This creates a general AutoML settings object applicable for both local and remote runs."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "automl_settings = {\n",
        "    \"iteration_timeout_minutes\" : 10,\n",
        "    \"iterations\" : 2,\n",
        "    \"primary_metric\" : 'AUC_weighted',\n",
        "    \"preprocess\" : True,\n",
        "    \"verbosity\" : logging.INFO\n",
        "}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create or Attach an AmlCompute cluster"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.compute import AmlCompute\n",
        "from azureml.core.compute import ComputeTarget\n",
        "\n",
        "# Choose a name for your cluster.\n",
        "amlcompute_cluster_name = \"automlc2\"\n",
        "\n",
        "found = False\n",
        "\n",
        "# Check if this compute target already exists in the workspace.\n",
        "\n",
        "cts = ws.compute_targets\n",
        "if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':\n",
        "    found = True\n",
        "    print('Found existing compute target.')\n",
        "    compute_target = cts[amlcompute_cluster_name]\n",
        "\n",
        "if not found:\n",
        "    print('Creating a new compute target...')\n",
        "    provisioning_config = AmlCompute.provisioning_configuration(vm_size = \"STANDARD_D2_V2\", # for GPU, use \"STANDARD_NC6\"\n",
        "                                                                #vm_priority = 'lowpriority', # optional\n",
        "                                                                max_nodes = 6)\n",
        "\n",
        "    # Create the cluster.\\n\",\n",
        "    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)\n",
        "\n",
        "print('Checking cluster status...')\n",
        "# Can poll for a minimum number of nodes and for a specific timeout.\n",
        "# If no min_node_count is provided, it will use the scale settings for the cluster.\n",
        "compute_target.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)\n",
        "\n",
        "# For a more detailed view of current AmlCompute status, use get_status()."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.runconfig import RunConfiguration\n",
        "from azureml.core.conda_dependencies import CondaDependencies\n",
        "import pkg_resources\n",
        "\n",
        "# create a new RunConfig object\n",
        "conda_run_config = RunConfiguration(framework=\"python\")\n",
        "\n",
        "# Set compute target to AmlCompute\n",
        "conda_run_config.target = compute_target\n",
        "conda_run_config.environment.docker.enabled = True\n",
        "\n",
        "cd = CondaDependencies.create(conda_packages=['numpy','py-xgboost<=0.80'])\n",
        "conda_run_config.environment.python.conda_dependencies = cd"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Pass Data with `TabularDataset` Objects\n",
        "\n",
        "The `TabularDataset` objects captured above can also be passed to the `submit` method for a remote run. AutoML will serialize the `TabularDataset` object and send it to the remote compute target. The `TabularDataset` will not be evaluated locally."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "automl_config = AutoMLConfig(task = 'classification',\n",
        "                             debug_log = 'automl_errors.log',\n",
        "                             run_configuration=conda_run_config,\n",
        "                             X = X,\n",
        "                             y = y,\n",
        "                             **automl_settings)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "remote_run = experiment.submit(automl_config, show_output = True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "remote_run"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Pre-process cache cleanup\n",
        "The preprocess data gets cache at user default file store. When the run is completed the cache can be cleaned by running below cell"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "remote_run.clean_preprocessor_cache()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Cancelling Runs\n",
        "You can cancel ongoing remote runs using the `cancel` and `cancel_iteration` functions."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Cancel the ongoing experiment and stop scheduling new iterations.\n",
        "# remote_run.cancel()\n",
        "\n",
        "# Cancel iteration 1 and move onto iteration 2.\n",
        "# remote_run.cancel_iteration(1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Results"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Widget for Monitoring Runs\n",
        "\n",
        "The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n",
        "\n",
        "**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.widgets import RunDetails\n",
        "RunDetails(remote_run).show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Retrieve All Child Runs\n",
        "You can also use SDK methods to fetch all the child runs and see individual metrics that we log."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "children = list(remote_run.get_children())\n",
        "metricslist = {}\n",
        "for run in children:\n",
        "    properties = run.get_properties()\n",
        "    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
        "    metricslist[int(properties['iteration'])] = metrics\n",
        "    \n",
        "rundata = pd.DataFrame(metricslist).sort_index(1)\n",
        "rundata"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Retrieve the Best Model\n",
        "\n",
        "Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "best_run, fitted_model = remote_run.get_output()\n",
        "print(best_run)\n",
        "print(fitted_model)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Best Model Based on Any Other Metric\n",
        "Show the run and the model that has the smallest `log_loss` value:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "lookup_metric = \"log_loss\"\n",
        "best_run, fitted_model = remote_run.get_output(metric = lookup_metric)\n",
        "print(best_run)\n",
        "print(fitted_model)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Model from a Specific Iteration\n",
        "Show the run and the model from the first iteration:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "iteration = 0\n",
        "best_run, fitted_model = remote_run.get_output(iteration = iteration)\n",
        "print(best_run)\n",
        "print(fitted_model)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Test\n",
        "\n",
        "#### Load Test Data\n",
        "For the test data, it should have the same preparation step as the train data. Otherwise it might get failed at the preprocessing step."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dataset_test = Dataset.Tabular.from_delimited_files(path='https://dprepdata.blob.core.windows.net/demo/crime0-test.csv')\n",
        "\n",
        "df_test = dataset_test.to_pandas_dataframe()\n",
        "df_test = df_test[pd.notnull(df_test['Primary Type'])]\n",
        "\n",
        "y_test = df_test[['Primary Type']]\n",
        "X_test = df_test.drop(['Primary Type', 'FBI Code'], axis=1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Testing Our Best Fitted Model\n",
        "We will use confusion matrix to see how our model works."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from pandas_ml import ConfusionMatrix\n",
        "\n",
        "ypred = fitted_model.predict(X_test)\n",
        "\n",
        "cm = ConfusionMatrix(y_test['Primary Type'], ypred)\n",
        "\n",
        "print(cm)\n",
        "\n",
        "cm.plot()"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "savitam"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}