{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# AutoML 08: Remote Execution with Text File\n",
"\n",
"This sample accesses a data file on a remote DSVM. This is more efficient than reading the file from Azure Blob storage in the `get_data` method.\n",
"\n",
"Make sure you have executed the [00.configuration](00.configuration.ipynb) before running this notebook.\n",
"\n",
"In this notebook you will learn how to:\n",
"1. Configure the DSVM to allow files to be accessed directly by the `get_data` function.\n",
"2. Using `get_data` to return data from a local file.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create an Experiment\n",
"\n",
"As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"import os\n",
"import random\n",
"\n",
"from matplotlib import pyplot as plt\n",
"from matplotlib.pyplot import imshow\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn import datasets\n",
"\n",
"import azureml.core\n",
"from azureml.core.experiment import Experiment\n",
"from azureml.core.workspace import Workspace\n",
"from azureml.train.automl import AutoMLConfig\n",
"from azureml.train.automl.run import AutoMLRun"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# choose a name for experiment\n",
"experiment_name = 'automl-remote-dsvm-file'\n",
"# project folder\n",
"project_folder = './sample_projects/automl-remote-dsvm-file'\n",
"\n",
"experiment=Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output['SDK version'] = azureml.core.VERSION\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Project Directory'] = project_folder\n",
"output['Experiment Name'] = experiment.name\n",
"pd.set_option('display.max_colwidth', -1)\n",
"pd.DataFrame(data=output, index=['']).T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Diagnostics\n",
"\n",
"Opt-in diagnostics for better experience, quality, and security of future releases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.telemetry import set_diagnostics_collection\n",
"set_diagnostics_collection(send_diagnostics = True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a Remote Linux DSVM\n",
"**Note:** If creation fails with a message about Marketplace purchase eligibilty, start creation of a DSVM through the [Azure portal](https://portal.azure.com), and select \"Want to create programmatically\" to enable programmatic creation. Once you've enabled this setting, you can exit the portal without actually creating the DSVM, and creation of the DSVM through the notebook should work.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import DsvmCompute\n",
"\n",
"dsvm_name = 'mydsvm'\n",
"try:\n",
" dsvm_compute = DsvmCompute(ws, dsvm_name)\n",
" print('Found existing DSVM.')\n",
"except:\n",
" print('Creating a new DSVM.')\n",
" dsvm_config = DsvmCompute.provisioning_configuration(vm_size = \"Standard_D2_v2\")\n",
" dsvm_compute = DsvmCompute.create(ws, name = dsvm_name, provisioning_configuration = dsvm_config)\n",
" dsvm_compute.wait_for_completion(show_output = True)"
]
},
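{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you no longer need the DSVM, you can delete it to stop incurring charges. The call below is commented out so it does not run accidentally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Delete the DSVM once you are done with this notebook to stop incurring charges.\n",
"# dsvm_compute.delete()"
]
},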
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Copy the Data File to the DSVM\n",
"Download the data file and copy the data file to the `/tmp/data` on the DSVM."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv\",\n",
" delimiter=\"\\t\", quotechar='\"')\n",
"df.to_csv(\"data.tsv\", sep=\"\\t\", quotechar='\"', index=False)\n",
"\n",
"# Now copy the file data.tsv to the folder /tmp/data on the DSVM."
]
},
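{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above only writes `data.tsv` to the local working directory. A minimal sketch of the copy step follows, assuming the DSVM accepts SSH connections from your machine; the address and admin user name are hypothetical placeholders you should replace with the values shown for your DSVM in the Azure portal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import subprocess\n",
"\n",
"# Hypothetical placeholders; replace with your DSVM's address and admin user name.\n",
"dsvm_address = '<dsvm-ip-or-dns-name>'\n",
"dsvm_user = '<admin-username>'\n",
"\n",
"# Create the target folder on the DSVM, then copy the file into it.\n",
"subprocess.run(['ssh', '{0}@{1}'.format(dsvm_user, dsvm_address), 'mkdir -p /tmp/data'], check = True)\n",
"subprocess.run(['scp', 'data.tsv', '{0}@{1}:/tmp/data/data.tsv'.format(dsvm_user, dsvm_address)], check = True)"
]
},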
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create the `get_data.py` File\n",
"For remote executions you should author a `get_data.py` file containing a `get_data()` function. This file should be in the root directory of the project. You can encapsulate code to read data either from a blob storage or local disk in this file.\n",
"In this example, the `get_data()` function returns a [dictionary](README.md#getdata)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if not os.path.exists(project_folder):\n",
" os.makedirs(project_folder)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile $project_folder/get_data.py\n",
"\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import LabelEncoder\n",
"import os\n",
"\n",
"def get_data():\n",
" # Burning man 2016 data\n",
" df = pd.read_csv('/tmp/data/data.tsv',\n",
" delimiter=\"\\t\", quotechar='\"')\n",
" # get integer labels\n",
" le = LabelEncoder()\n",
" le.fit(df[\"Label\"].values)\n",
" y = le.transform(df[\"Label\"].values)\n",
" df = df.drop([\"Label\"], axis=1)\n",
"\n",
" df_train, _, y_train, _ = train_test_split(df, y, test_size = 0.1, random_state = 42)\n",
"\n",
" return { \"X\" : df.values, \"y\" : y }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure AutoML <a class=\"anchor\" id=\"Instatiate-AutoML-Remote-DSVM\"></a>\n",
"\n",
"You can specify `automl_settings` as `**kwargs` as well. Also note that you can use a `get_data()` function for local excutions too.\n",
"\n",
"**Note:** When using Remote DSVM, you can't pass Numpy arrays directly to the fit method.\n",
"\n",
"|Property|Description|\n",
"|-|-|\n",
"|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>balanced_accuracy</i><br><i>average_precision_score_weighted</i><br><i>precision_score_weighted</i>|\n",
"|**max_time_sec**|Time limit in seconds for each iteration.|\n",
"|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n",
"|**n_cross_validations**|Number of cross validation splits.|\n",
"|**concurrent_iterations**|Maximum number of iterations that would be executed in parallel. This should be less than the number of cores on the DSVM.|\n",
"|**preprocess**|Setting this to *True* enables AutoML to perform preprocessing on the input to handle *missing data*, and perform some common *feature extraction*.|\n",
"|**max_cores_per_iteration**|Indicates how many cores on the compute target would be used to train a single pipeline.<br>Default is *1*, you can set it to *-1* to use all cores.|"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_settings = {\n",
" \"max_time_sec\": 3600,\n",
" \"iterations\": 10,\n",
" \"n_cross_validations\": 5,\n",
" \"primary_metric\": 'AUC_weighted',\n",
" \"preprocess\": True,\n",
" \"max_cores_per_iteration\": 2,\n",
" \"verbosity\": logging.INFO\n",
"}\n",
"automl_config = AutoMLConfig(task = 'classification',\n",
" debug_log = 'automl_errors.log',\n",
" path = project_folder,\n",
" compute_target = dsvm_compute,\n",
" data_script = project_folder + \"/get_data.py\",\n",
" **automl_settings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train the Model <a class=\"anchor\" id=\"Training-the-model-Remote-DSVM\"></a>\n",
"\n",
"Call the `submit` method on the experiment object and pass the run configuration. For remote runs the execution is asynchronous, so you will see the iterations get populated as they complete. You can interact with the widgets and models even when the experiment is running to retrieve the best model up to that point. Once you are satisfied with the model, you can cancel a particular iteration or the whole run.\n",
"\n",
"In this example, we specify `show_output = False` to suppress console output while the run is in progress."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run = experiment.submit(automl_config, show_output = False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exploring the results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Widget for Monitoring Runs\n",
"\n",
"The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n",
"\n",
"You can click on a pipeline to see run properties and output logs. Logs are also available on the DSVM under `/tmp/azureml_run/{iterationid}/azureml-logs`.\n",
"\n",
"**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.train.widgets import RunDetails\n",
"RunDetails(remote_run).show() "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Retrieve All Child Runs\n",
"You can also use SDK methods to fetch all the child runs and see individual metrics that we log."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"children = list(remote_run.get_children())\n",
"metricslist = {}\n",
"for run in children:\n",
" properties = run.get_properties()\n",
" metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
" metricslist[int(properties['iteration'])] = metrics\n",
"\n",
"rundata = pd.DataFrame(metricslist).sort_index(1)\n",
"rundata"
]
},
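{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also plot a logged metric across iterations. The sketch below assumes the primary metric `AUC_weighted` appears in the metrics retrieved above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Plot the primary metric for each iteration.\n",
"rundata.loc['AUC_weighted'].plot(marker = 'o')\n",
"plt.xlabel('Iteration')\n",
"plt.ylabel('AUC_weighted')\n",
"plt.show()"
]
},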
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cancelling Runs\n",
"You can cancel ongoing remote runs using the `cancel` and `cancel_iteration` functions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Cancel the ongoing experiment and stop scheduling new iterations.\n",
"# remote_run.cancel()\n",
"\n",
"# Cancel iteration 1 and move onto iteration 2.\n",
"# remote_run.cancel_iteration(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve the Best Model\n",
"\n",
"Below we select the best pipeline from our iterations. The `get_output` method on `automl_classifier` returns the best run and the fitted model for the last invocation. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run, fitted_model = remote_run.get_output()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Best Model Based on Any Other Metric\n",
"Show the run and the model which has the smallest `accuracy` value:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# lookup_metric = \"accuracy\"\n",
"# best_run, fitted_model = remote_run.get_output(metric = lookup_metric)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Model from a Specific Iteration\n",
"Show the run and the model from the first iteration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# iteration = 1\n",
"# best_run, fitted_model = remote_run.get_output(iteration = iteration)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Register the Fitted Model for Deployment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"description = 'AutoML Model'\n",
"tags = None\n",
"remote_run.register_model(description = description, tags = tags)\n",
"remote_run.model_id # Use this id to deploy the model as a web service in Azure."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test the Best Fitted Model <a class=\"anchor\" id=\"Testing-the-Fitted-Model-Remote-DSVM\"></a>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sklearn\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from pandas_ml import ConfusionMatrix\n",
"\n",
"df = pd.read_csv(\"https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv\",\n",
" delimiter=\"\\t\", quotechar='\"')\n",
"\n",
"# get integer labels\n",
"le = LabelEncoder()\n",
"le.fit(df[\"Label\"].values)\n",
"y = le.transform(df[\"Label\"].values)\n",
"df = df.drop([\"Label\"], axis = 1)\n",
"\n",
"_, df_test, _, y_test = train_test_split(df, y, test_size = 0.1, random_state = 42)\n",
"\n",
"y_pred = fitted_model.predict(df_test.values)\n",
"\n",
"y_pred_strings = le.inverse_transform(y_pred)\n",
"y_test_strings = le.inverse_transform(y_test)\n",
"\n",
"cm = ConfusionMatrix(y_test_strings, y_pred_strings)\n",
"\n",
"print(cm)\n",
"\n",
"cm.plot()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}