MachineLearningNotebooks/automl/04.auto-ml-remote-execution-text-data-blob-store.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
    "\n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Auto ML : Remote Execution with Text data from Blobstorage\n",
    "\n",
    "In this example we use the [Burning Man 2016 dataset](https://innovate.burningman.org/datasets-page/) to showcase how you can use AutoML to handle text data from a Azure blobstorage.\n",
    "\n",
    "Make sure you have executed the [00.configuration](00.configuration.ipynb) before running this notebook.\n",
    "\n",
    "In this notebook you would see\n",
    "1. Creating an Experiment using an existing Workspace\n",
    "2. Attaching an existing DSVM to a workspace\n",
    "3. Instantiating AutoMLConfig \n",
    "4. Training the Model using the DSVM\n",
    "5. Exploring the results\n",
    "6. Testing the fitted model\n",
    "\n",
    "In addition this notebook showcases the following features\n",
    "- **Parallel** Executions for iterations\n",
    "- Asyncronous tracking of progress\n",
    "- **Cancelling** individual iterations or the entire run\n",
    "- Retrieving models for any iteration or logged metric\n",
    "- specify automl settings as **kwargs**\n",
    "- handling **text** data with **preprocess** flag\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Experiment\n",
    "\n",
    "As part of the setup you have already created a <b>Workspace</b>. For AutoML you would need to create an <b>Experiment</b>. An <b>Experiment</b> is a named object in a <b>Workspace</b>, which is used to run experiments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "import os\n",
    "import random\n",
    "\n",
    "from matplotlib import pyplot as plt\n",
    "from matplotlib.pyplot import imshow\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn import datasets\n",
    "\n",
    "import azureml.core\n",
    "from azureml.core.experiment import Experiment\n",
    "from azureml.core.workspace import Workspace\n",
    "from azureml.train.automl import AutoMLConfig\n",
    "from azureml.train.automl.run import AutoMLRun"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ws = Workspace.from_config()\n",
    "\n",
    "# choose a name for the run history container in the workspace\n",
    "experiment_name = 'automl-remote-dsvm-blobstore'\n",
    "# project folder\n",
    "project_folder = './sample_projects/automl-remote-dsvm-blobstore'\n",
    "\n",
    "experiment = Experiment(ws, experiment_name)\n",
    "\n",
    "output = {}\n",
    "output['SDK version'] = azureml.core.VERSION\n",
    "output['Subscription ID'] = ws.subscription_id\n",
    "output['Workspace'] = ws.name\n",
    "output['Resource Group'] = ws.resource_group\n",
    "output['Location'] = ws.location\n",
    "output['Project Directory'] = project_folder\n",
    "output['Experiment Name'] = experiment.name\n",
    "pd.set_option('display.max_colwidth', -1)\n",
    "pd.DataFrame(data=output, index=['']).T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Diagnostics\n",
    "\n",
    "Opt-in diagnostics for better experience, quality, and security of future releases"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.telemetry import set_diagnostics_collection\n",
    "set_diagnostics_collection(send_diagnostics=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Attach a Remote Linux DSVM\n",
    "To use remote docker commpute target:\n",
    "1. Create a Linux DSVM in Azure. Here is some [quick instructions](https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/how-to-create-dsvm-hdi). Make sure you use the Ubuntu flavor, NOT CentOS.  Make sure that disk space is available under /tmp because AutoML creates files under /tmp/azureml_runs. The DSVM should have more cores than the number of parallel runs that you plan to enable. It should also have at least 4Gb per core.\n",
    "2. Enter the IP address, username and password below\n",
    "\n",
    "**Note**: By default SSH runs on port 22 and you don't need to specify it. But if for security reasons you can switch to a different port (such as 5022), you can append the port number to the address. [Read more](https://render.githubusercontent.com/documentation/sdk/ssh-issue.md) on this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core.compute import RemoteCompute\n",
    "\n",
    "# Add your VM information below\n",
    "dsvm_name     = 'mydsvm1'\n",
    "dsvm_ip_addr  = '<<ip_addr>>'\n",
    "dsvm_username = '<<username>>'\n",
    "dsvm_password = '<<password>>'\n",
    "\n",
    "dsvm_compute = RemoteCompute.attach(workspace=ws, name=dsvm_name, address=dsvm_ip_addr, username=dsvm_username, password=dsvm_password, ssh_port=22)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Get Data File\n",
    "For remote executions you should author a get_data.py file containing a get_data() function. This file should be in the root directory of the project. You can encapsulate code to read data either from a blob storage or local disk in this file.\n",
    "\n",
    "The *get_data()* function returns a [dictionary](README.md#getdata)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.path.exists(project_folder):\n",
    "    os.makedirs(project_folder)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile $project_folder/get_data.py\n",
    "\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "def get_data():\n",
    "    # Burning man 2016 data\n",
    "    df = pd.read_csv(\"https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv\",\n",
    "                     delimiter=\"\\t\", quotechar='\"')\n",
    "    # get integer labels\n",
    "    le = LabelEncoder()\n",
    "    le.fit(df[\"Label\"].values)\n",
    "    y = le.transform(df[\"Label\"].values)\n",
    "    df = df.drop([\"Label\"], axis=1)\n",
    "\n",
    "    df_train, _, y_train, _ = train_test_split(df, y, test_size=0.1, random_state=42)\n",
    "\n",
    "    return { \"X\" : df, \"y\" : y }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View data\n",
    "\n",
    "You can execute the *get_data()* function locally to view the *train* data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%run  $project_folder/get_data.py\n",
    "data_dict = get_data()\n",
    "df = data_dict[\"X\"]\n",
    "y = data_dict[\"y\"]\n",
    "pd.set_option('display.max_colwidth', 15)\n",
    "df['Label'] = pd.Series(y, index=df.index)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instantiate AutoML <a class=\"anchor\" id=\"Instatiate-AutoML-Remote-DSVM\"></a>\n",
    "\n",
    "You can specify automl_settings as **kwargs** as well. Also note that you can use the get_data() symantic for local excutions too. \n",
    "\n",
    "<i>Note: For Remote DSVM and Batch AI you cannot pass Numpy arrays directly to the fit method.</i>\n",
    "\n",
    "|Property|Description|\n",
    "|-|-|\n",
    "|**primary_metric**|This is the metric that you want to optimize.<br> Classification supports the following primary metrics <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>balanced_accuracy</i><br><i>average_precision_score_weighted</i><br><i>precision_score_weighted</i>|\n",
    "|**max_time_sec**|Time limit in seconds for each iteration|\n",
    "|**iterations**|Number of iterations. In each iteration Auto ML trains a specific pipeline with the data|\n",
    "|**n_cross_validations**|Number of cross validation splits|\n",
    "|**concurrent_iterations**|Max number of iterations that would be executed in parallel.  This should be less than the number of cores on the DSVM\n",
    "|**preprocess**| *True/False* <br>Setting this to *True* enables AutoML to perform preprocessing <br>on the input to handle *missing data*, and perform some common *feature extraction*|\n",
    "|**max_cores_per_iteration**| Indicates how many cores on the compute target would be used to train a single pipeline.<br> Default is *1*, you can set it to *-1* to use all cores|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "automl_settings = {\n",
    "    \"max_time_sec\": 3600,\n",
    "    \"iterations\": 10,\n",
    "    \"n_cross_validations\": 5,\n",
    "    \"primary_metric\": 'AUC_weighted',\n",
    "    \"preprocess\": True,\n",
    "    \"max_cores_per_iteration\": 2\n",
    "}\n",
    "\n",
    "automl_config = AutoMLConfig(task = 'classification',\n",
    "                             path=project_folder,\n",
    "                             compute_target = dsvm_compute,\n",
    "                             data_script = project_folder + \"/get_data.py\",\n",
    "                             **automl_settings\n",
    "                            )\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the Model <a class=\"anchor\" id=\"Training-the-model-Remote-DSVM\"></a>\n",
    "\n",
    "For remote runs the execution is asynchronous, so you will see the iterations get populated as they complete. You can interact with the widgets/models even when the experiment is running to retreive the best model up to that point. Once you are satisfied with the model you can cancel a particular iteration or the whole run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "remote_run = experiment.submit(automl_config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploring the Results <a class=\"anchor\" id=\"Exploring-the-Results-Remote-DSVM\"></a>\n",
    "#### Widget for monitoring runs\n",
    "\n",
    "The widget will sit on \"loading\" until the first iteration completed, then you will see an auto-updating graph and table show up. It refreshed once per minute, so you should see the graph update as child runs complete.\n",
    "\n",
    "You can click on a pipeline to see run properties and output logs. Logs are also available on the DSVM under /tmp/azureml_run/{iterationid}/azureml-logs\n",
    "\n",
    "NOTE: The widget displays a link at the bottom. This links to a web-ui to explore the individual run details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.train.widgets import RunDetails\n",
    "RunDetails(remote_run).show() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "#### Retrieve All Child Runs\n",
    "You can also use sdk methods to fetch all the child runs and see individual metrics that we log. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "children = list(remote_run.get_children())\n",
    "metricslist = {}\n",
    "for run in children:\n",
    "    properties = run.get_properties()\n",
    "    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}    \n",
    "    metricslist[int(properties['iteration'])] = metrics\n",
    "\n",
    "rundata = pd.DataFrame(metricslist).sort_index(1)\n",
    "rundata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Canceling runs\n",
    "You can cancel ongoing remote runs using the *cancel()* and *cancel_iteration()* functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cancel the ongoing experiment and stop scheduling new iterations\n",
    "remote_run.cancel()\n",
    "\n",
    "# Cancel iteration 1 and move onto iteration 2\n",
    "# remote_run.cancel_iteration(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Retrieve the Best Model\n",
    "\n",
    "Below we select the best pipeline from our iterations. The *get_output* method on automl_classifier returns the best run and the fitted model for the last *fit* invocation. There are overloads on *get_output* that allow you to retrieve the best run and fitted model for *any* logged metric or a particular *iteration*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_run, fitted_model = remote_run.get_output()\n",
    "print(best_run)\n",
    "print(fitted_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Best Model based on any other metric"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# lookup_metric = \"accuracy\"\n",
    "# best_run, fitted_model = remote_run.get_output(metric=lookup_metric)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Model from a specific iteration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iteration = 0\n",
    "zero_run, zero_model = remote_run.get_output(iteration=iteration)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Register fitted model for deployment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "description = 'AutoML Model'\n",
    "tags = None\n",
    "remote_run.register_model(description=description, tags=tags)\n",
    "remote_run.model_id # Use this id to deploy the model as a web service in Azure"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Testing the Fitted Model <a class=\"anchor\" id=\"Testing-the-Fitted-Model-Remote-DSVM\"></a>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sklearn\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from pandas_ml import ConfusionMatrix\n",
    "\n",
    "df = pd.read_csv(\"https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv\",\n",
    "                     delimiter=\"\\t\", quotechar='\"')\n",
    "\n",
    "# get integer labels\n",
    "le = LabelEncoder()\n",
    "le.fit(df[\"Label\"].values)\n",
    "y = le.transform(df[\"Label\"].values)\n",
    "df = df.drop([\"Label\"], axis=1)\n",
    "\n",
    "_, df_test, _, y_test = train_test_split(df, y, test_size=0.1, random_state=42)\n",
    "\n",
    "\n",
    "ypred = fitted_model.predict(df_test.values)\n",
    "\n",
    "\n",
    "ypred_strings = le.inverse_transform(ypred)\n",
    "ytest_strings = le.inverse_transform(y_test)\n",
    "\n",
    "cm = ConfusionMatrix(ytest_strings, ypred_strings)\n",
    "\n",
    "print(cm)\n",
    "\n",
    "cm.plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "python36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}