{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
        "\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Automated Machine Learning\n",
        "_**Remote Execution with DataStore**_\n",
        "\n",
        "## Contents\n",
        "1. [Introduction](#Introduction)\n",
        "1. [Setup](#Setup)\n",
        "1. [Data](#Data)\n",
        "1. [Train](#Train)\n",
        "1. [Results](#Results)\n",
        "1. [Test](#Test)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Introduction\n",
        "This sample accesses a data file on a remote DSVM through DataStore. Advantages of using data store are:\n",
        "1. DataStore secures the access details.\n",
        "2. DataStore supports read, write to blob and file store\n",
        "3. AutoML natively supports copying data from DataStore to DSVM\n",
        "\n",
        "Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n",
        "\n",
        "In this notebook you would see\n",
        "1. Storing data in DataStore.\n",
        "2. get_data returning data from DataStore."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Setup\n",
        "\n",
        "As part of the setup you have already created a <b>Workspace</b>. For AutoML you would need to create an <b>Experiment</b>. An <b>Experiment</b> is a named object in a <b>Workspace</b>, which is used to run experiments."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import logging\n",
        "import os\n",
        "import time\n",
        "\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "\n",
        "import azureml.core\n",
        "from azureml.core.compute import DsvmCompute\n",
        "from azureml.core.experiment import Experiment\n",
        "from azureml.core.workspace import Workspace\n",
        "from azureml.train.automl import AutoMLConfig"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "\n",
        "# choose a name for experiment\n",
        "experiment_name = 'automl-remote-datastore-file'\n",
        "# project folder\n",
        "project_folder = './sample_projects/automl-remote-datastore-file'\n",
        "\n",
        "experiment=Experiment(ws, experiment_name)\n",
        "\n",
        "output = {}\n",
        "output['SDK version'] = azureml.core.VERSION\n",
        "output['Subscription ID'] = ws.subscription_id\n",
        "output['Workspace'] = ws.name\n",
        "output['Resource Group'] = ws.resource_group\n",
        "output['Location'] = ws.location\n",
        "output['Project Directory'] = project_folder\n",
        "output['Experiment Name'] = experiment.name\n",
        "pd.set_option('display.max_colwidth', -1)\n",
        "outputDf = pd.DataFrame(data = output, index = [''])\n",
        "outputDf.T"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create a Remote Linux DSVM\n",
        "Note: If creation fails with a message about Marketplace purchase eligibilty, go to portal.azure.com, start creating DSVM there, and select \"Want to create programmatically\" to enable programmatic creation. Once you've enabled it, you can exit without actually creating VM.\n",
        "\n",
        "**Note**: By default SSH runs on port 22 and you don't need to specify it. But if for security reasons you can switch to a different port (such as 5022), you can append the port number to the address. [Read more](https://docs.microsoft.com/en-us/azure/virtual-machines/troubleshooting/detailed-troubleshoot-ssh-connection) on this."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "compute_target_name = 'mydsvmc'\n",
        "\n",
        "try:\n",
        "    while ws.compute_targets[compute_target_name].provisioning_state == 'Creating':\n",
        "        time.sleep(1)\n",
        "        \n",
        "    dsvm_compute = DsvmCompute(workspace=ws, name=compute_target_name)\n",
        "    print('found existing:', dsvm_compute.name)\n",
        "except:\n",
        "    dsvm_config = DsvmCompute.provisioning_configuration(vm_size=\"Standard_D2_v2\")\n",
        "    dsvm_compute = DsvmCompute.create(ws, name=compute_target_name, provisioning_configuration=dsvm_config)\n",
        "    dsvm_compute.wait_for_completion(show_output=True)\n",
        "    print(\"Waiting one minute for ssh to be accessible\")\n",
        "    time.sleep(90) # Wait for ssh to be accessible"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Data\n",
        "\n",
        "### Copy data file to local\n",
        "\n",
        "Download the data file.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "if not os.path.isdir('data'):\n",
        "    os.mkdir('data')    "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from sklearn.datasets import fetch_20newsgroups\n",
        "import csv\n",
        "\n",
        "remove = ('headers', 'footers', 'quotes')\n",
        "categories = [\n",
        "        'alt.atheism',\n",
        "        'talk.religion.misc',\n",
        "        'comp.graphics',\n",
        "        'sci.space',\n",
        "    ]\n",
        "data_train = fetch_20newsgroups(subset = 'train', categories = categories,\n",
        "                                    shuffle = True, random_state = 42,\n",
        "                                    remove = remove)\n",
        "    \n",
        "pd.DataFrame(data_train.data).to_csv(\"data/X_train.tsv\", index=False, header=False, quoting=csv.QUOTE_ALL, sep=\"\\t\")\n",
        "pd.DataFrame(data_train.target).to_csv(\"data/y_train.tsv\", index=False, header=False, sep=\"\\t\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Upload data to the cloud"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now make the data accessible remotely by uploading that data from your local machine into Azure so it can be accessed for remote training. The datastore is a convenient construct associated with your workspace for you to upload/download data, and interact with it from your remote compute targets. It is backed by Azure blob storage account.\n",
        "\n",
        "The data.tsv files are uploaded into a directory named data at the root of the datastore."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#blob_datastore = Datastore(ws, blob_datastore_name)\n",
        "ds = ws.get_default_datastore()\n",
        "print(ds.datastore_type, ds.account_name, ds.container_name)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# ds.upload_files(\"data.tsv\")\n",
        "ds.upload(src_dir='./data', target_path='data', overwrite=True, show_progress=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Configure & Run\n",
        "\n",
        "First let's create a DataReferenceConfigruation object to inform the system what data folder to download to the compute target.\n",
        "The path_on_compute should be an absolute path to ensure that the data files are downloaded only once.   The get_data method should use this same path to access the data files."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.runconfig import DataReferenceConfiguration\n",
        "dr = DataReferenceConfiguration(datastore_name=ds.name, \n",
        "                   path_on_datastore='data', \n",
        "                   path_on_compute='/tmp/azureml_runs',\n",
        "                   mode='download', # download files from datastore to compute target\n",
        "                   overwrite=False)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.runconfig import RunConfiguration\n",
        "from azureml.core.conda_dependencies import CondaDependencies\n",
        "\n",
        "# create a new RunConfig object\n",
        "conda_run_config = RunConfiguration(framework=\"python\")\n",
        "\n",
        "# Set compute target to the Linux DSVM\n",
        "conda_run_config.target = dsvm_compute\n",
        "# set the data reference of the run coonfiguration\n",
        "conda_run_config.data_references = {ds.name: dr}\n",
        "\n",
        "cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy'])\n",
        "conda_run_config.environment.python.conda_dependencies = cd"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create Get Data File\n",
        "For remote executions you should author a get_data.py file containing a get_data() function. This file should be in the root directory of the project. You can encapsulate code to read data either from a blob storage or local disk in this file.\n",
        "\n",
        "The *get_data()* function returns a [dictionary](README.md#getdata).\n",
        "\n",
        "The read_csv uses the path_on_compute value specified in the DataReferenceConfiguration call plus the path_on_datastore folder and then the actual file name."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "if not os.path.exists(project_folder):\n",
        "    os.makedirs(project_folder)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "%%writefile $project_folder/get_data.py\n",
        "\n",
        "import pandas as pd\n",
        "\n",
        "def get_data():\n",
        "    X_train = pd.read_csv(\"/tmp/azureml_runs/data/X_train.tsv\", delimiter=\"\\t\", header=None, quotechar='\"')\n",
        "    y_train = pd.read_csv(\"/tmp/azureml_runs/data/y_train.tsv\", delimiter=\"\\t\", header=None, quotechar='\"')\n",
        "\n",
        "    return { \"X\" : X_train.values, \"y\" : y_train[0].values }"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Train\n",
        "\n",
        "You can specify automl_settings as **kwargs** as well. Also note that you can use the get_data() symantic for local excutions too. \n",
        "\n",
        "<i>Note: For Remote DSVM and Batch AI you cannot pass Numpy arrays directly to AutoMLConfig.</i>\n",
        "\n",
        "|Property|Description|\n",
        "|-|-|\n",
        "|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|\n",
        "|**iteration_timeout_minutes**|Time limit in minutes for each iteration|\n",
        "|**iterations**|Number of iterations. In each iteration Auto ML trains a specific pipeline with the data|\n",
        "|**n_cross_validations**|Number of cross validation splits|\n",
        "|**max_concurrent_iterations**|Max number of iterations that would be executed in parallel.  This should be less than the number of cores on the DSVM\n",
        "|**preprocess**| *True/False* <br>Setting this to *True* enables Auto ML to perform preprocessing <br>on the input to handle *missing data*, and perform some common *feature extraction*|\n",
        "|**enable_cache**|Setting this to *True* enables preprocess done once and reuse the same preprocessed data for all the iterations. Default value is True.|\n",
        "|**max_cores_per_iteration**| Indicates how many cores on the compute target would be used to train a single pipeline.<br> Default is *1*, you can set it to *-1* to use all cores|"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "automl_settings = {\n",
        "    \"iteration_timeout_minutes\": 60,\n",
        "    \"iterations\": 4,\n",
        "    \"n_cross_validations\": 5,\n",
        "    \"primary_metric\": 'AUC_weighted',\n",
        "    \"preprocess\": True,\n",
        "    \"max_cores_per_iteration\": 1,\n",
        "    \"verbosity\": logging.INFO\n",
        "}\n",
        "automl_config = AutoMLConfig(task = 'classification',\n",
        "                             debug_log = 'automl_errors.log',\n",
        "                             path=project_folder,\n",
        "                             run_configuration=conda_run_config,\n",
        "                             #compute_target = dsvm_compute,\n",
        "                             data_script = project_folder + \"/get_data.py\",\n",
        "                             **automl_settings\n",
        "                            )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For remote runs the execution is asynchronous, so you will see the iterations get populated as they complete. You can interact with the widgets/models even when the experiment is running to retreive the best model up to that point. Once you are satisfied with the model you can cancel a particular iteration or the whole run."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "remote_run = experiment.submit(automl_config, show_output=False)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "remote_run"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Results\n",
        "#### Widget for monitoring runs\n",
        "\n",
        "The widget will sit on \"loading\" until the first iteration completed, then you will see an auto-updating graph and table show up. It refreshed once per minute, so you should see the graph update as child runs complete.\n",
        "\n",
        "You can click on a pipeline to see run properties and output logs. Logs are also available on the DSVM under /tmp/azureml_run/{iterationid}/azureml-logs\n",
        "\n",
        "NOTE: The widget displays a link at the bottom. This links to a web-ui to explore the individual run details."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.widgets import RunDetails\n",
        "RunDetails(remote_run).show() "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Wait until the run finishes.\n",
        "remote_run.wait_for_completion(show_output = True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n",
        "#### Retrieve All Child Runs\n",
        "You can also use sdk methods to fetch all the child runs and see individual metrics that we log. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "children = list(remote_run.get_children())\n",
        "metricslist = {}\n",
        "for run in children:\n",
        "    properties = run.get_properties()\n",
        "    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}    \n",
        "    metricslist[int(properties['iteration'])] = metrics\n",
        "\n",
        "rundata = pd.DataFrame(metricslist).sort_index(1)\n",
        "rundata"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Canceling Runs\n",
        "You can cancel ongoing remote runs using the *cancel()* and *cancel_iteration()* functions"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Cancel the ongoing experiment and stop scheduling new iterations\n",
        "# remote_run.cancel()\n",
        "\n",
        "# Cancel iteration 1 and move onto iteration 2\n",
        "# remote_run.cancel_iteration(1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Pre-process cache cleanup\n",
        "The preprocess data gets cache at user default file store. When the run is completed the cache can be cleaned by running below cell"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "remote_run.clean_preprocessor_cache()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Retrieve the Best Model\n",
        "\n",
        "Below we select the best pipeline from our iterations. The *get_output* method returns the best run and the fitted model. There are overloads on *get_output* that allow you to retrieve the best run and fitted model for *any* logged metric or a particular *iteration*."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "best_run, fitted_model = remote_run.get_output()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Best Model based on any other metric"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# lookup_metric = \"accuracy\"\n",
        "# best_run, fitted_model = remote_run.get_output(metric=lookup_metric)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Model from a specific iteration"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# iteration = 1\n",
        "# best_run, fitted_model = remote_run.get_output(iteration=iteration)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Test\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Load test data.\n",
        "from pandas_ml import ConfusionMatrix\n",
        "\n",
        "data_test = fetch_20newsgroups(subset = 'test', categories = categories,\n",
        "                               shuffle = True, random_state = 42,\n",
        "                               remove = remove)\n",
        "\n",
        "X_test = np.array(data_test.data).reshape((len(data_test.data),1))\n",
        "y_test = data_test.target\n",
        "\n",
        "# Test our best pipeline.\n",
        "\n",
        "y_pred = fitted_model.predict(X_test)\n",
        "y_pred_strings = [data_test.target_names[i] for i in y_pred]\n",
        "y_test_strings = [data_test.target_names[i] for i in y_test]\n",
        "\n",
        "cm = ConfusionMatrix(y_test_strings, y_pred_strings)\n",
        "print(cm)\n",
        "cm.plot()"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "savitam"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.6"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}