mirror of https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-20 01:27:06 -05:00
507 lines
14 KiB
Plaintext
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
        "\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Automated Machine Learning: Prepare Data using `azureml.dataprep` for Remote Execution (DSVM)\n",
        "In this example we showcase how you can use the `azureml.dataprep` SDK to load and prepare data for AutoML. `azureml.dataprep` can also be used standalone; full documentation can be found [here](https://github.com/Microsoft/PendletonDocs).\n",
        "\n",
        "Make sure you have executed the [configuration](../configuration.ipynb) before running this notebook.\n",
        "\n",
        "In this notebook you will learn how to:\n",
        "1. Define data loading and preparation steps in a `Dataflow` using `azureml.dataprep`.\n",
        "2. Pass the `Dataflow` to AutoML for a local run.\n",
        "3. Pass the `Dataflow` to AutoML for a remote run."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Compatibility\n",
        "\n",
        "Currently, Data Prep only supports __Ubuntu 16__ and __Red Hat Enterprise Linux 7__. We are working on supporting more Linux distributions."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Diagnostics\n",
        "\n",
        "Opt in to diagnostics for a better experience, and to improve the quality and security of future releases."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.telemetry import set_diagnostics_collection\n",
        "set_diagnostics_collection(send_diagnostics = True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Create an Experiment\n",
        "\n",
        "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import logging\n",
        "import os\n",
        "import time\n",
        "\n",
        "import pandas as pd\n",
        "\n",
        "import azureml.core\n",
        "from azureml.core.compute import DsvmCompute\n",
        "from azureml.core.experiment import Experiment\n",
        "from azureml.core.workspace import Workspace\n",
        "import azureml.dataprep as dprep\n",
        "from azureml.train.automl import AutoMLConfig"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "\n",
        "# choose a name for the experiment\n",
        "experiment_name = 'automl-dataprep-remote-dsvm'\n",
        "# project folder\n",
        "project_folder = './sample_projects/automl-dataprep-remote-dsvm'\n",
        "\n",
        "experiment = Experiment(ws, experiment_name)\n",
        "\n",
        "output = {}\n",
        "output['SDK version'] = azureml.core.VERSION\n",
        "output['Subscription ID'] = ws.subscription_id\n",
        "output['Workspace Name'] = ws.name\n",
        "output['Resource Group'] = ws.resource_group\n",
        "output['Location'] = ws.location\n",
        "output['Project Directory'] = project_folder\n",
        "output['Experiment Name'] = experiment.name\n",
        "pd.set_option('display.max_colwidth', -1)\n",
        "pd.DataFrame(data = output, index = ['']).T"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Loading Data using DataPrep"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# You can use `auto_read_file`, which intelligently figures out the delimiters and datatypes of a file.\n",
        "# The data referenced here was pulled from `sklearn.datasets.load_digits()`.\n",
        "simple_example_data_root = 'https://dprepdata.blob.core.windows.net/automl-notebook-data/'\n",
        "X = dprep.auto_read_file(simple_example_data_root + 'X.csv').skip(1)  # Remove the header row.\n",
        "\n",
        "# You can also use `read_csv` and `to_*` transformations to read (with an overridable delimiter)\n",
        "# and convert column types manually.\n",
        "# Here we read a comma-delimited file and convert all columns to integers.\n",
        "y = dprep.read_csv(simple_example_data_root + 'y.csv').to_long(dprep.ColumnSelector(term='.*', use_regex = True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Review the Data Preparation Result\n",
        "\n",
        "You can peek at the result of a Dataflow at any range using `skip(i)` and `head(j)`. Doing so evaluates only `j` records for all the steps in the Dataflow, which makes it fast even against large datasets."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "X.skip(1).head(5)"
      ]
    },
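    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The same kind of peek works for the label `Dataflow`. As a quick sanity check (a minimal sketch added here, using only the `head` call shown above), `head(5)` on `y` should show a single numeric column after the `to_long` conversion."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Peek at the label Dataflow to confirm the to_long conversion produced numeric values.\n",
        "y.head(5)"
      ]
    },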
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Configure AutoML\n",
        "\n",
        "This creates a general AutoML settings object that is applicable to both local and remote runs."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "automl_settings = {\n",
        "    \"iteration_timeout_minutes\" : 10,\n",
        "    \"iterations\" : 2,\n",
        "    \"primary_metric\" : 'AUC_weighted',\n",
        "    \"preprocess\" : False,\n",
        "    \"verbosity\" : logging.INFO,\n",
        "    \"n_cross_validations\": 3\n",
        "}"
      ]
    },
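    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Local Run\n",
        "\n",
        "The introduction above lists passing the `Dataflow` to AutoML for a local run. The following is a minimal sketch of one: it reuses the same `X`, `y`, and `automl_settings`, and assumes that omitting `run_configuration` makes AutoML execute on the local machine; `local_automl_config` and `local_run` are names introduced here for illustration."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# A sketch of a local run: same task, data, and settings, but no remote run configuration.\n",
        "local_automl_config = AutoMLConfig(task = 'classification',\n",
        "                                   debug_log = 'automl_errors.log',\n",
        "                                   path = project_folder,\n",
        "                                   X = X,\n",
        "                                   y = y,\n",
        "                                   **automl_settings)\n",
        "\n",
        "local_run = experiment.submit(local_automl_config, show_output = True)"
      ]
    },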
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Remote Run"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create or Attach a Remote Linux DSVM"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dsvm_name = 'mydsvmc'\n",
        "\n",
        "try:\n",
        "    # If the DSVM is still provisioning, wait until it is ready.\n",
        "    while ws.compute_targets[dsvm_name].provisioning_state == 'Creating':\n",
        "        time.sleep(1)\n",
        "\n",
        "    dsvm_compute = DsvmCompute(ws, dsvm_name)\n",
        "    print('Found existing DSVM.')\n",
        "except Exception:\n",
        "    print('Creating a new DSVM.')\n",
        "    dsvm_config = DsvmCompute.provisioning_configuration(vm_size = \"Standard_D2_v2\")\n",
        "    dsvm_compute = DsvmCompute.create(ws, name = dsvm_name, provisioning_configuration = dsvm_config)\n",
        "    dsvm_compute.wait_for_completion(show_output = True)\n",
        "    print(\"Waiting one minute for ssh to be accessible\")\n",
        "    time.sleep(60) # Wait for ssh to be accessible\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.runconfig import RunConfiguration\n",
        "from azureml.core.conda_dependencies import CondaDependencies\n",
        "\n",
        "# Create a run configuration targeting the DSVM, with the packages AutoML needs installed.\n",
        "conda_run_config = RunConfiguration(framework=\"python\")\n",
        "\n",
        "conda_run_config.target = dsvm_compute\n",
        "\n",
        "cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy'])\n",
        "conda_run_config.environment.python.conda_dependencies = cd"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Pass Data with `Dataflow` Objects\n",
        "\n",
        "The `Dataflow` objects captured above can also be passed to the `submit` method for a remote run. AutoML will serialize the `Dataflow` object and send it to the remote compute target. The `Dataflow` will not be evaluated locally."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "automl_config = AutoMLConfig(task = 'classification',\n",
        "                             debug_log = 'automl_errors.log',\n",
        "                             path = project_folder,\n",
        "                             run_configuration = conda_run_config,\n",
        "                             X = X,\n",
        "                             y = y,\n",
        "                             **automl_settings)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "remote_run = experiment.submit(automl_config, show_output = True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Explore the Results"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Widget for Monitoring Runs\n",
        "\n",
        "The widget will first report a \"loading\" status while running the first iteration. After the first iteration completes, an auto-updating graph and table will be shown. The widget refreshes once per minute, so you should see the graph update as child runs complete.\n",
        "\n",
        "**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.widgets import RunDetails\n",
        "RunDetails(remote_run).show()"
      ]
    },
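    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you prefer to block the notebook until all iterations finish rather than watch the widget, you can wait on the run directly. This is a sketch added here; it assumes the standard `Run.wait_for_completion` method, which the original sample does not call."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Block until the remote run (and all of its child iterations) completes.\n",
        "remote_run.wait_for_completion(show_output = True)"
      ]
    },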
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Retrieve All Child Runs\n",
        "You can also use SDK methods to fetch all the child runs and see the individual metrics that we log."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "children = list(remote_run.get_children())\n",
        "metricslist = {}\n",
        "for run in children:\n",
        "    properties = run.get_properties()\n",
        "    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
        "    metricslist[int(properties['iteration'])] = metrics\n",
        "\n",
        "import pandas as pd\n",
        "rundata = pd.DataFrame(metricslist).sort_index(axis = 1)\n",
        "rundata"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Retrieve the Best Model\n",
        "\n",
        "Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "best_run, fitted_model = remote_run.get_output()\n",
        "print(best_run)\n",
        "print(fitted_model)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Best Model Based on Any Other Metric\n",
        "Show the run and the model that has the smallest `log_loss` value:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "lookup_metric = \"log_loss\"\n",
        "best_run, fitted_model = remote_run.get_output(metric = lookup_metric)\n",
        "print(best_run)\n",
        "print(fitted_model)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Model from a Specific Iteration\n",
        "Show the run and the model from the first iteration:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "iteration = 0\n",
        "best_run, fitted_model = remote_run.get_output(iteration = iteration)\n",
        "print(best_run)\n",
        "print(fitted_model)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Test the Best Fitted Model\n",
        "\n",
        "#### Load Test Data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from sklearn import datasets\n",
        "\n",
        "digits = datasets.load_digits()\n",
        "X_test = digits.data[:10, :]\n",
        "y_test = digits.target[:10]\n",
        "images = digits.images[:10]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Testing Our Best Fitted Model\n",
        "We will try to predict 2 digits and see how our model works."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Randomly select digits and test.\n",
        "from matplotlib import pyplot as plt\n",
        "import numpy as np\n",
        "\n",
        "for index in np.random.choice(len(y_test), 2, replace = False):\n",
        "    print(index)\n",
        "    predicted = fitted_model.predict(X_test[index:index + 1])[0]\n",
        "    label = y_test[index]\n",
        "    title = \"Label value = %d  Predicted value = %d\" % (label, predicted)\n",
        "    fig = plt.figure(1, figsize=(3,3))\n",
        "    ax1 = fig.add_axes((0,0,.8,.8))\n",
        "    ax1.set_title(title)\n",
        "    plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n",
        "    plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Appendix"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Capture the `Dataflow` Objects for Later Use in AutoML\n",
        "\n",
        "`Dataflow` objects are immutable and are composed of a list of data preparation steps. A `Dataflow` object can be branched at any point for further usage."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# sklearn.digits.data + target\n",
        "digits_complete = dprep.auto_read_file('https://dprepdata.blob.core.windows.net/automl-notebook-data/digits-complete.csv')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`digits_complete` (sourced from `sklearn.datasets.load_digits()`) is forked into `dflow_X` to capture all the feature columns and `dflow_y` to capture the label column."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Print the shape explicitly; a bare expression mid-cell would be silently discarded.\n",
        "print(digits_complete.to_pandas_dataframe().shape)\n",
        "labels_column = 'Column64'\n",
        "dflow_X = digits_complete.drop_columns(columns = [labels_column])\n",
        "dflow_y = digits_complete.keep_columns(columns = [labels_column])"
      ]
    },
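    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The captured `Dataflow` objects can then be handed to AutoML exactly like `X` and `y` earlier in this notebook. The cell below is a sketch of that reuse: `appendix_automl_config` is a name introduced here for illustration, and the configuration simply mirrors the one used for the remote run above."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Reuse the branched Dataflows as AutoML inputs, mirroring the earlier remote configuration.\n",
        "appendix_automl_config = AutoMLConfig(task = 'classification',\n",
        "                                      debug_log = 'automl_errors.log',\n",
        "                                      path = project_folder,\n",
        "                                      run_configuration = conda_run_config,\n",
        "                                      X = dflow_X,\n",
        "                                      y = dflow_y,\n",
        "                                      **automl_settings)"
      ]
    }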
  ],
  "metadata": {
    "authors": [
      {
        "name": "savitam"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}