MachineLearningNotebooks/automl/13.auto-ml-dataprep.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
    "\n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# AutoML 13: Prepare Data using `azureml.dataprep`\n",
    "In this example we showcase how you can use `azureml.dataprep` SDK to load and prepare data for AutoML. `azureml.dataprep` can also be used standalone - full documentation can be found [here](https://github.com/Microsoft/PendletonDocs).\n",
    "\n",
    "Make sure you have executed the [setup](00.configuration.ipynb) before running this notebook.\n",
    "\n",
    "In this notebook you would see\n",
    "1. Defining data loading and preparation steps in a `Dataflow` using `azureml.dataprep`\n",
    "2. Passing the `Dataflow` to AutoML for local run\n",
    "3. Passing the `Dataflow` to AutoML for remote run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install `azureml.dataprep` SDK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please restart your kernel after the below installs.\n",
    "\n",
    "Tornado must be downgraded to a pre-5 version due to a known Tornado x Jupyter event loop bug."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --upgrade --extra-index-url https://dataprepdownloads.azureedge.net/pypi/autoML-BD0E9CABED27C837/0.1.1809.11043 azureml-dataprep --no-cache-dir --force-reinstall\n",
    "!pip install tornado==4.5.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Diagnostics\n",
    "\n",
    "Opt-in diagnostics for better experience, quality, and security of future releases"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.telemetry import set_diagnostics_collection\n",
    "set_diagnostics_collection(send_diagnostics=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Experiment\n",
    "\n",
    "As part of the setup you have already created a <b>Workspace</b>. For AutoML you would need to create an <b>Experiment</b>. An <b>Experiment</b> is a named object in a <b>Workspace</b>, which is used to run experiments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "import os\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "import azureml.core\n",
    "from azureml.core.compute import DsvmCompute\n",
    "from azureml.core.experiment import Experiment\n",
    "from azureml.core.runconfig import CondaDependencies\n",
    "from azureml.core.runconfig import RunConfiguration\n",
    "from azureml.core.workspace import Workspace\n",
    "import azureml.dataprep as dprep\n",
    "from azureml.train.automl import AutoMLConfig"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ws = Workspace.from_config()\n",
    " \n",
    "# choose a name for experiment\n",
    "experiment_name = 'automl-dataprep-classification'\n",
    "# project folder\n",
    "project_folder = './sample_projects/automl-dataprep-classification'\n",
    " \n",
    "experiment=Experiment(ws, experiment_name)\n",
    " \n",
    "output = {}\n",
    "output['SDK version'] = azureml.core.VERSION\n",
    "output['Subscription ID'] = ws.subscription_id\n",
    "output['Workspace Name'] = ws.name\n",
    "output['Resource Group'] = ws.resource_group\n",
    "output['Location'] = ws.location\n",
    "output['Project Directory'] = project_folder\n",
    "output['Experiment Name'] = experiment.name\n",
    "pd.set_option('display.max_colwidth', -1)\n",
    "pd.DataFrame(data = output, index = ['']).T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading Data using DataPrep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You can use `smart_read_file` which intelligently figures out delimiters and datatypes of a file\n",
    "# data pulled from sklearn.datasets.load_digits()\n",
    "simple_example_data_root = 'https://dprepdata.blob.core.windows.net/automl-notebook-data/'\n",
    "X = dprep.smart_read_file(simple_example_data_root + 'X.csv').skip(1)  # remove header\n",
    "\n",
    "# You can also use `read_csv` and `to_*` transformations to read (with overridable delimiter) \n",
    "# and convert column types manually. \n",
    "# Here we read a comma delimited file and convert all columns to integers.\n",
    "y = dprep.read_csv(simple_example_data_root + 'y.csv').to_long(dprep.ColumnSelector(term='.*', use_regex=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Review the Data Preparation Result\n",
    "\n",
    "You can peek the result of a Dataflow at any range using `skip(i)` and `head(j)`. Doing so evaluates only `j` records for all the steps in the Dataflow, which makes it fast even against large dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X.skip(1).head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instantiate AutoML Settings\n",
    "\n",
    "This creates a general Auto ML Settings applicable for both Local and Remote runs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "automl_settings = {\n",
    "    \"max_time_sec\": 600,\n",
    "    \"iterations\": 2,\n",
    "    \"primary_metric\": 'AUC_weighted',\n",
    "    \"preprocess\": False,\n",
    "    \"verbosity\": logging.INFO,\n",
    "    \"n_cross_validations\" : 3\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Local Run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pass data with Dataflows\n",
    "\n",
    "The `Dataflow` objects captured above can be passed to `submit` method for local run. AutoML will retrieve the results from the `Dataflow` for model training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "automl_config = AutoMLConfig(task = 'classification',\n",
    "                             debug_log = 'automl_errors.log',\n",
    "                             X = X,\n",
    "                             y = y,                             \n",
    "                             **automl_settings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "local_run = experiment.submit(automl_config, show_output=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Remote Run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create or Attach a Remote Linux DSVM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dsvm_name = 'mydsvm'\n",
    "try:\n",
    "    dsvm_compute = DsvmCompute(ws, dsvm_name)\n",
    "    print('found existing dsvm.')\n",
    "except:\n",
    "    print('creating new dsvm.')\n",
    "    dsvm_config = DsvmCompute.provisioning_configuration(vm_size = \"Standard_D2_v2\")\n",
    "    dsvm_compute = DsvmCompute.create(ws, name = dsvm_name, provisioning_configuration = dsvm_config)\n",
    "    dsvm_compute.wait_for_completion(show_output = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Update Conda Dependency file to have AutoML and DataPrep SDK\n",
    "\n",
    "Currently AutoML and DataPrep SDK is not installed with Azure ML SDK by default. Due to this we update the conda dependency file to add such dependencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cd = CondaDependencies()\n",
    "cd.set_pip_index_url(index_url=\"--index-url https://azuremlsdktestpypi.azureedge.net/sdk-release/master/588E708E0DF342C4A80BD954289657CF\")\n",
    "cd.set_pip_index_url(index_url=\"--extra-index-url https://dataprepdownloads.azureedge.net/pypi/autoML-BD0E9CABED27C837/0.1.1809.11043 --extra-index-url https://pypi.python.org/simple\")\n",
    "cd.remove_pip_package(pip_package=\"azureml-defaults\")\n",
    "cd.add_pip_package(pip_package='azureml-core')\n",
    "cd.add_pip_package(pip_package='azureml-telemetry')\n",
    "cd.add_pip_package(pip_package='azureml-train-automl')\n",
    "cd.add_pip_package(pip_package='azureml-dataprep')\n",
    "cd.add_pip_package(pip_package='tornado==4.5.1')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a RunConfiguration with DSVM name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "run_config = RunConfiguration(conda_dependencies=cd)\n",
    "run_config.target = dsvm_compute\n",
    "run_config.auto_prepare_environment = True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pass data with Dataflows\n",
    "\n",
    "The `Dataflow` objects captured above can also be passed to `submit` method for remote run. AutoML will serialize the `Dataflow` and send to remote compute target. The `Dataflow` will not be evaluated locally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "automl_config = AutoMLConfig(task = 'classification',\n",
    "                                debug_log = 'automl_errors.log',\n",
    "                                path=project_folder,\n",
    "                                run_configuration = run_config,\n",
    "                                X = X,\n",
    "                                y = y,\n",
    "                                **automl_settings)\n",
    "remote_run = experiment.submit(automl_config, show_output=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploring the results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Widget for monitoring runs\n",
    "\n",
    "The widget will sit on \"loading\" until the first iteration completed, then you will see an auto-updating graph and table show up. It refreshed once per minute, so you should see the graph update as child runs complete.\n",
    "\n",
    "NOTE: The widget displays a link at the bottom. This links to a web-ui to explore the individual run details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.train.widgets import RunDetails\n",
    "RunDetails(local_run).show() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Retrieve All Child Runs\n",
    "You can also use sdk methods to fetch all the child runs and see individual metrics that we log. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "children = list(local_run.get_children())\n",
    "metricslist = {}\n",
    "for run in children:\n",
    "    properties = run.get_properties()\n",
    "    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}    \n",
    "    metricslist[int(properties['iteration'])] = metrics\n",
    "    \n",
    "import pandas as pd\n",
    "rundata = pd.DataFrame(metricslist).sort_index(1)\n",
    "rundata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Retrieve the Best Model\n",
    "\n",
    "Below we select the best pipeline from our iterations. The *get_output* method on automl_classifier returns the best run and the fitted model for the last *fit* invocation. There are overloads on *get_output* that allow you to retrieve the best run and fitted model for *any* logged metric or a particular *iteration*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_run, fitted_model = local_run.get_output()\n",
    "print(best_run)\n",
    "print(fitted_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Best Model based on any other metric\n",
    "Give me the run and the model that has the smallest `log_loss`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lookup_metric = \"log_loss\"\n",
    "best_run, fitted_model = local_run.get_output(metric = lookup_metric)\n",
    "print(best_run)\n",
    "print(fitted_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Best Model based on any iteration\n",
    "Give me the run and the model from the 1st iteration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iteration = 0\n",
    "best_run, fitted_model = local_run.get_output(iteration = iteration)\n",
    "print(best_run)\n",
    "print(fitted_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Testing the Fitted Model \n",
    "\n",
    "#### Load Test Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "\n",
    "digits = datasets.load_digits()\n",
    "X_digits = digits.data[:10, :]\n",
    "y_digits = digits.target[:10]\n",
    "images = digits.images[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Testing our best pipeline\n",
    "We will try to predict 2 digits and see how our model works."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Randomly select digits and test\n",
    "from matplotlib import pyplot as plt\n",
    "from matplotlib.pyplot import imshow\n",
    "import random\n",
    "import numpy as np\n",
    "\n",
    "for index in np.random.choice(len(y_digits), 2):\n",
    "    print(index)\n",
    "    predicted = fitted_model.predict(X_digits[index:index + 1])[0]\n",
    "    label = y_digits[index]\n",
    "    title = \"Label value = %d  Predicted value = %d \" % ( label,predicted)\n",
    "    fig = plt.figure(1, figsize=(3,3))\n",
    "    ax1 = fig.add_axes((0,0,.8,.8))\n",
    "    ax1.set_title(title)\n",
    "    plt.imshow(images[index], cmap=plt.cm.gray_r, interpolation='nearest')\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Appendix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Capture the Dataflows to use for AutoML later\n",
    "\n",
    "`Dataflow` objects are immutable. Each of them is composed of a list of data preparation steps. A `Dataflow` can be branched at any point for further usage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sklearn.digits.data + target\n",
    "digits_complete = dprep.smart_read_file('https://dprepdata.blob.core.windows.net/automl-notebook-data/digits-complete.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`digits_complete` (sourced from `sklearn.datasets.load_digits()`)is forked into `dflow_X` to capture all the feature columns and `dflow_y` to capture the label column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "digits_complete.to_pandas_dataframe().shape\n",
    "labels_column = 'Column64'\n",
    "dflow_X = digits_complete.drop_columns(columns=[labels_column])\n",
    "dflow_y = digits_complete.keep_columns(columns=[labels_column])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}