{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
    "\n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Automated Machine Learning: Prepare Data using `azureml.dataprep` for Local Execution\n",
    "In this example we showcase how you can use the `azureml.dataprep` SDK to load and prepare data for AutoML. `azureml.dataprep` can also be used standalone; full documentation can be found [here](https://github.com/Microsoft/PendletonDocs).\n",
    "\n",
    "Make sure you have executed the [configuration](../configuration.ipynb) before running this notebook.\n",
    "\n",
    "In this notebook you will learn how to:\n",
    "1. Define data loading and preparation steps in a `Dataflow` using `azureml.dataprep`.\n",
    "2. Pass the `Dataflow` to AutoML for a local run.\n",
    "3. Pass the `Dataflow` to AutoML for a remote run."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Diagnostics\n",
    "\n",
    "Opt-in diagnostics for better experience, quality, and security of future releases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.telemetry import set_diagnostics_collection\n",
    "set_diagnostics_collection(send_diagnostics = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create an Experiment\n",
    "\n",
    "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "import os\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "import azureml.core\n",
    "from azureml.core.experiment import Experiment\n",
    "from azureml.core.workspace import Workspace\n",
    "import azureml.dataprep as dprep\n",
    "from azureml.train.automl import AutoMLConfig"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ws = Workspace.from_config()\n",
    " \n",
    "# choose a name for experiment\n",
    "experiment_name = 'automl-dataprep-local'\n",
    "# project folder\n",
    "project_folder = './sample_projects/automl-dataprep-local'\n",
    " \n",
    "experiment = Experiment(ws, experiment_name)\n",
    " \n",
    "output = {}\n",
    "output['SDK version'] = azureml.core.VERSION\n",
    "output['Subscription ID'] = ws.subscription_id\n",
    "output['Workspace Name'] = ws.name\n",
    "output['Resource Group'] = ws.resource_group\n",
    "output['Location'] = ws.location\n",
    "output['Project Directory'] = project_folder\n",
    "output['Experiment Name'] = experiment.name\n",
    "pd.set_option('display.max_colwidth', -1)\n",
    "pd.DataFrame(data = output, index = ['']).T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading Data using DataPrep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You can use `auto_read_file` which intelligently figures out delimiters and datatypes of a file.\n",
    "# The data referenced here was pulled from `sklearn.datasets.load_digits()`.\n",
    "simple_example_data_root = 'https://dprepdata.blob.core.windows.net/automl-notebook-data/'\n",
    "X = dprep.auto_read_file(simple_example_data_root + 'X.csv').skip(1)  # Remove the header row.\n",
    "\n",
    "# You can also use `read_csv` and `to_*` transformations to read (with overridable delimiter)\n",
    "# and convert column types manually.\n",
    "# Here we read a comma delimited file and convert all columns to integers.\n",
    "y = dprep.read_csv(simple_example_data_root + 'y.csv').to_long(dprep.ColumnSelector(term='.*', use_regex = True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Review the Data Preparation Result\n",
    "\n",
    "You can peek the result of a Dataflow at any range using `skip(i)` and `head(j)`. Doing so evaluates only `j` records for all the steps in the Dataflow, which makes it fast even against large datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X.skip(1).head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configure AutoML\n",
    "\n",
    "This creates a general AutoML settings object applicable for both local and remote runs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "automl_settings = {\n",
    "    \"iteration_timeout_minutes\" : 10,\n",
    "    \"iterations\" : 2,\n",
    "    \"primary_metric\" : 'AUC_weighted',\n",
    "    \"preprocess\" : False,\n",
    "    \"verbosity\" : logging.INFO,\n",
    "    \"n_cross_validations\": 3\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Local Run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pass Data with `Dataflow` Objects\n",
    "\n",
    "The `Dataflow` objects captured above can be passed to the `submit` method for a local run. AutoML will retrieve the results from the `Dataflow` for model training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "automl_config = AutoMLConfig(task = 'classification',\n",
    "                             debug_log = 'automl_errors.log',\n",
    "                             X = X,\n",
    "                             y = y,\n",
    "                             **automl_settings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "local_run = experiment.submit(automl_config, show_output = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore the Results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Widget for Monitoring Runs\n",
    "\n",
    "The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n",
    "\n",
    "**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.widgets import RunDetails\n",
    "RunDetails(local_run).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Retrieve All Child Runs\n",
    "You can also use SDK methods to fetch all the child runs and see individual metrics that we log."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "children = list(local_run.get_children())\n",
    "metricslist = {}\n",
    "for run in children:\n",
    "    properties = run.get_properties()\n",
    "    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
    "    metricslist[int(properties['iteration'])] = metrics\n",
    "    \n",
    "import pandas as pd\n",
    "rundata = pd.DataFrame(metricslist).sort_index(1)\n",
    "rundata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Retrieve the Best Model\n",
    "\n",
    "Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_run, fitted_model = local_run.get_output()\n",
    "print(best_run)\n",
    "print(fitted_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Best Model Based on Any Other Metric\n",
    "Show the run and the model that has the smallest `log_loss` value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lookup_metric = \"log_loss\"\n",
    "best_run, fitted_model = local_run.get_output(metric = lookup_metric)\n",
    "print(best_run)\n",
    "print(fitted_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Model from a Specific Iteration\n",
    "Show the run and the model from the first iteration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iteration = 0\n",
    "best_run, fitted_model = local_run.get_output(iteration = iteration)\n",
    "print(best_run)\n",
    "print(fitted_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test the Best Fitted Model\n",
    "\n",
    "#### Load Test Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "\n",
    "digits = datasets.load_digits()\n",
    "X_test = digits.data[:10, :]\n",
    "y_test = digits.target[:10]\n",
    "images = digits.images[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Testing Our Best Fitted Model\n",
    "We will try to predict 2 digits and see how our model works."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Randomly select digits and test\n",
    "from matplotlib import pyplot as plt\n",
    "from matplotlib.pyplot import imshow\n",
    "import random\n",
    "import numpy as np\n",
    "\n",
    "for index in np.random.choice(len(y_test), 2, replace = False):\n",
    "    print(index)\n",
    "    predicted = fitted_model.predict(X_test[index:index + 1])[0]\n",
    "    label = y_test[index]\n",
    "    title = \"Label value = %d  Predicted value = %d \" % (label, predicted)\n",
    "    fig = plt.figure(1, figsize=(3,3))\n",
    "    ax1 = fig.add_axes((0,0,.8,.8))\n",
    "    ax1.set_title(title)\n",
    "    plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Appendix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Capture the `Dataflow` Objects for Later Use in AutoML\n",
    "\n",
    "`Dataflow` objects are immutable and are composed of a list of data preparation steps. A `Dataflow` object can be branched at any point for further usage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sklearn.digits.data + target\n",
    "digits_complete = dprep.auto_read_file('https://dprepdata.blob.core.windows.net/automl-notebook-data/digits-complete.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`digits_complete` (sourced from `sklearn.datasets.load_digits()`) is forked into `dflow_X` to capture all the feature columns and `dflow_y` to capture the label column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "digits_complete.to_pandas_dataframe().shape\n",
    "labels_column = 'Column64'\n",
    "dflow_X = digits_complete.drop_columns(columns = [labels_column])\n",
    "dflow_y = digits_complete.keep_columns(columns = [labels_column])"
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "savitam"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "python36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}