MachineLearningNotebooks/automl/06.auto-ml-sparse-data-custom-cv-split.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
    "\n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# AutoML 06: Custom CV splits, handling sparse data\n",
    "\n",
    "In this example we use the scikit learn's [20newsgroup](In this example we use the scikit learn's [digit dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) to showcase how you can use AutoML for handling sparse data and specify custom cross validation splits.\n",
    "\n",
    "Make sure you have executed the [00.configuration](00.configuration.ipynb) before running this notebook.\n",
    "\n",
    "In this notebook you would see\n",
    "1. Creating an Experiment using an existing Workspace\n",
    "2. Instantiating AutoMLConfig\n",
    "4. Training the Model\n",
    "5. Exploring the results\n",
    "6. Testing the fitted model\n",
    "\n",
    "In addition this notebook showcases the following features\n",
    "- **Custom CV** splits \n",
    "- Handling **Sparse Data** in the input"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Experiment\n",
    "\n",
    "As part of the setup you have already created a <b>Workspace</b>. For AutoML you would need to create an <b>Experiment</b>. An <b>Experiment</b> is a named object in a <b>Workspace</b>, which is used to run experiments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "import os\n",
    "import random\n",
    "\n",
    "from matplotlib import pyplot as plt\n",
    "from matplotlib.pyplot import imshow\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn import datasets\n",
    "\n",
    "import azureml.core\n",
    "from azureml.core.experiment import Experiment\n",
    "from azureml.core.workspace import Workspace\n",
    "from azureml.train.automl import AutoMLConfig\n",
    "from azureml.train.automl.run import AutoMLRun"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ws = Workspace.from_config()\n",
    "\n",
    "# choose a name for the experiment\n",
    "experiment_name = 'automl-local-missing-data'\n",
    "# project folder\n",
    "project_folder = './sample_projects/automl-local-missing-data'\n",
    "\n",
    "experiment = Experiment(ws, experiment_name)\n",
    "\n",
    "output = {}\n",
    "output['SDK version'] = azureml.core.VERSION\n",
    "output['Subscription ID'] = ws.subscription_id\n",
    "output['Workspace'] = ws.name\n",
    "output['Resource Group'] = ws.resource_group\n",
    "output['Location'] = ws.location\n",
    "output['Project Directory'] = project_folder\n",
    "output['Experiment Name'] = experiment.name\n",
    "pd.set_option('display.max_colwidth', -1)\n",
    "pd.DataFrame(data=output, index=['']).T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Diagnostics\n",
    "\n",
    "Opt-in diagnostics for better experience, quality, and security of future releases"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.telemetry import set_diagnostics_collection\n",
    "set_diagnostics_collection(send_diagnostics=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Sparse Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_20newsgroups\n",
    "from sklearn.feature_extraction.text import HashingVectorizer\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "remove = ('headers', 'footers', 'quotes')\n",
    "categories = [\n",
    "    'alt.atheism',\n",
    "    'talk.religion.misc',\n",
    "    'comp.graphics',\n",
    "    'sci.space',\n",
    "]\n",
    "data_train = fetch_20newsgroups(subset='train', categories=categories,\n",
    "                                shuffle=True, random_state=42,\n",
    "                                remove=remove)\n",
    "\n",
    "X_train, X_validation, y_train, y_validation = train_test_split(data_train.data, data_train.target, test_size=0.33, random_state=42)\n",
    "\n",
    "\n",
    "vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False,\n",
    "                               n_features=2**16)\n",
    "X_train = vectorizer.transform(X_train)\n",
    "X_validation = vectorizer.transform(X_validation)\n",
    "\n",
    "summary_df = pd.DataFrame(index = ['No of Samples', 'No of Features'])\n",
    "summary_df['Train Set'] = [X_train.shape[0], X_train.shape[1]]\n",
    "summary_df['Validation Set'] = [X_validation.shape[0], X_validation.shape[1]]\n",
    "summary_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instantiate Auto ML Config\n",
    "\n",
    "This defines the settings and data used to run the experiment.\n",
    "\n",
    "|Property|Description|\n",
    "|-|-|\n",
    "|**task**|classification or regression|\n",
    "|**primary_metric**|This is the metric that you want to optimize.<br> Classification supports the following primary metrics <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>balanced_accuracy</i><br><i>average_precision_score_weighted</i><br><i>precision_score_weighted</i>|\n",
    "|**max_time_sec**|Time limit in seconds for each iteration|\n",
    "|**iterations**|Number of iterations. In each iteration Auto ML trains a specific pipeline with the data|\n",
    "|**preprocess**| *True/False* <br>Setting this to *True* enables Auto ML to perform preprocessing <br>on the input to handle *missing data*, and perform some common *feature extraction*<br>*Note: If input data is Sparse you cannot use preprocess=True*|\n",
    "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n",
    "|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]<br>Multi-class targets. An indicator matrix turns on multilabel classification.  This should be an array of integers. |\n",
    "|**X_valid**|(sparse) array-like, shape = [n_samples, n_features] for the custom Validation set|\n",
    "|**y_valid**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]<br>Multi-class targets. An indicator matrix turns on multilabel classification. for the custom Validation set|\n",
    "|**path**|Relative path to the project folder.  AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "automl_config = AutoMLConfig(task = 'classification',\n",
    "                             debug_log='automl_errors.log',\n",
    "                             primary_metric='AUC_weighted',\n",
    "                             max_time_sec=3600,\n",
    "                             iterations=5,\n",
    "                             preprocess=False,\n",
    "                             verbosity=logging.INFO,\n",
    "                             X = X_train, \n",
    "                             y = y_train,\n",
    "                             X_valid = X_validation, \n",
    "                             y_valid = y_validation, \n",
    "                             path=project_folder)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the Model\n",
    "\n",
    "You can call the submit method on the experiment object and pass the run configuration. For Local runs the execution is synchronous. Depending on the data and number of iterations this can run for while.\n",
    "You will see the currently running iterations printing to the console."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "local_run = experiment.submit(automl_config, show_output=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploring the results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Widget for monitoring runs\n",
    "\n",
    "The widget will sit on \"loading\" until the first iteration completed, then you will see an auto-updating graph and table show up. It refreshed once per minute, so you should see the graph update as child runs complete.\n",
    "\n",
    "NOTE: The widget displays a link at the bottom. This links to a web-ui to explore the individual run details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.train.widgets import RunDetails\n",
    "RunDetails(local_run).show() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "#### Retrieve All Child Runs\n",
    "You can also use sdk methods to fetch all the child runs and see individual metrics that we log. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "children = list(local_run.get_children())\n",
    "metricslist = {}\n",
    "for run in children:\n",
    "    properties = run.get_properties()\n",
    "    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}    \n",
    "    metricslist[int(properties['iteration'])] = metrics\n",
    "    \n",
    "rundata = pd.DataFrame(metricslist).sort_index(1)\n",
    "rundata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Retrieve the Best Model\n",
    "\n",
    "Below we select the best pipeline from our iterations. The *get_output* method on automl_classifier returns the best run and the fitted model for the last *fit* invocation. There are overloads on *get_output* that allow you to retrieve the best run and fitted model for *any* logged metric or a particular *iteration*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_run, fitted_model = local_run.get_output()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Best Model based on any other metric"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# lookup_metric = \"accuracy\"\n",
    "# best_run, fitted_model = local_run.get_output(metric=lookup_metric)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Model from a specific iteration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# iteration = 3\n",
    "# best_run, fitted_model = local_run.get_output(iteration=iteration)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Register fitted model for deployment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "description = 'AutoML Model'\n",
    "tags = None\n",
    "local_run.register_model(description=description, tags=tags)\n",
    "local_run.model_id # Use this id to deploy the model as a web service in Azure"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Testing the Fitted Model "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "digits = datasets.load_digits()### Testing the Fitted Model\n",
    "\n",
    "#### Load Test Data\n",
    "import sklearn\n",
    "from pandas_ml import ConfusionMatrix\n",
    "\n",
    "remove = ('headers', 'footers', 'quotes')\n",
    "categories = [\n",
    "    'alt.atheism',\n",
    "    'talk.religion.misc',\n",
    "    'comp.graphics',\n",
    "    'sci.space',\n",
    "]\n",
    "\n",
    "\n",
    "data_test = fetch_20newsgroups(subset='test', categories=categories,\n",
    "                                shuffle=True, random_state=42,\n",
    "                                remove=remove)\n",
    "\n",
    "vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False,\n",
    "                               n_features=2**16)\n",
    "\n",
    "X_test = vectorizer.transform(data_test.data)\n",
    "y_test = data_test.target\n",
    "\n",
    "#### Testing our best pipeline\n",
    "\n",
    "ypred = fitted_model.predict(X_test)\n",
    "ypred_strings = [categories[i] for i in ypred]\n",
    "ytest_strings = [categories[i] for i in y_test]\n",
    "\n",
    "cm = ConfusionMatrix(ytest_strings, ypred_strings)\n",
    "print(cm)\n",
    "cm.plot()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "python36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}