MachineLearningNotebooks/tutorials/03.auto-train-models.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
    "\n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial: Automatically train a classification model with Azure Automated Machine Learning\n",
    "\n",
    "In this tutorial, you'll learn how to automatically generate a  machine learning model.  This model can then be deployed following the workflow in the [Deploy a model](tutorial-deploy-models-with-aml.md) tutorial.\n",
    "\n",
    "[flow diagram](imgs/nn.png)\n",
    "\n",
    "Similar to the [train models tutorial](tutorial-train-models-with-aml.md), this tutorial classifies handwritten images of digits (0-9) from the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset.\n",
    "\n",
    "You'll learn how to:\n",
    "\n",
    "> * Set up your development environment\n",
    "> * Access and examine the data\n",
    "> * Train using an automated classifier locally with custom parameters\n",
    "> * Explore the results\n",
    "> * Review training results\n",
    "> * Register the best model\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "Use [these instructions](https://aka.ms/aml-how-to-configure-environment) to:  \n",
    "* Create a workspace and its configuration file (**config.json**)  \n",
    "* Upload your **config.json** to the same folder as this notebook"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Start a notebook\n",
    "\n",
    "To follow along, start a new notebook from the same directory as **config.json** and copy the code from the sections below.\n",
    "\n",
    "\n",
    "## Set up your development environment\n",
    "\n",
    "All the setup for your development work can be accomplished in the Python notebook.  Setup includes:\n",
    "\n",
    "* Import Python packages\n",
    "* Configure a workspace to enable communication between your local computer and remote resources\n",
    "* Create a directory to store training scripts\n",
    "\n",
    "### Import packages\n",
    "Import Python packages you need in this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import azureml.core\n",
    "import pandas as pd\n",
    "from azureml.core.workspace import Workspace\n",
    "from azureml.train.automl.run import AutoMLRun\n",
    "import time\n",
    "import logging\n",
    "from sklearn import datasets\n",
    "from matplotlib import pyplot as plt\n",
    "from matplotlib.pyplot import imshow\n",
    "import random\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configure workspace\n",
    "\n",
    "Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **aml_config/config.json** and loads the details into an object named `ws`.  `ws` is used throughout the rest of the code in this tutorial.\n",
    "\n",
    "Once you have a workspace object, specify a name for the experiment and create and register a local directory with the workspace. The history of all runs is recorded under the specified experiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ws = Workspace.from_config()\n",
    "# choose a name for the run history container in the workspace\n",
    "experiment_name = 'automl-classifier'\n",
    "# project folder\n",
    "project_folder = './automl-classifier'\n",
    "\n",
    "import os\n",
    "\n",
    "output = {}\n",
    "output['SDK version'] = azureml.core.VERSION\n",
    "output['Subscription ID'] = ws.subscription_id\n",
    "output['Workspace'] = ws.name\n",
    "output['Resource Group'] = ws.resource_group\n",
    "output['Location'] = ws.location\n",
    "output['Project Directory'] = project_folder\n",
    "pd.set_option('display.max_colwidth', -1)\n",
    "pd.DataFrame(data=output, index=['']).T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore data\n",
    "\n",
    "The initial training tutorial used a high-resolution version  of the MNIST dataset (28x28 pixels).  Since auto training requires many iterations, this tutorial uses a smaller resolution version  of the images (8x8 pixels) to demonstrate the concepts while speeding up the time needed for each iteration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "\n",
    "digits = datasets.load_digits()\n",
    "\n",
    "# only take the first 100 rows if you want the training steps to run faster\n",
    "X_digits = digits.data[100:,:]\n",
    "y_digits = digits.target[100:]\n",
    "\n",
    "# use full dataset\n",
    "#X_digits = digits.data\n",
    "#y_digits = digits.target"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Display some sample images\n",
    "\n",
    "Load the data into `numpy` arrays. Then use `matplotlib` to plot 30 random images from the dataset with their labels above them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "count = 0\n",
    "sample_size = 30\n",
    "plt.figure(figsize = (16, 6))\n",
    "for i in np.random.permutation(X_digits.shape[0])[:sample_size]:\n",
    "    count = count + 1\n",
    "    plt.subplot(1, sample_size, count)\n",
    "    plt.axhline('')\n",
    "    plt.axvline('')\n",
    "    plt.text(x = 2, y = -2, s = y_digits[i], fontsize = 18)\n",
    "    plt.imshow(X_digits[i].reshape(8, 8), cmap = plt.cm.Greys)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You now have the necessary packages and data ready for auto training for your model. \n",
    "\n",
    "## Auto train a model \n",
    "\n",
    "To auto train a model, first define settings for autogeneration and tuning and then run the automatic classifier.\n",
    "\n",
    "\n",
    "### Define settings for autogeneration and tuning\n",
    "\n",
    "Define the experiment parameters and models settings for autogeneration and tuning.  \n",
    "\n",
    "\n",
    "|Property| Value in this tutorial |Description|\n",
    "|----|----|---|\n",
    "|**primary_metric**|AUC Weighted | Metric that you want to optimize.|\n",
    "|**max_time_sec**|12,000|Time limit in seconds for each iteration|\n",
    "|**iterations**|20|Number of iterations. In each iteration, the model trains with the data with a specific pipeline|\n",
    "|**n_cross_validations**|5|Number of cross validation splits|\n",
    "|**preprocess**|True| *True/False* Enables experiment to perform preprocessing on the input.  Preprocessing handles *missing data*, and performs some common *feature extraction*|\n",
    "|**exit_score**|0.994|*double* value indicating the target for *primary_metric*. Once the target is surpassed the run terminates|\n",
    "|**blacklist_algos**|['kNN','LinearSVM']|*Array* of *strings* indicating algorithms to ignore.\n",
    "|**concurrent_iterations**|5|Max number of iterations that would be executed in parallel. This number should be less than the number of cores on the DSVM. Used in remote training.|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.train.automl import AutoMLConfig\n",
    "\n",
    "##Local compute \n",
    "Automl_config = AutoMLConfig(task = 'classification',\n",
    "                             primary_metric = 'AUC_weighted',\n",
    "                             max_time_sec = 12000,\n",
    "                             iterations = 20,\n",
    "                             n_cross_validations = 3,\n",
    "                             blacklist_algos = ['kNN','LinearSVM'],\n",
    "                             X = X_digits,\n",
    "                             y = y_digits,\n",
    "                             path=project_folder)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run the automatic classifier\n",
    "\n",
    "Start the experiment to run locally. Define the compute target as local and set the output to true to view progress on the experiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core.experiment import Experiment\n",
    "experiment=Experiment(ws, experiment_name)\n",
    "local_run = experiment.submit(Automl_config, show_output=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore the results\n",
    "\n",
    "Explore the results of automatic training with a Jupyter widget or by examining the experiment history.\n",
    "\n",
    "### Jupyter widget\n",
    "\n",
    "Use the Jupyter notebook widget to see a graph and a table of all results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.train.widgets import RunDetails\n",
    "RunDetails(local_run).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Retrieve all iterations\n",
    "\n",
    "View the experiment history and see individual metrics for each iteration run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "children = list(local_run.get_children())\n",
    "metricslist = {}\n",
    "for run in children:\n",
    "    properties = run.get_properties()\n",
    "    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}   \n",
    "    metricslist[int(properties['iteration'])] = metrics\n",
    "\n",
    "import pandas as pd\n",
    "rundata = pd.DataFrame(metricslist).sort_index(1)\n",
    "rundata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# find the run with the highest accuracy value.\n",
    "best_run, fitted_model = local_run.get_output()\n",
    "\n",
    "# register model in workspace\n",
    "description = 'Automated Machine Learning Model'\n",
    "tags = None\n",
    "local_run.register_model(description=description, tags=tags)\n",
    "local_run.model_id # Use this id to deploy the model as a web service in Azure"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test the best model\n",
    "\n",
    "Use the model to predict a few random digits.  Display the predicted and the image.  Red font and inverse image (white on black) is used to highlight the misclassified samples.\n",
    "\n",
    "Since the model accuracy is high, you might have to run the following code a few times before you can see a misclassified sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# find 30 random samples from test set\n",
    "n = 30\n",
    "sample_indices = np.random.permutation(X_digits.shape[0])[0:n]\n",
    "test_samples = X_digits[sample_indices]\n",
    "\n",
    "\n",
    "# predict using the  model\n",
    "result = fitted_model.predict(test_samples)\n",
    "\n",
    "# compare actual value vs. the predicted values:\n",
    "i = 0\n",
    "plt.figure(figsize = (20, 1))\n",
    "\n",
    "for s in sample_indices:\n",
    "    plt.subplot(1, n, i + 1)\n",
    "    plt.axhline('')\n",
    "    plt.axvline('')\n",
    "    \n",
    "    # use different color for misclassified sample\n",
    "    font_color = 'red' if y_digits[s] != result[i] else 'black'\n",
    "    clr_map = plt.cm.gray if y_digits[s] != result[i] else plt.cm.Greys\n",
    "    \n",
    "    plt.text(x = 2, y = -2, s = result[i], fontsize = 18, color = font_color)\n",
    "    plt.imshow(X_digits[s].reshape(8, 8), cmap = clr_map)\n",
    "    \n",
    "    i = i + 1\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next steps\n",
    "\n",
    "In this Azure Machine Learning tutorial, you used Python to:\n",
    "\n",
    "> * Set up your development environment\n",
    "> * Access and examine the data\n",
    "> * Train using an automated classifier locally with custom parameters\n",
    "> * Explore the results\n",
    "> * Review training results\n",
    "> * Register the best model\n",
    "\n",
    "Learn more about [how to configure settings for automatic training]() or [how to use automatic training on a remote resource]()."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  },
  "msauthor": "sgilley"
 },
 "nbformat": 4,
 "nbformat_minor": 2
}