MachineLearningNotebooks/automl/11.auto-ml-sample-weight.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
        "\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# AutoML 11: Sample Weight\n",
        "\n",
        "In this example we use the scikit-learn's [digit dataset](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) to showcase how you can use sample weight with AutoML. Sample weight is used where some sample values are more important than others.\n",
        "\n",
        "Make sure you have executed the [00.configuration](00.configuration.ipynb) before running this notebook.\n",
        "\n",
        "In this notebook you will learn how to configure AutoML to use `sample_weight` and you will see the difference sample weight makes to the test results.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Create an Experiment\n",
        "\n",
        "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import logging\n",
        "import os\n",
        "import random\n",
        "\n",
        "from matplotlib import pyplot as plt\n",
        "from matplotlib.pyplot import imshow\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "from sklearn import datasets\n",
        "\n",
        "import azureml.core\n",
        "from azureml.core.experiment import Experiment\n",
        "from azureml.core.workspace import Workspace\n",
        "from azureml.train.automl import AutoMLConfig\n",
        "from azureml.train.automl.run import AutoMLRun"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "\n",
        "# Choose names for the regular and the sample weight experiments.\n",
        "experiment_name = 'non_sample_weight_experiment'\n",
        "sample_weight_experiment_name = 'sample_weight_experiment'\n",
        "\n",
        "project_folder = './sample_projects/automl-local-classification'\n",
        "\n",
        "experiment = Experiment(ws, experiment_name)\n",
        "sample_weight_experiment=Experiment(ws, sample_weight_experiment_name)\n",
        "\n",
        "output = {}\n",
        "output['SDK version'] = azureml.core.VERSION\n",
        "output['Subscription ID'] = ws.subscription_id\n",
        "output['Workspace Name'] = ws.name\n",
        "output['Resource Group'] = ws.resource_group\n",
        "output['Location'] = ws.location\n",
        "output['Project Directory'] = project_folder\n",
        "output['Experiment Name'] = experiment.name\n",
        "pd.set_option('display.max_colwidth', -1)\n",
        "pd.DataFrame(data = output, index = ['']).T"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Diagnostics\n",
        "\n",
        "Opt-in diagnostics for better experience, quality, and security of future releases."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.telemetry import set_diagnostics_collection\n",
        "set_diagnostics_collection(send_diagnostics = True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Configure AutoML\n",
        "\n",
        "Instantiate two `AutoMLConfig` objects. One will be used with `sample_weight` and one without."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "digits = datasets.load_digits()\n",
        "X_train = digits.data[100:,:]\n",
        "y_train = digits.target[100:]\n",
        "\n",
        "# The example makes the sample weight 0.9 for the digit 4 and 0.1 for all other digits.\n",
        "# This makes the model more likely to classify as 4 if the image it not clear.\n",
        "sample_weight = np.array([(0.9 if x == 4 else 0.01) for x in y_train])\n",
        "\n",
        "automl_classifier = AutoMLConfig(task = 'classification',\n",
        "                                 debug_log = 'automl_errors.log',\n",
        "                                 primary_metric = 'AUC_weighted',\n",
        "                                 max_time_sec = 3600,\n",
        "                                 iterations = 10,\n",
        "                                 n_cross_validations = 2,\n",
        "                                 verbosity = logging.INFO,\n",
        "                                 X = X_train, \n",
        "                                 y = y_train,\n",
        "                                 path = project_folder)\n",
        "\n",
        "automl_sample_weight = AutoMLConfig(task = 'classification',\n",
        "                                    debug_log = 'automl_errors.log',\n",
        "                                    primary_metric = 'AUC_weighted',\n",
        "                                    max_time_sec = 3600,\n",
        "                                    iterations = 10,\n",
        "                                    n_cross_validations = 2,\n",
        "                                    verbosity = logging.INFO,\n",
        "                                    X = X_train, \n",
        "                                    y = y_train,\n",
        "                                    sample_weight = sample_weight,\n",
        "                                    path = project_folder)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Train the Models\n",
        "\n",
        "Call the `submit` method on the experiment objects and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.\n",
        "In this example, we specify `show_output = True` to print currently running iterations to the console."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "local_run = experiment.submit(automl_classifier, show_output = True)\n",
        "sample_weight_run = sample_weight_experiment.submit(automl_sample_weight, show_output = True)\n",
        "\n",
        "best_run, fitted_model = local_run.get_output()\n",
        "best_run_sample_weight, fitted_model_sample_weight = sample_weight_run.get_output()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Test the Best Fitted Model\n",
        "\n",
        "#### Load Test Data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "digits = datasets.load_digits()\n",
        "X_test = digits.data[:100, :]\n",
        "y_test = digits.target[:100]\n",
        "images = digits.images[:100]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Compare the Models\n",
        "The prediction from the sample weight model is more likely to correctly predict 4's.  However, it is also more likely to predict 4 for some images that are not labelled as 4."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Randomly select digits and test.\n",
        "for index in range(0,len(y_test)):\n",
        "    predicted = fitted_model.predict(X_test[index:index + 1])[0]\n",
        "    predicted_sample_weight = fitted_model_sample_weight.predict(X_test[index:index + 1])[0]\n",
        "    label = y_test[index]\n",
        "    if predicted == 4 or predicted_sample_weight == 4 or label == 4:\n",
        "        title = \"Label value = %d  Predicted value = %d Prediced with sample weight = %d\" % (label, predicted, predicted_sample_weight)\n",
        "        fig = plt.figure(1, figsize=(3,3))\n",
        "        ax1 = fig.add_axes((0,0,.8,.8))\n",
        "        ax1.set_title(title)\n",
        "        plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n",
        "        plt.show()"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "savitam"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}