MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/sample-weight/auto-ml-sample-weight.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
        "\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Automated Machine Learning\n",
        "_**Sample Weight**_\n",
        "\n",
        "## Contents\n",
        "1. [Introduction](#Introduction)\n",
        "1. [Setup](#Setup)\n",
        "1. [Train](#Train)\n",
        "1. [Test](#Test)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Introduction\n",
        "In this example we use the scikit-learn's [digit dataset](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) to showcase how you can use sample weight with AutoML. Sample weight is used where some sample values are more important than others.\n",
        "\n",
        "Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n",
        "\n",
        "In this notebook you will learn how to configure AutoML to use `sample_weight` and you will see the difference sample weight makes to the test results."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Setup\n",
        "\n",
        "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import logging\n",
        "\n",
        "from matplotlib import pyplot as plt\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "from sklearn import datasets\n",
        "\n",
        "import azureml.core\n",
        "from azureml.core.experiment import Experiment\n",
        "from azureml.core.workspace import Workspace\n",
        "from azureml.train.automl import AutoMLConfig"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "\n",
        "# Choose names for the regular and the sample weight experiments.\n",
        "experiment_name = 'non_sample_weight_experiment'\n",
        "sample_weight_experiment_name = 'sample_weight_experiment'\n",
        "\n",
        "project_folder = './sample_projects/sample_weight'\n",
        "\n",
        "experiment = Experiment(ws, experiment_name)\n",
        "sample_weight_experiment=Experiment(ws, sample_weight_experiment_name)\n",
        "\n",
        "output = {}\n",
        "output['SDK version'] = azureml.core.VERSION\n",
        "output['Subscription ID'] = ws.subscription_id\n",
        "output['Workspace Name'] = ws.name\n",
        "output['Resource Group'] = ws.resource_group\n",
        "output['Location'] = ws.location\n",
        "output['Project Directory'] = project_folder\n",
        "output['Experiment Name'] = experiment.name\n",
        "pd.set_option('display.max_colwidth', -1)\n",
        "outputDf = pd.DataFrame(data = output, index = [''])\n",
        "outputDf.T"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Train\n",
        "\n",
        "Instantiate two `AutoMLConfig` objects. One will be used with `sample_weight` and one without."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "digits = datasets.load_digits()\n",
        "X_train = digits.data[100:,:]\n",
        "y_train = digits.target[100:]\n",
        "\n",
        "# The example makes the sample weight 0.9 for the digit 4 and 0.1 for all other digits.\n",
        "# This makes the model more likely to classify as 4 if the image it not clear.\n",
        "sample_weight = np.array([(0.9 if x == 4 else 0.01) for x in y_train])\n",
        "\n",
        "automl_classifier = AutoMLConfig(task = 'classification',\n",
        "                                 debug_log = 'automl_errors.log',\n",
        "                                 primary_metric = 'AUC_weighted',\n",
        "                                 iteration_timeout_minutes = 60,\n",
        "                                 iterations = 10,\n",
        "                                 n_cross_validations = 2,\n",
        "                                 verbosity = logging.INFO,\n",
        "                                 X = X_train, \n",
        "                                 y = y_train,\n",
        "                                 path = project_folder)\n",
        "\n",
        "automl_sample_weight = AutoMLConfig(task = 'classification',\n",
        "                                    debug_log = 'automl_errors.log',\n",
        "                                    primary_metric = 'AUC_weighted',\n",
        "                                    iteration_timeout_minutes = 60,\n",
        "                                    iterations = 10,\n",
        "                                    n_cross_validations = 2,\n",
        "                                    verbosity = logging.INFO,\n",
        "                                    X = X_train, \n",
        "                                    y = y_train,\n",
        "                                    sample_weight = sample_weight,\n",
        "                                    path = project_folder)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Call the `submit` method on the experiment objects and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.\n",
        "In this example, we specify `show_output = True` to print currently running iterations to the console."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "local_run = experiment.submit(automl_classifier, show_output = True)\n",
        "sample_weight_run = sample_weight_experiment.submit(automl_sample_weight, show_output = True)\n",
        "\n",
        "best_run, fitted_model = local_run.get_output()\n",
        "best_run_sample_weight, fitted_model_sample_weight = sample_weight_run.get_output()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Test\n",
        "\n",
        "#### Load Test Data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "digits = datasets.load_digits()\n",
        "X_test = digits.data[:100, :]\n",
        "y_test = digits.target[:100]\n",
        "images = digits.images[:100]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Compare the Models\n",
        "The prediction from the sample weight model is more likely to correctly predict 4's.  However, it is also more likely to predict 4 for some images that are not labelled as 4."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Randomly select digits and test.\n",
        "for index in range(0,len(y_test)):\n",
        "    predicted = fitted_model.predict(X_test[index:index + 1])[0]\n",
        "    predicted_sample_weight = fitted_model_sample_weight.predict(X_test[index:index + 1])[0]\n",
        "    label = y_test[index]\n",
        "    if predicted == 4 or predicted_sample_weight == 4 or label == 4:\n",
        "        title = \"Label value = %d  Predicted value = %d Prediced with sample weight = %d\" % (label, predicted, predicted_sample_weight)\n",
        "        fig = plt.figure(1, figsize=(3,3))\n",
        "        ax1 = fig.add_axes((0,0,.8,.8))\n",
        "        ax1.set_title(title)\n",
        "        plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n",
        "        plt.show()"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "savitam"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}