MachineLearningNotebooks/how-to-use-azureml/explain-model/azure-integration/run-history/save-retrieve-explanations-run-history.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
        "\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/explain-model/azure-integration/run-history/save-retrieve-explanations-run-history.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Save and retrieve explanations via Azure Machine Learning Run History\n",
        "\n",
        "_**This notebook showcases how to use the Azure Machine Learning Interpretability SDK to save and retrieve classification model explanations to/from Azure Machine Learning Run History.**_\n",
        "\n",
        "\n",
        "## Table of Contents\n",
        "\n",
        "1. [Introduction](#Introduction)\n",
        "1. [Setup](#Setup)\n",
        "1. [Run model explainer locally at training time](#Explain)\n",
        "    1. Apply feature transformations\n",
        "    1. Train a binary classification model\n",
        "    1. Explain the model on raw features\n",
        "        1. Generate global explanations\n",
        "        1. Generate local explanations\n",
        "1. [Upload model explanations to Azure Machine Learning Run History](#Upload)\n",
        "1. [Download model explanations from Azure Machine Learning Run History](#Download)\n",
        "1. [Visualize explanations](#Visualize)\n",
        "1. [Next steps](#Next)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Introduction\n",
        "\n",
        "This notebook showcases how to explain a classification model predictions locally at training time, upload explanations to the Azure Machine Learning's run history, and download previously-uploaded explanations from the Run History.\n",
        "It demonstrates the API calls that you need to make to upload/download the global and local explanations and a visualization dashboard that provides an interactive way of discovering patterns in data and downloaded explanations.\n",
        "\n",
        "We will showcase three tabular data explainers: TabularExplainer (SHAP), MimicExplainer (global surrogate), and PFIExplainer.\n",
        "\n",
        "\n",
        "\n",
        "Problem: IBM employee attrition classification with scikit-learn (run model explainer locally and upload explanation to the Azure Machine Learning Run History)\n",
        "\n",
        "1. Train a SVM classification model using Scikit-learn\n",
        "2. Run 'explain-model-sample' with AML Run History, which leverages run history service to store and manage the explanation data\n",
        "---\n",
        "\n",
        "Setup: If you are using Jupyter notebooks, the extensions should be installed automatically with the package.\n",
        "If you are using Jupyter Labs run the following command:\n",
        "```\n",
        "(myenv) $ jupyter labextension install @jupyter-widgets/jupyterlab-manager\n",
        "```\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Explain\n",
        "\n",
        "### Run model explainer locally at training time"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from sklearn.pipeline import Pipeline\n",
        "from sklearn.impute import SimpleImputer\n",
        "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
        "from sklearn.svm import SVC\n",
        "import pandas as pd\n",
        "import numpy as np\n",
        "\n",
        "# Explainers:\n",
        "# 1. SHAP Tabular Explainer\n",
        "from interpret.ext.blackbox import TabularExplainer\n",
        "\n",
        "# OR\n",
        "\n",
        "# 2. Mimic Explainer\n",
        "from interpret.ext.blackbox import MimicExplainer\n",
        "# You can use one of the following four interpretable models as a global surrogate to the black box model\n",
        "from interpret.ext.glassbox import LGBMExplainableModel\n",
        "from interpret.ext.glassbox import LinearExplainableModel\n",
        "from interpret.ext.glassbox import SGDExplainableModel\n",
        "from interpret.ext.glassbox import DecisionTreeExplainableModel\n",
        "\n",
        "# OR\n",
        "\n",
        "# 3. PFI Explainer\n",
        "from interpret.ext.blackbox import PFIExplainer "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Load the IBM employee attrition data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Get the IBM employee attrition dataset\n",
        "outdirname = 'dataset.6.21.19'\n",
        "try:\n",
        "    from urllib import urlretrieve\n",
        "except ImportError:\n",
        "    from urllib.request import urlretrieve\n",
        "import zipfile\n",
        "zipfilename = outdirname + '.zip'\n",
        "urlretrieve('https://publictestdatasets.blob.core.windows.net/data/' + zipfilename, zipfilename)\n",
        "with zipfile.ZipFile(zipfilename, 'r') as unzip:\n",
        "    unzip.extractall('.')\n",
        "attritionData = pd.read_csv('./WA_Fn-UseC_-HR-Employee-Attrition.csv')\n",
        "\n",
        "# Dropping Employee count as all values are 1 and hence attrition is independent of this feature\n",
        "attritionData = attritionData.drop(['EmployeeCount'], axis=1)\n",
        "# Dropping Employee Number since it is merely an identifier\n",
        "attritionData = attritionData.drop(['EmployeeNumber'], axis=1)\n",
        "\n",
        "attritionData = attritionData.drop(['Over18'], axis=1)\n",
        "\n",
        "# Since all values are 80\n",
        "attritionData = attritionData.drop(['StandardHours'], axis=1)\n",
        "\n",
        "# Converting target variables from string to numerical values\n",
        "target_map = {'Yes': 1, 'No': 0}\n",
        "attritionData[\"Attrition_numerical\"] = attritionData[\"Attrition\"].apply(lambda x: target_map[x])\n",
        "target = attritionData[\"Attrition_numerical\"]\n",
        "\n",
        "attritionXData = attritionData.drop(['Attrition_numerical', 'Attrition'], axis=1)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Split data into train and test\n",
        "from sklearn.model_selection import train_test_split\n",
        "x_train, x_test, y_train, y_test = train_test_split(attritionXData, \n",
        "                                                    target, \n",
        "                                                    test_size=0.2,\n",
        "                                                    random_state=0,\n",
        "                                                    stratify=target)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Creating dummy columns for each categorical feature\n",
        "categorical = []\n",
        "for col, value in attritionXData.iteritems():\n",
        "    if value.dtype == 'object':\n",
        "        categorical.append(col)\n",
        "        \n",
        "# Store the numerical columns in a list numerical\n",
        "numerical = attritionXData.columns.difference(categorical)        "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Transform raw features"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can explain raw features by either using a `sklearn.compose.ColumnTransformer` or a list of fitted transformer tuples. The cell below uses `sklearn.compose.ColumnTransformer`. In case you want to run the example with the list of fitted transformer tuples, comment the cell below and uncomment the cell that follows after. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from sklearn.compose import ColumnTransformer\n",
        "\n",
        "# We create the preprocessing pipelines for both numeric and categorical data.\n",
        "numeric_transformer = Pipeline(steps=[\n",
        "    ('imputer', SimpleImputer(strategy='median')),\n",
        "    ('scaler', StandardScaler())])\n",
        "\n",
        "categorical_transformer = Pipeline(steps=[\n",
        "    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),\n",
        "    ('onehot', OneHotEncoder(handle_unknown='ignore'))])\n",
        "\n",
        "transformations = ColumnTransformer(\n",
        "    transformers=[\n",
        "        ('num', numeric_transformer, numerical),\n",
        "        ('cat', categorical_transformer, categorical)])\n",
        "\n",
        "# Append classifier to preprocessing pipeline.\n",
        "# Now we have a full prediction pipeline.\n",
        "clf = Pipeline(steps=[('preprocessor', transformations),\n",
        "                      ('classifier', SVC(C=1.0, probability=True))])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "'''\n",
        "# Uncomment below if sklearn-pandas is not installed\n",
        "#!pip install sklearn-pandas\n",
        "from sklearn_pandas import DataFrameMapper\n",
        "\n",
        "# Impute, standardize the numeric features and one-hot encode the categorical features.    \n",
        "\n",
        "\n",
        "numeric_transformations = [([f], Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])) for f in numerical]\n",
        "\n",
        "categorical_transformations = [([f], OneHotEncoder(handle_unknown='ignore', sparse=False)) for f in categorical]\n",
        "\n",
        "transformations = numeric_transformations + categorical_transformations\n",
        "\n",
        "# Append classifier to preprocessing pipeline.\n",
        "# Now we have a full prediction pipeline.\n",
        "clf = Pipeline(steps=[('preprocessor', transformations),\n",
        "                      ('classifier', SVC(C=1.0, probability=True))]) \n",
        "\n",
        "\n",
        "\n",
        "'''"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Train a SVM classification model, which you want to explain"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "model = clf.fit(x_train, y_train)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Explain predictions on your local machine"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# 1. Using SHAP TabularExplainer\n",
        "# clf.steps[-1][1] returns the trained classification model\n",
        "explainer = TabularExplainer(clf.steps[-1][1], \n",
        "                             initialization_examples=x_train, \n",
        "                             features=attritionXData.columns, \n",
        "                             classes=[\"Not leaving\", \"leaving\"], \n",
        "                             transformations=transformations)\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "# 2. Using MimicExplainer\n",
        "# augment_data is optional and if true, oversamples the initialization examples to improve surrogate model accuracy to fit original model.  Useful for high-dimensional data where the number of rows is less than the number of columns. \n",
        "# max_num_of_augmentations is optional and defines max number of times we can increase the input data size.\n",
        "# LGBMExplainableModel can be replaced with LinearExplainableModel, SGDExplainableModel, or DecisionTreeExplainableModel\n",
        "# explainer = MimicExplainer(clf.steps[-1][1], \n",
        "#                            x_train, \n",
        "#                            LGBMExplainableModel, \n",
        "#                            augment_data=True, \n",
        "#                            max_num_of_augmentations=10, \n",
        "#                            features=attritionXData.columns, \n",
        "#                            classes=[\"Not leaving\", \"leaving\"], \n",
        "#                            transformations=transformations)\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "# 3. Using PFIExplainer\n",
        "\n",
        "# Use the parameter \"metric\" to pass a metric name or function to evaluate the permutation. \n",
        "# Note that if a metric function is provided a higher value must be better.\n",
        "# Otherwise, take the negative of the function or set the parameter \"is_error_metric\" to True.\n",
        "# Default metrics: \n",
        "# F1 Score for binary classification, F1 Score with micro average for multiclass classification and\n",
        "# Mean absolute error for regression\n",
        "\n",
        "# explainer = PFIExplainer(clf.steps[-1][1], \n",
        "#                          features=x_train.columns, \n",
        "#                          transformations=transformations,\n",
        "#                          classes=[\"Not leaving\", \"leaving\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Generate global explanations\n",
        "Explain overall model predictions (global explanation)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Passing in test dataset for evaluation examples - note it must be a representative sample of the original data\n",
        "# x_train can be passed as well, but with more examples explanations will take longer although they may be more accurate\n",
        "global_explanation = explainer.explain_global(x_test)\n",
        "\n",
        "# Note: if you used the PFIExplainer in the previous step, use the next line of code instead\n",
        "# global_explanation = explainer.explain_global(x_test, true_labels=y_test)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Sorted SHAP values\n",
        "print('ranked global importance values: {}'.format(global_explanation.get_ranked_global_values()))\n",
        "# Corresponding feature names\n",
        "print('ranked global importance names: {}'.format(global_explanation.get_ranked_global_names()))\n",
        "# Feature ranks (based on original order of features)\n",
        "print('global importance rank: {}'.format(global_explanation.global_importance_rank))\n",
        "\n",
        "# Note: PFIExplainer does not support per class explanations\n",
        "# Per class feature names\n",
        "print('ranked per class feature names: {}'.format(global_explanation.get_ranked_per_class_names()))\n",
        "# Per class feature importance values\n",
        "print('ranked per class feature values: {}'.format(global_explanation.get_ranked_per_class_values()))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Print out a dictionary that holds the sorted feature importance names and values\n",
        "print('global importance rank: {}'.format(global_explanation.get_feature_importance_dict()))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Explain overall model predictions as a collection of local (instance-level) explanations"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Feature shap values for all features and all data points in the training data\n",
        "print('local importance values: {}'.format(global_explanation.local_importance_values))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Generate local explanations\n",
        "Explain local data points (individual instances)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Note: PFIExplainer does not support local explanations\n",
        "# You can pass a specific data point or a group of data points to the explain_local function\n",
        "\n",
        "# E.g., Explain the first data point in the test set\n",
        "instance_num = 1\n",
        "local_explanation = explainer.explain_local(x_test[:instance_num])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Get the prediction for the first member of the test set and explain why model made that prediction\n",
        "prediction_value = clf.predict(x_test)[instance_num]\n",
        "\n",
        "sorted_local_importance_values = local_explanation.get_ranked_local_values()[prediction_value]\n",
        "sorted_local_importance_names = local_explanation.get_ranked_local_names()[prediction_value]\n",
        "\n",
        "print('local importance values: {}'.format(sorted_local_importance_values))\n",
        "print('local importance names: {}'.format(sorted_local_importance_names))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Upload\n",
        "Upload explanations to Azure Machine Learning Run History"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import azureml.core\n",
        "from azureml.core import Workspace, Experiment\n",
        "from azureml.interpret import ExplanationClient\n",
        "# Check core SDK version number\n",
        "print(\"SDK version:\", azureml.core.VERSION)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "print('Workspace name: ' + ws.name, \n",
        "      'Azure region: ' + ws.location, \n",
        "      'Subscription id: ' + ws.subscription_id, \n",
        "      'Resource group: ' + ws.resource_group, sep = '\\n')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "experiment_name = 'explain-model-sample'\n",
        "experiment = Experiment(ws, experiment_name)\n",
        "run = experiment.start_logging()\n",
        "client = ExplanationClient.from_run(run)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Uploading model explanation data for storage or visualization in webUX\n",
        "# The explanation can then be downloaded on any compute\n",
        "# Multiple explanations can be uploaded\n",
        "client.upload_model_explanation(global_explanation, comment='global explanation: all features')\n",
        "# Or you can only upload the explanation object with the top k feature info\n",
        "#client.upload_model_explanation(global_explanation, top_k=2, comment='global explanation: Only top 2 features')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Uploading model explanation data for storage or visualization in webUX\n",
        "# The explanation can then be downloaded on any compute\n",
        "# Multiple explanations can be uploaded\n",
        "client.upload_model_explanation(local_explanation, comment='local explanation for test point 1: all features')\n",
        "\n",
        "# Alterntively, you can only upload the local explanation object with the top k feature info\n",
        "#client.upload_model_explanation(local_explanation, top_k=2, comment='local explanation: top 2 features')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Download\n",
        "Download explanations from Azure Machine Learning Run History"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# List uploaded explanations\n",
        "client.list_model_explanations()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "for explanation in client.list_model_explanations():\n",
        "    \n",
        "    if explanation['comment'] == 'local explanation for test point 1: all features':\n",
        "        downloaded_local_explanation = client.download_model_explanation(explanation_id=explanation['id'])\n",
        "        # You can pass a k value to only download the top k feature importance values\n",
        "        downloaded_local_explanation_top2 = client.download_model_explanation(top_k=2, explanation_id=explanation['id'])\n",
        "    \n",
        "    \n",
        "    elif explanation['comment'] == 'global explanation: all features':\n",
        "        downloaded_global_explanation = client.download_model_explanation(explanation_id=explanation['id'])\n",
        "        # You can pass a k value to only download the top k feature importance values\n",
        "        downloaded_global_explanation_top2 = client.download_model_explanation(top_k=2, explanation_id=explanation['id'])\n",
        "    "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Visualize\n",
        "Load the visualization dashboard"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from interpret_community.widget import ExplanationDashboard"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ExplanationDashboard(downloaded_global_explanation, model, datasetX=x_test)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## End\n",
        "Complete the run"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.complete()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Next\n",
        "Learn about other use cases of the explain package on a:\n",
        "1. [Training time: regression problem](https://github.com/interpretml/interpret-community/blob/master/notebooks/explain-regression-local.ipynb)       \n",
        "1. [Training time: binary classification problem](https://github.com/interpretml/interpret-community/blob/master/notebooks/explain-binary-classification-local.ipynb)\n",
        "1. [Training time: multiclass classification problem](https://github.com/interpretml/interpret-community/blob/master/notebooks/explain-multiclass-classification-local.ipynb)\n",
        "1. Explain models with engineered features:\n",
        "    1. [Simple feature transformations](https://github.com/interpretml/interpret-community/blob/master/notebooks/simple-feature-transformations-explain-local.ipynb)\n",
        "    1. [Advanced feature transformations](https://github.com/interpretml/interpret-community/blob/master/notebooks/advanced-feature-transformations-explain-local.ipynb)\n",
        "1. [Run explainers remotely on Azure Machine Learning Compute (AMLCompute)](../remote-explanation/explain-model-on-amlcompute.ipynb)\n",
        "1. Inferencing time: deploy a classification model and explainer:\n",
        "    1. [Deploy a locally-trained model and explainer](../scoring-time/train-explain-model-locally-and-deploy.ipynb)\n",
        "    1. [Deploy a locally-trained keras model and explainer](../scoring-time/train-explain-model-keras-locally-and-deploy.ipynb)\n",
        "    1. [Deploy a remotely-trained model and explainer](../scoring-time/train-explain-model-on-amlcompute-and-deploy.ipynb)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": []
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "mesameki"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.8"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}