{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright (c) Microsoft Corporation. All rights reserved.\n", "\n", "Licensed under the MIT License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/explain-model/azure-integration/run-history/save-retrieve-explanations-run-history.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Save and retrieve explanations via Azure Machine Learning Run History\n", "\n", "_**This notebook showcases how to use the Azure Machine Learning Interpretability SDK to save and retrieve classification model explanations to/from Azure Machine Learning Run History.**_\n", "\n", "\n", "## Table of Contents\n", "\n", "1. [Introduction](#Introduction)\n", "1. [Setup](#Setup)\n", "1. [Run model explainer locally at training time](#Explain)\n", " 1. Apply feature transformations\n", " 1. Train a binary classification model\n", " 1. Explain the model on raw features\n", " 1. Generate global explanations\n", " 1. Generate local explanations\n", "1. [Upload model explanations to Azure Machine Learning Run History](#Upload)\n", "1. [Download model explanations from Azure Machine Learning Run History](#Download)\n", "1. [Visualize explanations](#Visualize)\n", "1. 
[Next steps](#Next)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "This notebook showcases how to explain a classification model's predictions locally at training time, upload explanations to Azure Machine Learning's Run History, and download previously uploaded explanations from Run History.\n", "It demonstrates the API calls needed to upload and download global and local explanations, and a visualization dashboard that provides an interactive way of discovering patterns in the data and the downloaded explanations.\n", "\n", "We will showcase three tabular data explainers: TabularExplainer (SHAP), MimicExplainer (global surrogate), and PFIExplainer.\n", "\n", "\n", "\n", "Problem: IBM employee attrition classification with scikit-learn (run the model explainer locally and upload the explanation to Azure Machine Learning Run History)\n", "\n", "1. Train an SVM classification model using scikit-learn\n", "2. Run 'explain-model-sample' with AML Run History, which leverages the Run History service to store and manage the explanation data\n", "---\n", "\n", "Setup: If you are using Jupyter notebooks, the extensions should be installed automatically with the package.\n", "If you are using JupyterLab, run the following command:\n", "```\n", "(myenv) $ jupyter labextension install @jupyter-widgets/jupyterlab-manager\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explain\n", "\n", "### Run model explainer locally at training time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n", "from sklearn.svm import SVC\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# Explainers:\n", "# 1. 
SHAP Tabular Explainer\n", "from interpret.ext.blackbox import TabularExplainer\n", "\n", "# OR\n", "\n", "# 2. Mimic Explainer\n", "from interpret.ext.blackbox import MimicExplainer\n", "# You can use one of the following four interpretable models as a global surrogate for the black-box model\n", "from interpret.ext.glassbox import LGBMExplainableModel\n", "from interpret.ext.glassbox import LinearExplainableModel\n", "from interpret.ext.glassbox import SGDExplainableModel\n", "from interpret.ext.glassbox import DecisionTreeExplainableModel\n", "\n", "# OR\n", "\n", "# 3. PFI Explainer\n", "from interpret.ext.blackbox import PFIExplainer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the IBM employee attrition data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the IBM employee attrition dataset\n", "outdirname = 'dataset.6.21.19'\n", "try:\n", "    from urllib import urlretrieve\n", "except ImportError:\n", "    from urllib.request import urlretrieve\n", "import zipfile\n", "zipfilename = outdirname + '.zip'\n", "urlretrieve('https://publictestdatasets.blob.core.windows.net/data/' + zipfilename, zipfilename)\n", "with zipfile.ZipFile(zipfilename, 'r') as unzip:\n", "    unzip.extractall('.')\n", "attritionData = pd.read_csv('./WA_Fn-UseC_-HR-Employee-Attrition.csv')\n", "\n", "# Dropping EmployeeCount since all values are 1, so attrition is independent of this feature\n", "attritionData = attritionData.drop(['EmployeeCount'], axis=1)\n", "# Dropping EmployeeNumber since it is merely an identifier\n", "attritionData = attritionData.drop(['EmployeeNumber'], axis=1)\n", "\n", "# Dropping Over18 since all values are 'Y'\n", "attritionData = attritionData.drop(['Over18'], axis=1)\n", "\n", "# Dropping StandardHours since all values are 80\n", "attritionData = attritionData.drop(['StandardHours'], axis=1)\n", "\n", "# Converting the target variable from string to numerical values\n", "target_map = {'Yes': 1, 'No': 0}\n", "attritionData[\"Attrition_numerical\"] = 
attritionData[\"Attrition\"].apply(lambda x: target_map[x])\n", "target = attritionData[\"Attrition_numerical\"]\n", "\n", "attritionXData = attritionData.drop(['Attrition_numerical', 'Attrition'], axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Split data into train and test\n", "from sklearn.model_selection import train_test_split\n", "x_train, x_test, y_train, y_test = train_test_split(attritionXData, \n", "                                                    target, \n", "                                                    test_size=0.2,\n", "                                                    random_state=0,\n", "                                                    stratify=target)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Collect the names of the categorical features; the dummy columns\n", "# themselves are created later by the OneHotEncoder in the pipeline\n", "categorical = []\n", "for col, value in attritionXData.items():\n", "    if value.dtype == 'object':\n", "        categorical.append(col)\n", "\n", "# Store the remaining, numerical columns in 'numerical'\n", "numerical = attritionXData.columns.difference(categorical)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transform raw features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can explain raw features using either a `sklearn.compose.ColumnTransformer` or a list of fitted transformer tuples. The cell below uses `sklearn.compose.ColumnTransformer`. If you want to run the example with the list of fitted transformer tuples, comment out the cell below and uncomment the one that follows. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.compose import ColumnTransformer\n", "\n", "# We create the preprocessing pipelines for both numeric and categorical data.\n", "numeric_transformer = Pipeline(steps=[\n", " ('imputer', SimpleImputer(strategy='median')),\n", " ('scaler', StandardScaler())])\n", "\n", "categorical_transformer = Pipeline(steps=[\n", " ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),\n", " ('onehot', OneHotEncoder(handle_unknown='ignore'))])\n", "\n", "transformations = ColumnTransformer(\n", " transformers=[\n", " ('num', numeric_transformer, numerical),\n", " ('cat', categorical_transformer, categorical)])\n", "\n", "# Append classifier to preprocessing pipeline.\n", "# Now we have a full prediction pipeline.\n", "clf = Pipeline(steps=[('preprocessor', transformations),\n", " ('classifier', SVC(C=1.0, probability=True))])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'''\n", "# Uncomment below if sklearn-pandas is not installed\n", "#!pip install sklearn-pandas\n", "from sklearn_pandas import DataFrameMapper\n", "\n", "# Impute, standardize the numeric features and one-hot encode the categorical features. 
\n", "\n", "\n", "numeric_transformations = [([f], Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])) for f in numerical]\n", "\n", "categorical_transformations = [([f], OneHotEncoder(handle_unknown='ignore', sparse=False)) for f in categorical]\n", "\n", "transformations = numeric_transformations + categorical_transformations\n", "\n", "# Append classifier to preprocessing pipeline.\n", "# Now we have a full prediction pipeline.\n", "clf = Pipeline(steps=[('preprocessor', DataFrameMapper(transformations)),\n", "                      ('classifier', SVC(C=1.0, probability=True))])\n", "'''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train an SVM classification model that you want to explain" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = clf.fit(x_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explain predictions on your local machine" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 1. Using SHAP TabularExplainer\n", "# clf.steps[-1][1] returns the trained classification model\n", "explainer = TabularExplainer(clf.steps[-1][1], \n", "                             initialization_examples=x_train, \n", "                             features=attritionXData.columns, \n", "                             classes=[\"Not leaving\", \"leaving\"], \n", "                             transformations=transformations)\n", "\n", "# 2. Using MimicExplainer\n", "# augment_data is optional; if True, it oversamples the initialization examples to improve the surrogate model's fit to the original model. Useful for high-dimensional data where the number of rows is less than the number of columns.
\n", "# max_num_of_augmentations is optional and defines the maximum number of times the input data size can be increased.\n", "# LGBMExplainableModel can be replaced with LinearExplainableModel, SGDExplainableModel, or DecisionTreeExplainableModel\n", "# explainer = MimicExplainer(clf.steps[-1][1], \n", "#                            x_train, \n", "#                            LGBMExplainableModel, \n", "#                            augment_data=True, \n", "#                            max_num_of_augmentations=10, \n", "#                            features=attritionXData.columns, \n", "#                            classes=[\"Not leaving\", \"leaving\"], \n", "#                            transformations=transformations)\n", "\n", "# 3. Using PFIExplainer\n", "\n", "# Use the parameter \"metric\" to pass a metric name or function to evaluate the permutation.\n", "# Note that if a metric function is provided, a higher value must indicate a better model.\n", "# Otherwise, take the negative of the function or set the parameter \"is_error_metric\" to True.\n", "# Default metrics:\n", "# F1 score for binary classification, F1 score with micro average for multiclass classification, and\n", "# mean absolute error for regression\n", "\n", "# explainer = PFIExplainer(clf.steps[-1][1], \n", "#                          features=x_train.columns, \n", "#                          transformations=transformations,\n", "#                          classes=[\"Not leaving\", \"leaving\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate global explanations\n", "Explain overall model predictions (global explanation)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pass in the test dataset as evaluation examples - note that it must be a representative sample of the original data\n", "# x_train can be passed as well, but with more examples the explanations take longer, although they may be more accurate\n", "global_explanation = explainer.explain_global(x_test)\n", "\n", "# Note: if you used the PFIExplainer in the previous step, use the next line of code instead\n", "# global_explanation = explainer.explain_global(x_test, true_labels=y_test)" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sorted SHAP values\n", "print('ranked global importance values: {}'.format(global_explanation.get_ranked_global_values()))\n", "# Corresponding feature names\n", "print('ranked global importance names: {}'.format(global_explanation.get_ranked_global_names()))\n", "# Feature ranks (based on original order of features)\n", "print('global importance rank: {}'.format(global_explanation.global_importance_rank))\n", "\n", "# Note: PFIExplainer does not support per-class explanations\n", "# Per-class feature names\n", "print('ranked per class feature names: {}'.format(global_explanation.get_ranked_per_class_names()))\n", "# Per-class feature importance values\n", "print('ranked per class feature values: {}'.format(global_explanation.get_ranked_per_class_values()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print out a dictionary that holds the sorted feature importance names and values\n", "print('feature importance dictionary: {}'.format(global_explanation.get_feature_importance_dict()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explain overall model predictions as a collection of local (instance-level) explanations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# SHAP values for all features and all data points in the evaluation (test) data\n", "print('local importance values: {}'.format(global_explanation.local_importance_values))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate local explanations\n", "Explain local data points (individual instances)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note: PFIExplainer does not support local explanations\n", "# You can pass a specific data point or a group of data points to the explain_local function\n", "\n", "# E.g., explain the first data point in the 
test set\n", "instance_num = 1\n", "local_explanation = explainer.explain_local(x_test[:instance_num])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the prediction for the first member of the test set and explain why the model made that prediction\n", "# Note: index 0 is the first test point, which is the one explained above\n", "prediction_value = clf.predict(x_test)[0]\n", "\n", "sorted_local_importance_values = local_explanation.get_ranked_local_values()[prediction_value]\n", "sorted_local_importance_names = local_explanation.get_ranked_local_names()[prediction_value]\n", "\n", "print('local importance values: {}'.format(sorted_local_importance_values))\n", "print('local importance names: {}'.format(sorted_local_importance_names))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Upload\n", "Upload explanations to Azure Machine Learning Run History" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import azureml.core\n", "from azureml.core import Workspace, Experiment\n", "from azureml.interpret import ExplanationClient\n", "# Check core SDK version number\n", "print(\"SDK version:\", azureml.core.VERSION)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ws = Workspace.from_config()\n", "print('Workspace name: ' + ws.name, \n", "      'Azure region: ' + ws.location, \n", "      'Subscription id: ' + ws.subscription_id, \n", "      'Resource group: ' + ws.resource_group, sep = '\\n')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "experiment_name = 'explain-model-sample'\n", "experiment = Experiment(ws, experiment_name)\n", "run = experiment.start_logging()\n", "client = ExplanationClient.from_run(run)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Uploading model explanation data for storage or visualization in webUX\n", "# The explanation can then be downloaded on any 
compute\n", "# Multiple explanations can be uploaded\n", "client.upload_model_explanation(global_explanation, comment='global explanation: all features')\n", "# Or you can upload only the explanation object with the top k feature info\n", "#client.upload_model_explanation(global_explanation, top_k=2, comment='global explanation: Only top 2 features')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Uploading model explanation data for storage or visualization in webUX\n", "# The explanation can then be downloaded on any compute\n", "# Multiple explanations can be uploaded\n", "client.upload_model_explanation(local_explanation, comment='local explanation for test point 1: all features')\n", "\n", "# Alternatively, you can upload only the local explanation object with the top k feature info\n", "#client.upload_model_explanation(local_explanation, top_k=2, comment='local explanation: top 2 features')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download\n", "Download explanations from Azure Machine Learning Run History" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# List uploaded explanations\n", "client.list_model_explanations()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for explanation in client.list_model_explanations():\n", "    if explanation['comment'] == 'local explanation for test point 1: all features':\n", "        downloaded_local_explanation = client.download_model_explanation(explanation_id=explanation['id'])\n", "        # You can pass a k value to only download the top k feature importance values\n", "        downloaded_local_explanation_top2 = client.download_model_explanation(top_k=2, explanation_id=explanation['id'])\n", "    elif explanation['comment'] == 'global explanation: all features':\n", "        downloaded_global_explanation = 
client.download_model_explanation(explanation_id=explanation['id'])\n", " # You can pass a k value to only download the top k feature importance values\n", " downloaded_global_explanation_top2 = client.download_model_explanation(top_k=2, explanation_id=explanation['id'])\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize\n", "Load the visualization dashboard" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from interpret_community.widget import ExplanationDashboard" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ExplanationDashboard(downloaded_global_explanation, model, datasetX=x_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## End\n", "Complete the run" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "run.complete()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next\n", "Learn about other use cases of the explain package on a:\n", "1. [Training time: regression problem](https://github.com/interpretml/interpret-community/blob/master/notebooks/explain-regression-local.ipynb) \n", "1. [Training time: binary classification problem](https://github.com/interpretml/interpret-community/blob/master/notebooks/explain-binary-classification-local.ipynb)\n", "1. [Training time: multiclass classification problem](https://github.com/interpretml/interpret-community/blob/master/notebooks/explain-multiclass-classification-local.ipynb)\n", "1. Explain models with engineered features:\n", " 1. [Simple feature transformations](https://github.com/interpretml/interpret-community/blob/master/notebooks/simple-feature-transformations-explain-local.ipynb)\n", " 1. [Advanced feature transformations](https://github.com/interpretml/interpret-community/blob/master/notebooks/advanced-feature-transformations-explain-local.ipynb)\n", "1. 
[Run explainers remotely on Azure Machine Learning Compute (AMLCompute)](../remote-explanation/explain-model-on-amlcompute.ipynb)\n", "1. Inferencing time: deploy a classification model and explainer:\n", " 1. [Deploy a locally-trained model and explainer](../scoring-time/train-explain-model-locally-and-deploy.ipynb)\n", " 1. [Deploy a locally-trained keras model and explainer](../scoring-time/train-explain-model-keras-locally-and-deploy.ipynb)\n", " 1. [Deploy a remotely-trained model and explainer](../scoring-time/train-explain-model-on-amlcompute-and-deploy.ipynb)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "authors": [ { "name": "mesameki" } ], "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }