{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Save and retrieve explanations via Azure Machine Learning Run History\n",
"\n",
"_**This notebook showcases how to use the Azure Machine Learning Interpretability SDK to save and retrieve classification model explanations to/from Azure Machine Learning Run History.**_\n",
"\n",
"\n",
"## Table of Contents\n",
"\n",
"1. [Introduction](#Introduction)\n",
"1. [Setup](#Setup)\n",
"1. [Run model explainer locally at training time](#Explain)\n",
"    1. Apply feature transformations\n",
"    1. Train a binary classification model\n",
"    1. Explain the model on raw features\n",
"    1. Generate global explanations\n",
"    1. Generate local explanations\n",
"1. [Upload model explanations to Azure Machine Learning Run History](#Upload)\n",
"1. [Download model explanations from Azure Machine Learning Run History](#Download)\n",
"1. [Visualize explanations](#Visualize)\n",
"1. [Next steps](#Next)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"This notebook shows how to explain a classification model's predictions locally at training time, upload the explanations to Azure Machine Learning Run History, and download previously uploaded explanations from Run History.\n",
"It demonstrates the API calls needed to upload and download global and local explanations, and a visualization dashboard for interactively exploring patterns in the data and the downloaded explanations.\n",
"\n",
"We will showcase three tabular data explainers: TabularExplainer (SHAP), MimicExplainer (global surrogate), and PFIExplainer.\n",
"\n",
"Problem: IBM employee attrition classification with scikit-learn (run model explainer locally and upload explanation to the Azure Machine Learning Run History)\n",
"\n",
"1. Train an SVM classification model using scikit-learn\n",
"2. Run 'explain-model-sample' with AML Run History, which leverages the Run History service to store and manage the explanation data\n",
"---\n",
"\n",
"Setup: If you are using Jupyter notebooks, the extensions should be installed automatically with the package.\n",
"If you are using JupyterLab, run the following command:\n",
"```\n",
"(myenv) $ jupyter labextension install @jupyter-widgets/jupyterlab-manager\n",
"```\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explain\n",
"\n",
"### Run model explainer locally at training time"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
"from sklearn.svm import SVC\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Explainers:\n",
"# 1. SHAP Tabular Explainer\n",
"from interpret.ext.blackbox import TabularExplainer\n",
"\n",
"# OR\n",
"\n",
"# 2. Mimic Explainer\n",
"from interpret.ext.blackbox import MimicExplainer\n",
"# You can use one of the following four interpretable models as a global surrogate to the black box model\n",
"from interpret.ext.glassbox import LGBMExplainableModel\n",
"from interpret.ext.glassbox import LinearExplainableModel\n",
"from interpret.ext.glassbox import SGDExplainableModel\n",
"from interpret.ext.glassbox import DecisionTreeExplainableModel\n",
"\n",
"# OR\n",
"\n",
"# 3. PFI Explainer\n",
"from interpret.ext.blackbox import PFIExplainer "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load the IBM employee attrition data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get the IBM employee attrition dataset\n",
"outdirname = 'dataset.6.21.19'\n",
"try:\n",
"    from urllib import urlretrieve\n",
"except ImportError:\n",
"    from urllib.request import urlretrieve\n",
"import zipfile\n",
"zipfilename = outdirname + '.zip'\n",
"urlretrieve('https://publictestdatasets.blob.core.windows.net/data/' + zipfilename, zipfilename)\n",
"with zipfile.ZipFile(zipfilename, 'r') as unzip:\n",
"    unzip.extractall('.')\n",
"attritionData = pd.read_csv('./WA_Fn-UseC_-HR-Employee-Attrition.csv')\n",
"\n",
"# Dropping Employee count as all values are 1 and hence attrition is independent of this feature\n",
"attritionData = attritionData.drop(['EmployeeCount'], axis=1)\n",
"# Dropping Employee Number since it is merely an identifier\n",
"attritionData = attritionData.drop(['EmployeeNumber'], axis=1)\n",
"\n",
"attritionData = attritionData.drop(['Over18'], axis=1)\n",
"\n",
"# Since all values are 80\n",
"attritionData = attritionData.drop(['StandardHours'], axis=1)\n",
"\n",
"# Converting target variables from string to numerical values\n",
"target_map = {'Yes': 1, 'No': 0}\n",
"attritionData[\"Attrition_numerical\"] = attritionData[\"Attrition\"].apply(lambda x: target_map[x])\n",
"target = attritionData[\"Attrition_numerical\"]\n",
"\n",
"attritionXData = attritionData.drop(['Attrition_numerical', 'Attrition'], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Split data into train and test\n",
"from sklearn.model_selection import train_test_split\n",
"x_train, x_test, y_train, y_test = train_test_split(attritionXData, \n",
"                                                    target, \n",
"                                                    test_size=0.2,\n",
"                                                    random_state=0,\n",
"                                                    stratify=target)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Collect the names of the categorical (object-typed) columns\n",
"categorical = []\n",
"# DataFrame.items() replaces iteritems(), which was removed in pandas 2.0\n",
"for col, value in attritionXData.items():\n",
"    if value.dtype == 'object':\n",
"        categorical.append(col)\n",
"\n",
"# Store the remaining (numerical) column names in numerical\n",
"numerical = attritionXData.columns.difference(categorical) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transform raw features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can explain raw features by using either a `sklearn.compose.ColumnTransformer` or a list of fitted transformer tuples. The cell below uses `sklearn.compose.ColumnTransformer`. To run the example with the list of fitted transformer tuples instead, comment out the cell below and uncomment the cell after it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.compose import ColumnTransformer\n",
"\n",
"# We create the preprocessing pipelines for both numeric and categorical data.\n",
"numeric_transformer = Pipeline(steps=[\n",
"    ('imputer', SimpleImputer(strategy='median')),\n",
"    ('scaler', StandardScaler())])\n",
"\n",
"categorical_transformer = Pipeline(steps=[\n",
"    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),\n",
"    ('onehot', OneHotEncoder(handle_unknown='ignore'))])\n",
"\n",
"transformations = ColumnTransformer(\n",
"    transformers=[\n",
"        ('num', numeric_transformer, numerical),\n",
"        ('cat', categorical_transformer, categorical)])\n",
"\n",
"# Append classifier to preprocessing pipeline.\n",
"# Now we have a full prediction pipeline.\n",
"clf = Pipeline(steps=[('preprocessor', transformations),\n",
"                      ('classifier', SVC(C=1.0, probability=True))])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'''\n",
"# Uncomment below if sklearn-pandas is not installed\n",
"#!pip install sklearn-pandas\n",
"from sklearn_pandas import DataFrameMapper\n",
"\n",
"# Impute, standardize the numeric features and one-hot encode the categorical features. \n",
"\n",
"\n",
"numeric_transformations = [([f], Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])) for f in numerical]\n",
"\n",
"categorical_transformations = [([f], OneHotEncoder(handle_unknown='ignore', sparse=False)) for f in categorical]\n",
"\n",
"transformations = numeric_transformations + categorical_transformations\n",
"\n",
"# Append classifier to preprocessing pipeline.\n",
"# Wrap the list of transformer tuples in a DataFrameMapper so it can serve as a pipeline step.\n",
"# Now we have a full prediction pipeline.\n",
"clf = Pipeline(steps=[('preprocessor', DataFrameMapper(transformations)),\n",
"                      ('classifier', SVC(C=1.0, probability=True))]) \n",
"\n",
"\n",
"\n",
"'''"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train the SVM classification model that you want to explain"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = clf.fit(x_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explain predictions on your local machine"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 1. Using SHAP TabularExplainer\n",
"# clf.steps[-1][1] returns the trained classification model\n",
"explainer = TabularExplainer(clf.steps[-1][1], \n",
"                             initialization_examples=x_train, \n",
"                             features=attritionXData.columns, \n",
"                             classes=[\"Not leaving\", \"leaving\"], \n",
"                             transformations=transformations)\n",
"\n",
"\n",
"\n",
"\n",
"# 2. Using MimicExplainer\n",
"# augment_data is optional and if true, oversamples the initialization examples to improve surrogate model accuracy to fit original model. Useful for high-dimensional data where the number of rows is less than the number of columns. \n",
"# max_num_of_augmentations is optional and defines max number of times we can increase the input data size.\n",
"# LGBMExplainableModel can be replaced with LinearExplainableModel, SGDExplainableModel, or DecisionTreeExplainableModel\n",
"# explainer = MimicExplainer(clf.steps[-1][1], \n",
"# x_train, \n",
"# LGBMExplainableModel, \n",
"# augment_data=True, \n",
"# max_num_of_augmentations=10, \n",
"# features=attritionXData.columns, \n",
"# classes=[\"Not leaving\", \"leaving\"], \n",
"# transformations=transformations)\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"# 3. Using PFIExplainer\n",
"\n",
"# Use the parameter \"metric\" to pass a metric name or function to evaluate the permutation. \n",
"# Note that if a metric function is provided a higher value must be better.\n",
"# Otherwise, take the negative of the function or set the parameter \"is_error_metric\" to True.\n",
"# Default metrics: \n",
"# F1 Score for binary classification, F1 Score with micro average for multiclass classification and\n",
"# Mean absolute error for regression\n",
"\n",
"# explainer = PFIExplainer(clf.steps[-1][1], \n",
"# features=x_train.columns, \n",
"# transformations=transformations,\n",
"# classes=[\"Not leaving\", \"leaving\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate global explanations\n",
"Explain overall model predictions (global explanation)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Passing in test dataset for evaluation examples - note it must be a representative sample of the original data\n",
"# x_train can be passed as well, but with more examples explanations will take longer although they may be more accurate\n",
"global_explanation = explainer.explain_global(x_test)\n",
"\n",
"# Note: if you used the PFIExplainer in the previous step, use the next line of code instead\n",
"# global_explanation = explainer.explain_global(x_test, true_labels=y_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sorted SHAP values\n",
"print('ranked global importance values: {}'.format(global_explanation.get_ranked_global_values()))\n",
"# Corresponding feature names\n",
"print('ranked global importance names: {}'.format(global_explanation.get_ranked_global_names()))\n",
"# Feature ranks (based on original order of features)\n",
"print('global importance rank: {}'.format(global_explanation.global_importance_rank))\n",
"\n",
"# Note: PFIExplainer does not support per class explanations\n",
"# Per class feature names\n",
"print('ranked per class feature names: {}'.format(global_explanation.get_ranked_per_class_names()))\n",
"# Per class feature importance values\n",
"print('ranked per class feature values: {}'.format(global_explanation.get_ranked_per_class_values()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print out a dictionary that holds the sorted feature importance names and values\n",
"print('global feature importance: {}'.format(global_explanation.get_feature_importance_dict()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explain overall model predictions as a collection of local (instance-level) explanations"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Feature SHAP values for all features and all data points in the evaluation data (x_test) passed to explain_global\n",
"print('local importance values: {}'.format(global_explanation.local_importance_values))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate local explanations\n",
"Explain local data points (individual instances)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Note: PFIExplainer does not support local explanations\n",
"# You can pass a specific data point or a group of data points to the explain_local function\n",
"\n",
"# E.g., Explain the first data point in the test set\n",
"instance_num = 1\n",
"local_explanation = explainer.explain_local(x_test[:instance_num])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get the prediction for the first member of the test set (the instance explained above) and explain why the model made that prediction\n",
"prediction_value = clf.predict(x_test[:instance_num])[0]\n",
"\n",
"sorted_local_importance_values = local_explanation.get_ranked_local_values()[prediction_value]\n",
"sorted_local_importance_names = local_explanation.get_ranked_local_names()[prediction_value]\n",
"\n",
"print('local importance values: {}'.format(sorted_local_importance_values))\n",
"print('local importance names: {}'.format(sorted_local_importance_names))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Upload\n",
"Upload explanations to Azure Machine Learning Run History"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.core\n",
"from azureml.core import Workspace, Experiment\n",
"from azureml.interpret import ExplanationClient\n",
"# Check core SDK version number\n",
"print(\"SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"print('Workspace name: ' + ws.name, \n",
"      'Azure region: ' + ws.location, \n",
"      'Subscription id: ' + ws.subscription_id, \n",
"      'Resource group: ' + ws.resource_group, sep = '\\n')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"experiment_name = 'explain-model-sample'\n",
"experiment = Experiment(ws, experiment_name)\n",
"run = experiment.start_logging()\n",
"client = ExplanationClient.from_run(run)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uploading model explanation data for storage or visualization in webUX\n",
"# The explanation can then be downloaded on any compute\n",
"# Multiple explanations can be uploaded\n",
"client.upload_model_explanation(global_explanation, comment='global explanation: all features')\n",
"# Or you can only upload the explanation object with the top k feature info\n",
"#client.upload_model_explanation(global_explanation, top_k=2, comment='global explanation: Only top 2 features')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uploading model explanation data for storage or visualization in webUX\n",
"# The explanation can then be downloaded on any compute\n",
"# Multiple explanations can be uploaded\n",
"client.upload_model_explanation(local_explanation, comment='local explanation for test point 1: all features')\n",
"\n",
"# Alternatively, you can upload only the top k features of the local explanation\n",
"#client.upload_model_explanation(local_explanation, top_k=2, comment='local explanation: top 2 features')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download\n",
"Download explanations from Azure Machine Learning Run History"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List uploaded explanations\n",
"client.list_model_explanations()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for explanation in client.list_model_explanations():\n",
" \n",
" if explanation['comment'] == 'local explanation for test point 1: all features':\n",
" downloaded_local_explanation = client.download_model_explanation(explanation_id=explanation['id'])\n",
" # You can pass a k value to only download the top k feature importance values\n",
" downloaded_local_explanation_top2 = client.download_model_explanation(top_k=2, explanation_id=explanation['id'])\n",
" \n",
" \n",
" elif explanation['comment'] == 'global explanation: all features':\n",
" downloaded_global_explanation = client.download_model_explanation(explanation_id=explanation['id'])\n",
" # You can pass a k value to only download the top k feature importance values\n",
" downloaded_global_explanation_top2 = client.download_model_explanation(top_k=2, explanation_id=explanation['id'])\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize\n",
"Load the visualization dashboard"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from interpret_community.widget import ExplanationDashboard"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ExplanationDashboard(downloaded_global_explanation, model, datasetX=x_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## End\n",
"Complete the run"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run.complete()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next\n",
"Learn about other use cases of the explain package:\n",
"1. [Training time: regression problem](https://github.com/interpretml/interpret-community/blob/master/notebooks/explain-regression-local.ipynb) \n",
"1. [Training time: binary classification problem](https://github.com/interpretml/interpret-community/blob/master/notebooks/explain-binary-classification-local.ipynb)\n",
"1. [Training time: multiclass classification problem](https://github.com/interpretml/interpret-community/blob/master/notebooks/explain-multiclass-classification-local.ipynb)\n",
"1. Explain models with engineered features:\n",
"    1. [Simple feature transformations](https://github.com/interpretml/interpret-community/blob/master/notebooks/simple-feature-transformations-explain-local.ipynb)\n",
"    1. [Advanced feature transformations](https://github.com/interpretml/interpret-community/blob/master/notebooks/advanced-feature-transformations-explain-local.ipynb)\n",
"1. [Run explainers remotely on Azure Machine Learning Compute (AMLCompute)](../remote-explanation/explain-model-on-amlcompute.ipynb)\n",
"1. Inferencing time: deploy a classification model and explainer:\n",
"    1. [Deploy a locally-trained model and explainer](../scoring-time/train-explain-model-locally-and-deploy.ipynb)\n",
"    1. [Deploy a locally-trained Keras model and explainer](../scoring-time/train-explain-model-keras-locally-and-deploy.ipynb)\n",
"    1. [Deploy a remotely-trained model and explainer](../scoring-time/train-explain-model-on-amlcompute-and-deploy.ipynb)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"authors": [
{
"name": "mesameki"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}