mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-20 09:37:04 -05:00
270 lines
7.7 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Summary\n",
    "Starting from raw data that mixes categorical and numeric features, featurize the categorical features with one-hot encoding. Then use the tabular explainer to build an explanation object and retrieve the raw feature importances."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
    "\n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Explain a model with the AML explain-model package on raw features\n",
    "\n",
    "1. Train a Logistic Regression model using scikit-learn.\n",
    "2. Run 'explain_model' with the full dataset in local mode, which doesn't contact any Azure services.\n",
    "3. Run 'explain_model' with a summarized dataset in local mode, which doesn't contact any Azure services.\n",
    "4. Visualize the global and local explanations with the visualization dashboard."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from azureml.contrib.explain.model.tabular_explainer import TabularExplainer\n",
    "from sklearn_pandas import DataFrameMapper\n",
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "titanic_url = ('https://raw.githubusercontent.com/amueller/'\n",
    "               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')\n",
    "data = pd.read_csv(titanic_url)\n",
    "# Fill missing values: forward fill first, then backward fill any leading gaps\n",
    "data = data.ffill()\n",
    "data = data.bfill()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Run model explainer locally with full data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As in the scikit-learn [column transformer with mixed types example](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py), use a subset of the columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "numeric_features = ['age', 'fare']\n",
    "categorical_features = ['embarked', 'sex', 'pclass']\n",
    "\n",
    "y = data['survived'].values\n",
    "X = data[categorical_features + numeric_features]\n",
    "\n",
    "x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
    "from sklearn_pandas import DataFrameMapper\n",
    "\n",
    "# Impute and standardize the numeric features\n",
    "numeric_transformations = [([f], Pipeline(steps=[\n",
    "    ('imputer', SimpleImputer(strategy='median')),\n",
    "    ('scaler', StandardScaler())])) for f in numeric_features]\n",
    "\n",
    "# One hot encode the categorical features\n",
    "categorical_transformations = [([f], OneHotEncoder(handle_unknown='ignore', sparse=False)) for f in categorical_features]\n",
    "\n",
    "\n",
    "transformations = numeric_transformations + categorical_transformations\n",
    "\n",
    "# Append classifier to preprocessing pipeline.\n",
    "# Now we have a full prediction pipeline.\n",
    "clf = Pipeline(steps=[('preprocessor', DataFrameMapper(transformations)),\n",
    "                      ('classifier', LogisticRegression(solver='lbfgs'))])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the Logistic Regression model that you want to explain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = clf.fit(x_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explain predictions on your local machine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Explain the fitted classifier (the last pipeline step); the transformations are passed so that importances refer to the raw features\n",
    "tabular_explainer = TabularExplainer(clf.steps[-1][1], initialization_examples=x_train, features=x_train.columns, transformations=transformations)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pass the test dataset as the evaluation examples; it must be a representative sample of the original data.\n",
    "# x_train can be passed instead, but with more examples the explanations take longer, although they may be more accurate.\n",
    "global_explanation = tabular_explainer.explain_global(x_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sorted_global_importance_values = global_explanation.get_ranked_global_values()\n",
    "sorted_global_importance_names = global_explanation.get_ranked_global_names()\n",
    "dict(zip(sorted_global_importance_names, sorted_global_importance_values))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explain overall model predictions as a collection of local (instance-level) explanations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# explain the first member of the test set\n",
    "local_explanation = tabular_explainer.explain_local(x_test[:1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the prediction for the first member of the test set and explain why the model made that prediction\n",
    "prediction_value = clf.predict(x_test[:1])[0]\n",
    "\n",
    "sorted_local_importance_values = local_explanation.get_ranked_local_values()[prediction_value]\n",
    "sorted_local_importance_names = local_explanation.get_ranked_local_names()[prediction_value]\n",
    "\n",
    "# Sorted local SHAP values\n",
    "print('ranked local importance values: {}'.format(sorted_local_importance_values))\n",
    "# Corresponding feature names\n",
    "print('ranked local importance names: {}'.format(sorted_local_importance_names))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Load visualization dashboard"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.contrib.explain.model.visualize import ExplanationDashboard"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ExplanationDashboard(global_explanation, model, x_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "mesameki"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "python36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}