{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Summary\n",
"From raw data that is a mixture of categorical and numeric features, featurize the categorical columns using one-hot encoding. Use the tabular explainer to get an explanation object and then compute the raw feature importances."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the Titanic dataset. Impute missing values by filling forward and then backward, since some values are missing in the first/last rows. This is only for illustration and is not a recommended way to impute missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"titanic_url = ('https://raw.githubusercontent.com/amueller/'\n",
"               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')\n",
"data = pd.read_csv(titanic_url)\n",
"# fill missing values forward, then backward (illustration only)\n",
"data = data.ffill()\n",
"data = data.bfill()"
]
},
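{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a point of comparison, a more conventional imputation could use scikit-learn's `SimpleImputer` (median for numeric columns, most frequent value for categoricals). This is only a sketch of an alternative, not the approach used in the rest of this notebook; the `raw` and `data_imputed` names below are introduced just for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.impute import SimpleImputer\n",
"\n",
"# sketch only: impute the raw data with median / most-frequent instead of ffill/bfill\n",
"raw = pd.read_csv(titanic_url)\n",
"num_cols = raw.select_dtypes(include='number').columns\n",
"cat_cols = raw.select_dtypes(exclude='number').columns\n",
"\n",
"data_imputed = raw.copy()\n",
"data_imputed[num_cols] = SimpleImputer(strategy='median').fit_transform(raw[num_cols])\n",
"data_imputed[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(raw[cat_cols])"
]
},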
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar to the example [here](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py), use a subset of the columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"numeric_features = ['age', 'fare']\n",
"categorical_features = ['embarked', 'sex', 'pclass']\n",
"\n",
"y = data['survived'].values\n",
"X = data[categorical_features + numeric_features]\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)"
]
},
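{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, a quick sanity check on the split: confirm the train/test sizes and the class balance of the target before training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional sanity check (sketch): split sizes and class balance\n",
"print(X_train.shape, X_test.shape)\n",
"print(pd.Series(y_train).value_counts(normalize=True))"
]
},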
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One-hot encode the categorical features."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import OneHotEncoder\n",
"one_enc = OneHotEncoder()\n",
"one_enc.fit(X_train[categorical_features])"
]
},
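{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fitted encoder stores the categories it found for each column in `categories_`; inspecting them shows how many one-hot columns each categorical feature will expand into. This is the same attribute the raw-feature mapping below is built from."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# inspect the learned categories; each list expands into that many one-hot columns\n",
"for name, cats in zip(categorical_features, one_enc.categories_):\n",
"    print(name, list(cats))"
]
},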
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Column-wise concatenate the one-hot encoded categoricals and the numeric features."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy import sparse\n",
"\n",
"def get_feats(X):\n",
"    # one-hot encode the categoricals, then stack them column-wise with the numeric columns\n",
"    encoded_cats = one_enc.transform(X[categorical_features])\n",
"    numerics = X[numeric_features].values\n",
"    return sparse.hstack((encoded_cats, numerics))"
]
},
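{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check, the number of columns produced by `get_feats` should equal the total number of learned categories plus the number of numeric features. The `expected_cols` name below is introduced just for this check."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch of a shape check on the featurized training data\n",
"expected_cols = sum(len(c) for c in one_enc.categories_) + len(numeric_features)\n",
"print(get_feats(X_train).shape, expected_cols)"
]
},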
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Train a logistic regression model on featurized training data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"X_train_transformed = get_feats(X_train)\n",
"X_test_transformed = get_feats(X_test)\n",
"\n",
"clf = LogisticRegression(solver='lbfgs', max_iter=200)\n",
"clf.fit(X_train_transformed, y_train)"
]
},
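{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, check how well the model generalizes before explaining it; `clf.score` reports mean accuracy on the held-out test set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# mean accuracy on the held-out test set\n",
"print('test accuracy: {:.3f}'.format(clf.score(X_test_transformed, y_test)))"
]
},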
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the mapping between raw and generated features. Using the order in which features are concatenated in `get_feats` and the `categories_` attribute of `OneHotEncoder`, we can compute this mapping: each categorical feature maps to the contiguous block of one-hot columns generated from it, and each numeric feature maps to a single column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"raw_feat_mapping = []\n",
"start_index = 0\n",
"# each categorical feature maps to one generated column per category\n",
"for cat_list in one_enc.categories_:\n",
"    raw_feat_mapping.append([start_index + i for i in range(len(cat_list))])\n",
"    start_index += len(cat_list)\n",
"# each numeric feature maps to a single generated column\n",
"for i in range(len(numeric_features)):\n",
"    raw_feat_mapping.append([start_index])\n",
"    start_index += 1"
]
},
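{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sanity check on the mapping: taken together, the index lists should cover every generated column exactly once. The `covered` name below is introduced just for this check."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch of a consistency check between the mapping and the featurized matrix\n",
"covered = sorted(i for idx_list in raw_feat_mapping for i in idx_list)\n",
"assert covered == list(range(X_train_transformed.shape[1]))"
]
},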
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.explain.model.tabular_explainer import TabularExplainer\n",
"\n",
"explainer = TabularExplainer(clf, X_train_transformed)\n",
"global_explanation = explainer.explain_global(X_test_transformed)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"raw_feat_imps = global_explanation.get_raw_feature_importances(raw_feat_mapping)"
]
},
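{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, the raw importances can also be visualized as a bar chart. This is a minimal sketch, assuming `matplotlib` is available in the environment; `raw_names` and `order` are names introduced here for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# bar chart of raw feature importances, largest first\n",
"raw_names = categorical_features + numeric_features\n",
"order = np.argsort(raw_feat_imps)[::-1]\n",
"plt.bar([raw_names[i] for i in order], [raw_feat_imps[i] for i in order])\n",
"plt.ylabel('importance')\n",
"plt.title('Raw feature importances')\n",
"plt.show()"
]
},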
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"feature_names = categorical_features + numeric_features\n",
"sorted_indices = np.argsort(raw_feat_imps)[::-1]\n",
"\n",
"for i in sorted_indices:\n",
"    print(\"{}: {}\".format(feature_names[i], raw_feat_imps[i]))"
]
}
],
"metadata": {
"authors": [
{
"name": "hichando"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}