{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Summary\n", "From raw data that is a mixture of categoricals and numeric, featurize the categoricals using one hot encoding. Use tabular explainer to get explain object and then get raw feature importances" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright (c) Microsoft Corporation. All rights reserved.\n", "\n", "Licensed under the MIT License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load titanic dataset. Impute missing values by filling both backward and forward since some data is at the first/last row. This is just for illustration and not a recommended way to impute missing data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "titanic_url = ('https://raw.githubusercontent.com/amueller/'\n", " 'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')\n", "data = pd.read_csv(titanic_url)\n", "# fill missing values\n", "data = data.fillna(method=\"ffill\")\n", "data = data.fillna(method=\"bfill\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to example [here](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py), use a subset of columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "numeric_features = ['age', 'fare']\n", "categorical_features = ['embarked', 'sex', 'pclass']\n", "\n", "y = data['survived'].values\n", "X = data[categorical_features + numeric_features]\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One hot encode the categorical features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder\n", "one_enc = OneHotEncoder()\n", "one_enc.fit(X_train[categorical_features])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Columnwise concatenate one hot encoded categoricals and numerical features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy import sparse\n", "def get_feats(X):\n", " a = one_enc.transform(X[categorical_features])\n", " b = X[numeric_features]\n", " return sparse.hstack((one_enc.transform(X[categorical_features]), X[numeric_features].values))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train a logistic regression model on featurized training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "X_train_transformed = get_feats(X_train)\n", "X_test_transformed = get_feats(X_test)\n", "\n", "clf = LogisticRegression(solver='lbfgs', max_iter=200)\n", "clf.fit(X_train_transformed, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get feature mapping between raw and generated features. Using the order in which features are concatenated in `get_feats` and using `categories_` in `OneHotEncoder` we are able to compute this mapping." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "raw_feat_mapping = []\n", "start_index = 0\n", "for cat_list in one_enc.categories_:\n", " raw_feat_mapping.append([start_index + i for i in range(len(cat_list))])\n", " start_index += len(cat_list)\n", "for i in range(len(numeric_features)):\n", " raw_feat_mapping.append([start_index])\n", " start_index += 1 " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.explain.model.tabular_explainer import TabularExplainer\n", "\n", "explainer = TabularExplainer(clf, X_train_transformed)\n", "global_explanation = explainer.explain_global(X_test_transformed)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "raw_feat_imps = global_explanation.get_raw_feature_importances(raw_feat_mapping)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_names = categorical_features + numeric_features\n", "sorted_indices = np.argsort(raw_feat_imps)[::-1]\n", "\n", "for i in sorted_indices:\n", " print(\"{}: {}\".format(feature_names[i], raw_feat_imps[i]))" ] } ], "metadata": { "authors": [ { "name": "hichando" } ], "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }