# Summary
Starting from raw data that mixes categorical and numeric columns, featurize the categorical columns with one-hot encoding. Then use `TabularExplainer` to obtain an explanation object and retrieve feature importances for the raw (pre-encoding) features.

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

Load the Titanic dataset. Impute missing values by filling forward and then backward, since some missing values fall in the first or last row and a single pass cannot reach them. This is for illustration only and is not a recommended way to impute missing data.

In [None]:
import pandas as pd

titanic_url = ('https://raw.githubusercontent.com/amueller/'
 'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)
# Fill missing values: forward fill first, then backward fill any leading gaps.
# (`fillna(method=...)` is deprecated in recent pandas releases.)
data = data.ffill()
data = data.bfill()
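
A minimal sketch of why both passes are needed, using a hypothetical toy column (the `toy` frame and its values are made up for illustration and are not part of the Titanic data): a forward fill alone leaves a gap in the first row, and the backward pass closes it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with gaps in the first, middle, and last rows.
toy = pd.DataFrame({"age": [np.nan, 22.0, np.nan, 35.0, np.nan]})

# Forward fill copies the previous valid value down, so the first row stays NaN.
forward_only = toy.ffill()

# A backward fill afterwards copies the next valid value up into the leading gap.
both_ways = toy.ffill().bfill()

print(forward_only["age"].tolist())
print(both_ways["age"].tolist())
```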

In [None]:
data.columns

Similar to the example [here](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py), use a subset of the columns.

In [None]:
from sklearn.model_selection import train_test_split

numeric_features = ['age', 'fare']
categorical_features = ['embarked', 'sex', 'pclass']

y = data['survived'].values
X = data[categorical_features + numeric_features]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

One-hot encode the categorical features.

In [None]:
from sklearn.preprocessing import OneHotEncoder
one_enc = OneHotEncoder()
one_enc.fit(X_train[categorical_features])
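
To make the encoder's output concrete, here is a small sketch on a hypothetical two-column frame (`toy` and `toy_enc` are illustrative names, not part of the notebook). Each categorical column expands into one output column per category, in the order reported by `categories_`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical columns, three rows.
toy = pd.DataFrame({"sex": ["male", "female", "female"], "pclass": [1, 3, 2]})

toy_enc = OneHotEncoder()
toy_encoded = toy_enc.fit_transform(toy)  # sparse matrix, one column per category

# Number of generated columns contributed by each raw column.
category_counts = [len(cats) for cats in toy_enc.categories_]

print(category_counts)
print(toy_encoded.toarray())
```

The per-column category counts are what later determine which generated columns belong to which raw feature.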

Concatenate the one-hot encoded categorical features and the numeric features column-wise.

In [None]:
import numpy as np
from scipy import sparse
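def get_feats(X):
    # One-hot encode the categoricals (sparse) and take the numeric columns as-is.
    encoded = one_enc.transform(X[categorical_features])
    numeric = X[numeric_features].values
    # Encoded columns come first, numeric columns last; the mapping below relies on this order.
    return sparse.hstack((encoded, numeric))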
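
As a sketch of what the column-wise concatenation produces, the following stacks a hypothetical 2x2 sparse block next to a dense 2x2 numeric block (all values are made up). The conversion to CSR at the end is an optional extra, shown on the assumption that later code may want row slicing.

```python
import numpy as np
from scipy import sparse

# Hypothetical one-hot block (sparse) and numeric block (dense), two rows each.
encoded = sparse.csr_matrix(np.array([[1, 0], [0, 1]]))
numeric = np.array([[29.0, 211.3], [2.0, 151.6]])

# hstack places the blocks side by side: encoded columns first, numeric columns last.
stacked = sparse.hstack((encoded, numeric))

# Optional: convert to CSR if row slicing is needed downstream.
stacked_csr = stacked.tocsr()

print(stacked.shape)
print(stacked_csr.toarray())
```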

Train a logistic regression model on featurized training data.

In [None]:
from sklearn.linear_model import LogisticRegression

X_train_transformed = get_feats(X_train)
X_test_transformed = get_feats(X_test)

clf = LogisticRegression(solver='lbfgs', max_iter=200)
clf.fit(X_train_transformed, y_train)

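Build the mapping between raw and generated (engineered) features. The mapping follows from the order in which features are concatenated in `get_feats` and from the `categories_` attribute of the fitted `OneHotEncoder`: each categorical feature maps to one generated column per category, and each numeric feature maps to a single column.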

In [None]:
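raw_feat_mapping = []
start_index = 0
# Each categorical feature owns one contiguous block of one-hot columns.
for cat_list in one_enc.categories_:
    raw_feat_mapping.append([start_index + i for i in range(len(cat_list))])
    start_index += len(cat_list)
# Each numeric feature owns exactly one column, placed after the one-hot blocks.
for _ in range(len(numeric_features)):
    raw_feat_mapping.append([start_index])
    start_index += 1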
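
The same index arithmetic can be sketched as a standalone function and run on assumed sizes. The helper name `build_raw_feat_mapping` and the category counts `[3, 2, 3]` are hypothetical, chosen only to illustrate the layout of three categorical features followed by two numeric ones.

```python
def build_raw_feat_mapping(category_counts, n_numeric):
    """Return, for each raw feature, the list of generated column indices it owns."""
    mapping = []
    start = 0
    # Categorical features: one contiguous block of columns per feature.
    for count in category_counts:
        mapping.append(list(range(start, start + count)))
        start += count
    # Numeric features: a single column each, after all one-hot blocks.
    for _ in range(n_numeric):
        mapping.append([start])
        start += 1
    return mapping


# Assumed sizes: three categoricals with 3, 2, and 3 categories, then 2 numeric features.
example_mapping = build_raw_feat_mapping([3, 2, 3], 2)
print(example_mapping)  # [[0, 1, 2], [3, 4], [5, 6, 7], [8], [9]]
```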

In [None]:
from azureml.explain.model.tabular_explainer import TabularExplainer

explainer = TabularExplainer(clf, X_train_transformed)
global_explanation = explainer.explain_global(X_test_transformed)

In [None]:
raw_feat_imps = global_explanation.get_raw_feature_importances(raw_feat_mapping)
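
For intuition only, one simple way to roll generated-feature importances up to raw features is to sum the values within each mapped group. This is an assumption made for illustration, not a description of what `get_raw_feature_importances` does internally; the helper name and the numbers below are hypothetical.

```python
import numpy as np


def sum_importances_by_raw_feature(engineered_imps, mapping):
    """Sum generated-feature importances within each raw feature's index group."""
    engineered_imps = np.asarray(engineered_imps, dtype=float)
    return [float(engineered_imps[indices].sum()) for indices in mapping]


# Hypothetical importances for four generated columns.
engineered = [0.25, 0.5, 0.125, 0.125]
# First raw feature owns columns 0 and 1; the other two own one column each.
toy_mapping = [[0, 1], [2], [3]]

toy_raw_imps = sum_importances_by_raw_feature(engineered, toy_mapping)
print(toy_raw_imps)
```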

In [None]:
feature_names = categorical_features + numeric_features
sorted_indices = np.argsort(raw_feat_imps)[::-1]

for i in sorted_indices:
    print("{}: {}".format(feature_names[i], raw_feat_imps[i]))