MachineLearningNotebooks/contrib/fairness/fairlearn-azureml-mitigation.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.  \n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/contrib/fairness/fairlearn-azureml-mitigation.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Unfairness Mitigation with Fairlearn and Azure Machine Learning\n",
        "**This notebook shows how to upload results from Fairlearn's GridSearch mitigation algorithm into a dashboard in Azure Machine Learning Studio**\n",
        "\n",
        "## Table of Contents\n",
        "\n",
        "1. [Introduction](#Introduction)\n",
        "1. [Loading the Data](#LoadingData)\n",
        "1. [Training an Unmitigated Model](#UnmitigatedModel)\n",
        "1. [Mitigation with GridSearch](#Mitigation)\n",
        "1. [Uploading a Fairness Dashboard to Azure](#AzureUpload)\n",
        "    1. Registering models\n",
        "    1. Computing Fairness Metrics\n",
        "    1. Uploading to Azure\n",
        "1. [Conclusion](#Conclusion)\n",
        "\n",
        "<a id=\"Introduction\"></a>\n",
        "## Introduction\n",
        "This notebook shows how to use [Fairlearn (an open source fairness assessment and unfairness mitigation package)](http://fairlearn.org) and Azure Machine Learning Studio for a binary classification problem. This example uses the well-known adult census dataset. For the purposes of this notebook, we shall treat this as a loan decision problem. We will pretend that the label indicates whether or not each individual repaid a loan in the past. We will use the data to train a predictor to predict whether previously unseen individuals will repay a loan or not. The assumption is that the model predictions are used to decide whether an individual should be offered a loan. Its purpose is purely illustrative of a workflow including a fairness dashboard - in particular, we do **not** include a full discussion of the detailed issues which arise when considering fairness in machine learning. For such discussions, please [refer to the Fairlearn website](http://fairlearn.org/).\n",
        "\n",
        "We will apply the [grid search algorithm](https://fairlearn.org/v0.4.6/api_reference/fairlearn.reductions.html#fairlearn.reductions.GridSearch) from the Fairlearn package using a specific notion of fairness called Demographic Parity. This produces a set of models, and we will view these in a dashboard both locally and in the Azure Machine Learning Studio.\n",
        "\n",
        "### Setup\n",
        "\n",
        "To use this notebook, an Azure Machine Learning workspace is required.\n",
        "Please see the [configuration notebook](../../configuration.ipynb) for information about creating one, if required.\n",
        "This notebook also requires the following packages:\n",
        "* `azureml-contrib-fairness`\n",
        "* `fairlearn>=0.6.2` (pre-v0.5.0 will work with minor modifications)\n",
        "* `joblib`\n",
        "* `liac-arff`\n",
        "* `raiwidgets`\n",
        "\n",
        "Fairlearn relies on features introduced in v0.22.1 of `scikit-learn`. If you have an older version already installed, please uncomment and run the following cell:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# !pip install --upgrade scikit-learn>=0.22.1"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, please ensure that when you downloaded this notebook, you also downloaded the `fairness_nb_utils.py` file from the same location, and placed it in the same directory as this notebook."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"LoadingData\"></a>\n",
        "## Loading the Data\n",
        "We use the well-known `adult` census dataset, which we will fetch from the OpenML website. We start with a fairly unremarkable set of imports:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from fairlearn.reductions import GridSearch, DemographicParity, ErrorRate\n",
        "from raiwidgets import FairnessDashboard\n",
        "\n",
        "from sklearn.compose import ColumnTransformer\n",
        "from sklearn.impute import SimpleImputer\n",
        "from sklearn.linear_model import LogisticRegression\n",
        "from sklearn.model_selection import train_test_split\n",
        "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
        "from sklearn.compose import make_column_selector as selector\n",
        "from sklearn.pipeline import Pipeline\n",
        "\n",
        "import pandas as pd"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now load and inspect the data:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from fairness_nb_utils import fetch_census_dataset\n",
        "\n",
        "data = fetch_census_dataset()\n",
        "    \n",
        "# Extract the items we want\n",
        "X_raw = data.data\n",
        "y = (data.target == '>50K') * 1\n",
        "\n",
        "X_raw[\"race\"].value_counts().to_dict()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We are going to treat the sex and race of each individual as protected attributes, and in this particular case we are going to remove these attributes from the main data (this is not always the best option - see the [Fairlearn website](http://fairlearn.github.io/) for further discussion). Protected attributes are often denoted by 'A' in the literature, and we follow that convention here:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "A = X_raw[['sex','race']]\n",
        "X_raw = X_raw.drop(labels=['sex', 'race'], axis = 1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We now preprocess our data. To avoid the problem of data leakage, we split our data into training and test sets before performing any other transformations. Subsequent transformations (such as scalings) will be fit to the training data set, and then applied to the test dataset."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "(X_train, X_test, y_train, y_test, A_train, A_test) = train_test_split(\n",
        "    X_raw, y, A, test_size=0.3, random_state=12345, stratify=y\n",
        ")\n",
        "\n",
        "# Ensure indices are aligned between X, y and A,\n",
        "# after all the slicing and splitting of DataFrames\n",
        "# and Series\n",
        "\n",
        "X_train = X_train.reset_index(drop=True)\n",
        "X_test = X_test.reset_index(drop=True)\n",
        "y_train = y_train.reset_index(drop=True)\n",
        "y_test = y_test.reset_index(drop=True)\n",
        "A_train = A_train.reset_index(drop=True)\n",
        "A_test = A_test.reset_index(drop=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We have two types of column in the dataset - categorical columns which will need to be one-hot encoded, and numeric ones which will need to be rescaled. We also need to take care of missing values. We use a simple approach here, but please bear in mind that this is another way that bias could be introduced (especially if one subgroup tends to have more missing values).\n",
        "\n",
        "For this preprocessing, we make use of `Pipeline` objects from `sklearn`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "numeric_transformer = Pipeline(\n",
        "    steps=[\n",
        "        (\"impute\", SimpleImputer()),\n",
        "        (\"scaler\", StandardScaler()),\n",
        "    ]\n",
        ")\n",
        "\n",
        "categorical_transformer = Pipeline(\n",
        "    [\n",
        "        (\"impute\", SimpleImputer(strategy=\"most_frequent\")),\n",
        "        (\"ohe\", OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)),\n",
        "    ]\n",
        ")\n",
        "\n",
        "preprocessor = ColumnTransformer(\n",
        "    transformers=[\n",
        "        (\"num\", numeric_transformer, selector(dtype_exclude=\"category\")),\n",
        "        (\"cat\", categorical_transformer, selector(dtype_include=\"category\")),\n",
        "    ]\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, the preprocessing pipeline is defined, we can run it on our training data, and apply the generated transform to our test data:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "X_train = preprocessor.fit_transform(X_train)\n",
        "X_test = preprocessor.transform(X_test)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"UnmitigatedModel\"></a>\n",
        "## Training an Unmitigated Model\n",
        "\n",
        "So we have a point of comparison, we first train a model (specifically, logistic regression from scikit-learn) on the raw data, without applying any mitigation algorithm:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "unmitigated_predictor = LogisticRegression(solver='liblinear', fit_intercept=True)\n",
        "\n",
        "unmitigated_predictor.fit(X_train, y_train)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can view this model in the fairness dashboard, and see the disparities which appear:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "FairnessDashboard(sensitive_features=A_test,\n",
        "                  y_true=y_test,\n",
        "                  y_pred={\"unmitigated\": unmitigated_predictor.predict(X_test)})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Looking at the disparity in accuracy when we select 'Sex' as the sensitive feature, we see that males have an error rate about three times greater than the females. More interesting is the disparity in opportunitiy - males are offered loans at three times the rate of females.\n",
        "\n",
        "Despite the fact that we removed the feature from the training data, our predictor still discriminates based on sex. This demonstrates that simply ignoring a protected attribute when fitting a predictor rarely eliminates unfairness. There will generally be enough other features correlated with the removed attribute to lead to disparate impact."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"Mitigation\"></a>\n",
        "## Mitigation with GridSearch\n",
        "\n",
        "The `GridSearch` class in `Fairlearn` implements a simplified version of the exponentiated gradient reduction of [Agarwal et al. 2018](https://arxiv.org/abs/1803.02453). The user supplies a standard ML estimator, which is treated as a blackbox - for this simple example, we shall use the logistic regression estimator from scikit-learn. `GridSearch` works by generating a sequence of relabellings and reweightings, and trains a predictor for each.\n",
        "\n",
        "For this example, we specify demographic parity (on the protected attribute of sex) as the fairness metric. Demographic parity requires that individuals are offered the opportunity (a loan in this example) independent of membership in the protected class (i.e., females and males should be offered loans at the same rate). *We are using this metric for the sake of simplicity* in this example; the appropriate fairness metric can only be selected after *careful examination of the broader context* in which the model is to be used."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "sweep = GridSearch(LogisticRegression(solver='liblinear', fit_intercept=True),\n",
        "                   constraints=DemographicParity(),\n",
        "                   grid_size=71)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "With our estimator created, we can fit it to the data. After `fit()` completes, we extract the full set of predictors from the `GridSearch` object.\n",
        "\n",
        "The following cell trains a many copies of the underlying estimator, and may take a minute or two to run:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "sweep.fit(X_train, y_train,\n",
        "          sensitive_features=A_train.sex)\n",
        "\n",
        "# For Fairlearn pre-v0.5.0, need sweep._predictors\n",
        "predictors = sweep.predictors_"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We could load these predictors into the Fairness dashboard now. However, the plot would be somewhat confusing due to their number. In this case, we are going to remove the predictors which are dominated in the error-disparity space by others from the sweep (note that the disparity will only be calculated for the protected attribute; other potentially protected attributes will *not* be mitigated). In general, one might not want to do this, since there may be other considerations beyond the strict optimisation of error and disparity (of the given protected attribute)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "errors, disparities = [], []\n",
        "for predictor in predictors:\n",
        "    error = ErrorRate()\n",
        "    error.load_data(X_train, pd.Series(y_train), sensitive_features=A_train.sex)\n",
        "    disparity = DemographicParity()\n",
        "    disparity.load_data(X_train, pd.Series(y_train), sensitive_features=A_train.sex)\n",
        "    \n",
        "    errors.append(error.gamma(predictor.predict)[0])\n",
        "    disparities.append(disparity.gamma(predictor.predict).max())\n",
        "    \n",
        "all_results = pd.DataFrame( {\"predictor\": predictors, \"error\": errors, \"disparity\": disparities})\n",
        "\n",
        "dominant_models_dict = dict()\n",
        "base_name_format = \"census_gs_model_{0}\"\n",
        "row_id = 0\n",
        "for row in all_results.itertuples():\n",
        "    model_name = base_name_format.format(row_id)\n",
        "    errors_for_lower_or_eq_disparity = all_results[\"error\"][all_results[\"disparity\"]<=row.disparity]\n",
        "    if row.error <= errors_for_lower_or_eq_disparity.min():\n",
        "        dominant_models_dict[model_name] = row.predictor\n",
        "    row_id = row_id + 1"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can construct predictions for the dominant models (we include the unmitigated predictor as well, for comparison):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "predictions_dominant = {\"census_unmitigated\": unmitigated_predictor.predict(X_test)}\n",
        "models_dominant = {\"census_unmitigated\": unmitigated_predictor}\n",
        "for name, predictor in dominant_models_dict.items():\n",
        "    value = predictor.predict(X_test)\n",
        "    predictions_dominant[name] = value\n",
        "    models_dominant[name] = predictor"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "These predictions may then be viewed in the fairness dashboard. We include the race column from the dataset, as an alternative basis for assessing the models. However, since we have not based our mitigation on it, the variation in the models with respect to race can be large."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "FairnessDashboard(sensitive_features=A_test, \n",
        "                  y_true=y_test.tolist(),\n",
        "                  y_pred=predictions_dominant)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "When using sex as the sensitive feature and accuracy as the metric, we see a Pareto front forming - the set of predictors which represent optimal tradeoffs between accuracy and disparity in predictions. In the ideal case, we would have a predictor at (1,0) - perfectly accurate and without any unfairness under demographic parity (with respect to the protected attribute \"sex\"). The Pareto front represents the closest we can come to this ideal based on our data and choice of estimator. Note the range of the axes - the disparity axis covers more values than the accuracy, so we can reduce disparity substantially for a small loss in accuracy. Finally, we also see that the unmitigated model is towards the top right of the plot, with high accuracy, but worst disparity.\n",
        "\n",
        "By clicking on individual models on the plot, we can inspect their metrics for disparity and accuracy in greater detail. In a real example, we would then pick the model which represented the best trade-off between accuracy and disparity given the relevant business constraints."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"AzureUpload\"></a>\n",
        "## Uploading a Fairness Dashboard to Azure\n",
        "\n",
        "Uploading a fairness dashboard to Azure is a two stage process. The `FairnessDashboard` invoked in the previous section relies on the underlying Python kernel to compute metrics on demand. This is obviously not available when the fairness dashboard is rendered in AzureML Studio. By default, the dashboard in Azure Machine Learning Studio also requires the models to be registered. The required stages are therefore:\n",
        "1. Register the dominant models\n",
        "1. Precompute all the required metrics\n",
        "1. Upload to Azure\n",
        "\n",
        "Before that, we need to connect to Azure Machine Learning Studio:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core import Workspace, Experiment, Model\n",
        "\n",
        "ws = Workspace.from_config()\n",
        "ws.get_details()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"RegisterModels\"></a>\n",
        "### Registering Models\n",
        "\n",
        "The fairness dashboard is designed to integrate with registered models, so we need to do this for the models we want in the Studio portal. The assumption is that the names of the models specified in the dashboard dictionary correspond to the `id`s (i.e. `<name>:<version>` pairs) of registered models in the workspace. We register each of the models in the `models_dominant` dictionary into the workspace. For this, we have to save each model to a file, and then register that file:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import joblib\n",
        "import os\n",
        "\n",
        "os.makedirs('models', exist_ok=True)\n",
        "def register_model(name, model):\n",
        "    print(\"Registering \", name)\n",
        "    model_path = \"models/{0}.pkl\".format(name)\n",
        "    joblib.dump(value=model, filename=model_path)\n",
        "    registered_model = Model.register(model_path=model_path,\n",
        "                                      model_name=name,\n",
        "                                      workspace=ws)\n",
        "    print(\"Registered \", registered_model.id)\n",
        "    return registered_model.id\n",
        "\n",
        "model_name_id_mapping = dict()\n",
        "for name, model in models_dominant.items():\n",
        "    m_id = register_model(name, model)\n",
        "    model_name_id_mapping[name] = m_id"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, produce new predictions dictionaries, with the updated names:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "predictions_dominant_ids = dict()\n",
        "for name, y_pred in predictions_dominant.items():\n",
        "    predictions_dominant_ids[model_name_id_mapping[name]] = y_pred"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"PrecomputeMetrics\"></a>\n",
        "### Precomputing Metrics\n",
        "\n",
        "We create a _dashboard dictionary_ using Fairlearn's `metrics` package. The `_create_group_metric_set` method has arguments similar to the Dashboard constructor, except that the sensitive features are passed as a dictionary (to ensure that names are available), and we must specify the type of prediction. Note that we use the `predictions_dominant_ids` dictionary we just created:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "sf = { 'sex': A_test.sex, 'race': A_test.race }\n",
        "\n",
        "from fairlearn.metrics._group_metric_set import _create_group_metric_set\n",
        "\n",
        "\n",
        "dash_dict = _create_group_metric_set(y_true=y_test,\n",
        "                                     predictions=predictions_dominant_ids,\n",
        "                                     sensitive_features=sf,\n",
        "                                     prediction_type='binary_classification')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"DashboardUpload\"></a>\n",
        "### Uploading the Dashboard\n",
        "\n",
        "Now, we import our `contrib` package which contains the routine to perform the upload:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.contrib.fairness import upload_dashboard_dictionary, download_dashboard_by_upload_id"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we can create an Experiment, then a Run, and upload our dashboard to it:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "exp = Experiment(ws, \"Test_Fairlearn_GridSearch_Census_Demo\")\n",
        "print(exp)\n",
        "\n",
        "run = exp.start_logging()\n",
        "try:\n",
        "    dashboard_title = \"Dominant Models from GridSearch\"\n",
        "    upload_id = upload_dashboard_dictionary(run,\n",
        "                                            dash_dict,\n",
        "                                            dashboard_name=dashboard_title)\n",
        "    print(\"\\nUploaded to id: {0}\\n\".format(upload_id))\n",
        "\n",
        "    downloaded_dict = download_dashboard_by_upload_id(run, upload_id)\n",
        "finally:\n",
        "    run.complete()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The dashboard can be viewed in the Run Details page.\n",
        "\n",
        "Finally, we can verify that the dashboard dictionary which we downloaded matches our upload:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print(dash_dict == downloaded_dict)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"Conclusion\"></a>\n",
        "## Conclusion\n",
        "\n",
        "In this notebook we have demonstrated how to use the `GridSearch` algorithm from Fairlearn to generate a collection of models, and then present them in the fairness dashboard in Azure Machine Learning Studio. Please remember that this notebook has not attempted to discuss the many considerations which should be part of any approach to unfairness mitigation. The [Fairlearn website](http://fairlearn.org/) provides that discussion"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": []
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "riedgar"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.8 - AzureML",
      "language": "python",
      "name": "python38-azureml"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.10"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}