{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Impute missing values\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Azure ML Data Prep can impute missing values in specified columns. In this notebook, we impute the missing _Latitude_ and _Longitude_ values in the input data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import azureml.dataprep as dprep"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# loading input data\n",
        "dflow = dprep.read_csv(path='../data/crime-spring.csv')\n",
        "dflow = dflow.keep_columns(['ID', 'Arrest', 'Latitude', 'Longitude'])\n",
        "dflow = dflow.to_number(['Latitude', 'Longitude'])\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The third record in the input data is missing both _Latitude_ and _Longitude_. To impute those values, we can use `ImputeMissingValuesBuilder` to learn a fixed program that fills each column with either a calculated `MIN`, `MAX`, or `MEAN` value or a `CUSTOM` value. When `group_by_columns` is specified, `MIN`, `MAX`, and `MEAN` are calculated per group, and missing values are imputed group by group."
      ]
    },
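    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the per-group behavior concrete, the next cell is a minimal sketch in plain Python rather than Data Prep. It uses a few made-up records (the values are illustrative and are not taken from `crime-spring.csv`), fills a missing _Latitude_ with the mean of its `Arrest` group, and fills a missing _Longitude_ with a fixed custom value. The names `rows`, `known`, `group_mean`, and `imputed` are hypothetical; the cell only illustrates the idea and does not show how `ImputeMissingValuesBuilder` is implemented."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from collections import defaultdict\n",
        "from statistics import mean\n",
        "\n",
        "# Hypothetical toy records; values are illustrative only.\n",
        "rows = [\n",
        "    {'Arrest': 'FALSE', 'Latitude': 41.0, 'Longitude': -87.5},\n",
        "    {'Arrest': 'FALSE', 'Latitude': 43.0, 'Longitude': -87.7},\n",
        "    {'Arrest': 'FALSE', 'Latitude': None, 'Longitude': None},\n",
        "    {'Arrest': 'TRUE', 'Latitude': 40.0, 'Longitude': -87.6},\n",
        "]\n",
        "\n",
        "# Collect the known Latitude values of each Arrest group.\n",
        "known = defaultdict(list)\n",
        "for row in rows:\n",
        "    if row['Latitude'] is not None:\n",
        "        known[row['Arrest']].append(row['Latitude'])\n",
        "group_mean = {group: mean(values) for group, values in known.items()}\n",
        "\n",
        "# Fill a missing Latitude with its group's mean and a missing Longitude with a constant.\n",
        "imputed = [\n",
        "    {**row,\n",
        "     'Latitude': group_mean[row['Arrest']] if row['Latitude'] is None else row['Latitude'],\n",
        "     'Longitude': 42 if row['Longitude'] is None else row['Longitude']}\n",
        "    for row in rows\n",
        "]\n",
        "imputed[2]"
      ]
    },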
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First, let us quickly check the `MEAN` value of the _Latitude_ column for the `Arrest == 'FALSE'` group."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow_mean = dflow.summarize(\n",
        "    group_by_columns=['Arrest'],\n",
        "    summary_columns=[\n",
        "        dprep.SummaryColumnsValue(\n",
        "            column_id='Latitude',\n",
        "            summary_column_name='Latitude_MEAN',\n",
        "            summary_function=dprep.SummaryFunction.MEAN)])\n",
        "dflow_mean = dflow_mean.filter(dprep.col('Arrest') == 'FALSE')\n",
        "dflow_mean.head(1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The `MEAN` value of _Latitude_ looks reasonable, so we will impute _Latitude_ with it. For _Longitude_, we will impute the custom value `42`, chosen based on external knowledge."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# impute with MEAN\n",
        "impute_mean = dprep.ImputeColumnArguments(column_id='Latitude',\n",
        "                                          impute_function=dprep.ReplaceValueFunction.MEAN)\n",
        "# impute with custom value 42\n",
        "impute_custom = dprep.ImputeColumnArguments(column_id='Longitude',\n",
        "                                            custom_impute_value=42)\n",
        "# get instance of ImputeMissingValuesBuilder\n",
        "impute_builder = dflow.builders.impute_missing_values(\n",
        "    impute_columns=[impute_mean, impute_custom],\n",
        "    group_by_columns=['Arrest'])\n",
        "# call learn() to learn a fixed program to impute missing values\n",
        "impute_builder.learn()\n",
        "# call to_dataflow() to get a dataflow with impute step added\n",
        "dflow_imputed = impute_builder.to_dataflow()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# check impute result\n",
        "dflow_imputed.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As the result above shows, the missing _Latitude_ has been imputed with the `MEAN` value of the `Arrest == 'FALSE'` group, and the missing _Longitude_ has been imputed with `42`."
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "sihhu"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.4"
    },
    "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
  },
  "nbformat": 4,
  "nbformat_minor": 2
}