{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/impute-missing-values.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Impute missing values\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Azure ML Data Prep can impute missing values in specified columns. In this example, we impute the missing _Latitude_ and _Longitude_ values in the input data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import azureml.dataprep as dprep" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load the input data, keep the relevant columns, and convert the coordinates to numbers\n", "dflow = dprep.read_csv(path='../data/crime-spring.csv')\n", "dflow = dflow.keep_columns(['ID', 'Arrest', 'Latitude', 'Longitude'])\n", "dflow = dflow.to_number(['Latitude', 'Longitude'])\n", "dflow.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The third record in the input data is missing _Latitude_ and _Longitude_. To impute those values, we can use `ImputeMissingValuesBuilder` to learn a fixed program that imputes the columns with either a calculated `MIN`, `MAX`, or `MEAN` value or a `CUSTOM` value. When `group_by_columns` is specified, missing values are imputed by group, with `MIN`, `MAX`, and `MEAN` calculated per group." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let us quickly check the `MEAN` value of the _Latitude_ column for the group that contains the incomplete record." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compute the per-group MEAN of Latitude, then keep only the Arrest == 'FALSE' group\n", "dflow_mean = dflow.summarize(group_by_columns=['Arrest'],\n", " summary_columns=[dprep.SummaryColumnsValue(column_id='Latitude',\n", " summary_column_name='Latitude_MEAN',\n", " summary_function=dprep.SummaryFunction.MEAN)])\n", "dflow_mean = dflow_mean.filter(dprep.col('Arrest') == 'FALSE')\n", "dflow_mean.head(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `MEAN` value of _Latitude_ looks reasonable, so we will impute _Latitude_ with it. As for _Longitude_, we will impute it with the custom value `42`, based on external knowledge." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# impute with MEAN\n", "impute_mean = dprep.ImputeColumnArguments(column_id='Latitude',\n", " impute_function=dprep.ReplaceValueFunction.MEAN)\n", "# impute with custom value 42\n", "impute_custom = dprep.ImputeColumnArguments(column_id='Longitude',\n", " custom_impute_value=42)\n", "# get instance of ImputeMissingValuesBuilder\n", "impute_builder = dflow.builders.impute_missing_values(impute_columns=[impute_mean, impute_custom],\n", " group_by_columns=['Arrest'])\n", "# call learn() to learn a fixed program to impute missing values\n", "impute_builder.learn()\n", "# call to_dataflow() to get a dataflow with impute step added\n", "dflow_imputed = impute_builder.to_dataflow()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check impute result\n", "dflow_imputed.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the result above shows, the missing _Latitude_ has been imputed with the `MEAN` value of the `Arrest == 'FALSE'` group, and the missing _Longitude_ has been imputed with `42`." 
] } ], "metadata": { "authors": [ { "name": "sihhu" } ], "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License." }, "nbformat": 4, "nbformat_minor": 2 }