{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Impute missing values\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Azure ML Data Prep can impute missing values in specified columns. In this notebook, we impute the missing _Latitude_ and _Longitude_ values in the input data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# loading input data\n",
"dflow = dprep.read_csv(path='../data/crime-spring.csv')\n",
"dflow = dflow.keep_columns(['ID', 'Arrest', 'Latitude', 'Longitude'])\n",
"dflow = dflow.to_number(['Latitude', 'Longitude'])\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The third record in the input data is missing both _Latitude_ and _Longitude_. To impute those values, we can use `ImputeMissingValuesBuilder` to learn a fixed program that fills each column with either a calculated `MIN`, `MAX`, or `MEAN` value or a `CUSTOM` value. When `group_by_columns` is specified, missing values are imputed per group, with `MIN`, `MAX`, and `MEAN` calculated within each group."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let us quickly check the `MEAN` value of the _Latitude_ column for the `Arrest == 'FALSE'` group."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_mean = dflow.summarize(\n",
"    group_by_columns=['Arrest'],\n",
"    summary_columns=[dprep.SummaryColumnsValue(\n",
"        column_id='Latitude',\n",
"        summary_column_name='Latitude_MEAN',\n",
"        summary_function=dprep.SummaryFunction.MEAN)])\n",
"dflow_mean = dflow_mean.filter(dprep.col('Arrest') == 'FALSE')\n",
"dflow_mean.head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `MEAN` value of _Latitude_ looks reasonable, so we will impute _Latitude_ with it. For _Longitude_, we will impute the custom value `42`, based on external knowledge."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# impute with MEAN\n",
"impute_mean = dprep.ImputeColumnArguments(\n",
"    column_id='Latitude',\n",
"    impute_function=dprep.ReplaceValueFunction.MEAN)\n",
"# impute with custom value 42\n",
"impute_custom = dprep.ImputeColumnArguments(\n",
"    column_id='Longitude',\n",
"    custom_impute_value=42)\n",
"# get instance of ImputeMissingValuesBuilder\n",
"impute_builder = dflow.builders.impute_missing_values(\n",
"    impute_columns=[impute_mean, impute_custom],\n",
"    group_by_columns=['Arrest'])\n",
"# call learn() to learn a fixed program to impute missing values\n",
"impute_builder.learn()\n",
"# call to_dataflow() to get a dataflow with impute step added\n",
"dflow_imputed = impute_builder.to_dataflow()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# check impute result\n",
"dflow_imputed.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the result above shows, the missing _Latitude_ has been imputed with the `MEAN` value of the `Arrest == 'FALSE'` group, and the missing _Longitude_ has been imputed with `42`."
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}