mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-21 10:05:09 -05:00
147 lines
5.0 KiB
Plaintext
147 lines
5.0 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Impute missing values\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Azure ML Data Prep has the ability to impute missing values in specified columns. In this case, we will attempt to impute the missing _Latitude_ and _Longitude_ values in the input data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# loading input data\n",
|
|
"dflow = dprep.read_csv(path= '../data/crime-spring.csv')\n",
|
|
"dflow = dflow.keep_columns(['ID', 'Arrest', 'Latitude', 'Longitude'])\n",
|
|
"dflow = dflow.to_number(['Latitude', 'Longitude'])\n",
|
|
"dflow.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The third record from input data has _Latitude_ and _Longitude_ missing. To impute those missing values, we can use `ImputeMissingValuesBuilder` to learn a fixed program which imputes the columns with either a calculated `MIN`, `MAX` or `MEAN` value or a `CUSTOM` value. When `group_by_columns` is specified, missing values will be imputed by group with `MIN`, `MAX` and `MEAN` calculated per group."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Firstly, let us quickly see check the `MEAN` value of _Latitude_ column."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_mean = dflow.summarize(group_by_columns=['Arrest'],\n",
|
|
" summary_columns=[dprep.SummaryColumnsValue(column_id='Latitude',\n",
|
|
" summary_column_name='Latitude_MEAN',\n",
|
|
" summary_function=dprep.SummaryFunction.MEAN)])\n",
|
|
"dflow_mean = dflow_mean.filter(dprep.col('Arrest') == 'FALSE')\n",
|
|
"dflow_mean.head(1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The `MEAN` value of _Latitude_ looks good. So we will impute _Latitude_ with it. As for `Longitude`, we will impute it using `42` based on external knowledge."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# impute with MEAN\n",
|
|
"impute_mean = dprep.ImputeColumnArguments(column_id='Latitude',\n",
|
|
" impute_function=dprep.ReplaceValueFunction.MEAN)\n",
|
|
"# impute with custom value 42\n",
|
|
"impute_custom = dprep.ImputeColumnArguments(column_id='Longitude',\n",
|
|
" custom_impute_value=42)\n",
|
|
"# get instance of ImputeMissingValuesBuilder\n",
|
|
"impute_builder = dflow.builders.impute_missing_values(impute_columns=[impute_mean, impute_custom],\n",
|
|
" group_by_columns=['Arrest'])\n",
|
|
"# call learn() to learn a fixed program to impute missing values\n",
|
|
"impute_builder.learn()\n",
|
|
"# call to_dataflow() to get a dataflow with impute step added\n",
|
|
"dflow_imputed = impute_builder.to_dataflow()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# check impute result\n",
|
|
"dflow_imputed.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"As the result above, the missing _Latitude_ has been imputed with the `MEAN` value of `Arrest=='false'` group, and the missing _Longitude_ has been imputed with `42`."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "sihhu"
|
|
}
|
|
],
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.4"
|
|
},
|
|
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
} |