Delete impute-missing-values.ipynb
This commit is contained in:
@@ -1,148 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Impute missing values\n",
|
||||
"Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
|
||||
"Licensed under the MIT License."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Azure ML Data Prep has the ability to impute missing values in specified columns. In this case, we will attempt to impute the missing _Latitude_ and _Longitude_ values in the input data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# loading input data\n",
|
||||
"dflow = dprep.read_csv(path= '../data/crime-spring.csv')\n",
|
||||
"dflow = dflow.keep_columns(['ID', 'Arrest', 'Latitude', 'Longitude'])\n",
|
||||
"dflow = dflow.to_number(['Latitude', 'Longitude'])\n",
|
||||
"dflow.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The third record from input data has _Latitude_ and _Longitude_ missing. To impute those missing values, we can use `ImputeMissingValuesBuilder` to learn a fixed program which imputes the columns with either a calculated `MIN`, `MAX` or `MEAN` value or a `CUSTOM` value. When `group_by_columns` is specified, missing values will be imputed by group with `MIN`, `MAX` and `MEAN` calculated per group."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Firstly, let us quickly see check the `MEAN` value of _Latitude_ column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_mean = dflow.summarize(group_by_columns=['Arrest'],\n",
|
||||
" summary_columns=[dprep.SummaryColumnsValue(column_id='Latitude',\n",
|
||||
" summary_column_name='Latitude_MEAN',\n",
|
||||
" summary_function=dprep.SummaryFunction.MEAN)])\n",
|
||||
"dflow_mean = dflow_mean.filter(dprep.col('Arrest') == 'FALSE')\n",
|
||||
"dflow_mean.head(1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The `MEAN` value of _Latitude_ looks good. So we will impute _Latitude_ with it. As for `Longitude`, we will impute it using `42` based on external knowledge."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# impute with MEAN\n",
|
||||
"impute_mean = dprep.ImputeColumnArguments(column_id='Latitude',\n",
|
||||
" impute_function=dprep.ReplaceValueFunction.MEAN)\n",
|
||||
"# impute with custom value 42\n",
|
||||
"impute_custom = dprep.ImputeColumnArguments(column_id='Longitude',\n",
|
||||
" custom_impute_value=42)\n",
|
||||
"# get instance of ImputeMissingValuesBuilder\n",
|
||||
"impute_builder = dflow.builders.impute_missing_values(impute_columns=[impute_mean, impute_custom],\n",
|
||||
" group_by_columns=['Arrest'])\n",
|
||||
"# call learn() to learn a fixed program to impute missing values\n",
|
||||
"impute_builder.learn()\n",
|
||||
"# call to_dataflow() to get a dataflow with impute step added\n",
|
||||
"dflow_imputed = impute_builder.to_dataflow()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# check impute result\n",
|
||||
"dflow_imputed.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As the result above, the missing _Latitude_ has been imputed with the `MEAN` value of `Arrest=='false'` group, and the missing _Longitude_ has been imputed with `42`."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sihhu"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
Reference in New Issue
Block a user