diff --git a/how-to-use-azureml/work-with-data/dataprep/how-to-guides/impute-missing-values.ipynb b/how-to-use-azureml/work-with-data/dataprep/how-to-guides/impute-missing-values.ipynb deleted file mode 100644 index f86a07f3..00000000 --- a/how-to-use-azureml/work-with-data/dataprep/how-to-guides/impute-missing-values.ipynb +++ /dev/null @@ -1,148 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/impute-missing-values.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Impute missing values\n", - "Copyright (c) Microsoft Corporation. All rights reserved.
\n", - "Licensed under the MIT License." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Azure ML Data Prep has the ability to impute missing values in specified columns. In this case, we will attempt to impute the missing _Latitude_ and _Longitude_ values in the input data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# loading input data\n", - "dflow = dprep.read_csv(path= '../data/crime-spring.csv')\n", - "dflow = dflow.keep_columns(['ID', 'Arrest', 'Latitude', 'Longitude'])\n", - "dflow = dflow.to_number(['Latitude', 'Longitude'])\n", - "dflow.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The third record from input data has _Latitude_ and _Longitude_ missing. To impute those missing values, we can use `ImputeMissingValuesBuilder` to learn a fixed program which imputes the columns with either a calculated `MIN`, `MAX` or `MEAN` value or a `CUSTOM` value. When `group_by_columns` is specified, missing values will be imputed by group with `MIN`, `MAX` and `MEAN` calculated per group." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Firstly, let us quickly see check the `MEAN` value of _Latitude_ column." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_mean = dflow.summarize(group_by_columns=['Arrest'],\n", - " summary_columns=[dprep.SummaryColumnsValue(column_id='Latitude',\n", - " summary_column_name='Latitude_MEAN',\n", - " summary_function=dprep.SummaryFunction.MEAN)])\n", - "dflow_mean = dflow_mean.filter(dprep.col('Arrest') == 'FALSE')\n", - "dflow_mean.head(1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `MEAN` value of _Latitude_ looks good. So we will impute _Latitude_ with it. As for `Longitude`, we will impute it using `42` based on external knowledge." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# impute with MEAN\n", - "impute_mean = dprep.ImputeColumnArguments(column_id='Latitude',\n", - " impute_function=dprep.ReplaceValueFunction.MEAN)\n", - "# impute with custom value 42\n", - "impute_custom = dprep.ImputeColumnArguments(column_id='Longitude',\n", - " custom_impute_value=42)\n", - "# get instance of ImputeMissingValuesBuilder\n", - "impute_builder = dflow.builders.impute_missing_values(impute_columns=[impute_mean, impute_custom],\n", - " group_by_columns=['Arrest'])\n", - "# call learn() to learn a fixed program to impute missing values\n", - "impute_builder.learn()\n", - "# call to_dataflow() to get a dataflow with impute step added\n", - "dflow_imputed = impute_builder.to_dataflow()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# check impute result\n", - "dflow_imputed.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As the result above, the missing _Latitude_ has been imputed with the `MEAN` value of `Arrest=='false'` group, and the missing _Longitude_ has been imputed with `42`." - ] - } - ], - "metadata": { - "authors": [ - { - "name": "sihhu" - } - ], - "kernelspec": { - "display_name": "Python 3.6", - "language": "python", - "name": "python36" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} \ No newline at end of file