MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/min-max-scaler.ipynb
2019-06-26 14:39:09 -04:00


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/min-max-scaler.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Min-Max Scaler\n",
"Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The min-max scaler scales all values in a column to a desired range (typically [0, 1]). This is also known as feature scaling or unity-based normalization. Min-max scaling is commonly used to normalize numeric columns in a data set for machine learning algorithms."
]
},
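{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a plain-Python illustration of the underlying formula (for reference only; this is not part of the `azureml.dataprep` API), each value `x` is mapped to `(x - data_min) / (data_max - data_min) * (range_max - range_min) + range_min`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reference-only sketch of min-max scaling; the dataprep min_max_scale step\n",
"# applies the same transformation per column after learning data_min/data_max.\n",
"def min_max(values, range_min=0.0, range_max=1.0):\n",
"    data_min, data_max = min(values), max(values)\n",
"    scale = (range_max - range_min) / (data_max - data_min)\n",
"    return [(v - data_min) * scale + range_min for v in values]\n",
"\n",
"min_max([6, 11, 16, 21, 26])  # [0.0, 0.25, 0.5, 0.75, 1.0]"
]
},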
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, load a data set containing information about crime in Chicago. Keep only a few columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.read_csv('../data/crime-spring.csv')\n",
"dflow = dflow.keep_columns(columns=['ID', 'District', 'FBI Code'])\n",
"dflow = dflow.to_number(columns=['District', 'FBI Code'])\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using `get_profile()`, you can inspect summary statistics for the numeric columns, such as the minimum, maximum, value count, and number of error values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To apply min-max scaling, call `min_max_scale` on the Dataflow and specify the column name. This triggers a full data scan over the column to determine its min and max values and then performs the scaling. Note that the learned min and max are captured in the dataflow step at this point; if you run the same dataflow steps over a different dataset, re-execute the min-max scaler to learn new values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_district = dflow.min_max_scale(column='District')\n",
"dflow_district.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at the data profile to see that the \"District\" column is now scaled; the min is 0 and the max is 1. Any error values and missing values from the source column are preserved."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_district.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also specify a custom range for the scaling. Instead of [0, 1], let's choose [-10, 10]."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_district_range = dflow.min_max_scale(column='District', range_min=-10, range_max=10)\n",
"dflow_district_range.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In some cases, you may want to provide the min and max of the source column manually, for example, to avoid a full data scan of a large dataset whose min and max are already known. Pass the known values to `min_max_scale` and the column will be scaled using them. If you supply only one of the two, the other is still learned from the data: scaling the `FBI Code` column with `data_min=6` maps 6 to `range_min` (0), while a data scan determines `data_max`, which maps to `range_max` (1)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_fbi = dflow.min_max_scale(column='FBI Code', data_min=6)\n",
"dflow_fbi.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using a Min-Max Scaler builder"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more flexibility when constructing the arguments for min-max scaling, you can use a Min-Max Scaler builder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"builder = dflow.builders.min_max_scale(column='District')\n",
"builder"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calling `builder.learn()` triggers a full data scan to determine `data_min` and `data_max`. You can then either keep these learned values or set custom ones."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"builder.learn()\n",
"builder"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to provide custom values for any of the arguments, you can update the builder object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"builder.range_max = 10\n",
"builder.data_min = 6\n",
"builder"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you are satisfied with the arguments, call `builder.to_dataflow()` to get the resulting Dataflow. Note that the min and max values of the source column are preserved by the builder at this point. If you need to re-learn the true `data_min` and `data_max`, set those arguments on the builder back to `None` and call `builder.learn()` again."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_builder = builder.to_dataflow()\n",
"dflow_builder.head(5)"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}