mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-23 11:02:39 -05:00
563 lines
16 KiB
Plaintext
563 lines
16 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Column Manipulations\n",
|
|
"\n",
|
|
"Azure ML Data Prep has many methods for manipulating columns, including basic CUD operations and several other more complex manipulations.\n",
|
|
"\n",
|
|
"This notebook will focus primarily on data-agnostic operations. For all other column manipulation operations, we will link to their specific how-to guide.\n",
|
|
"\n",
|
|
"## Table of Contents\n",
|
|
"[ColumnSelector](#ColumnSelector)<br>\n",
|
|
"[add_column](#add_column)<br>\n",
|
|
"[append_columns](#append_columns)<br>\n",
|
|
"[drop_columns](#drop_columns)<br>\n",
|
|
"[duplicate_column](#duplicate_column)<br>\n",
|
|
"[fuzzy_group_column](#fuzzy_group_column)<br>\n",
|
|
"[keep_columns](#keep_columns)<br>\n",
|
|
"[map_column](#map_column)<br>\n",
|
|
"[new_script_column](#new_script_column)<br>\n",
|
|
"[rename_columns](#rename_columns)<br>\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"ColumnSelector\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## ColumnSelector\n",
|
|
"`ColumnSelector` is a Data Prep class that allows us to select columns by name. The idea is to be able to describe columns generally instead of explicitly, using a search term or regex expression, with various options.\n",
|
|
"\n",
|
|
"Note that a `ColumnSelector` does not represent the columns they match themselves, but the selector of the described columns. Therefore if we use the same `ColumnSelector` on two different dataflows, we may get different results depending on the columns of each dataflow.\n",
|
|
"\n",
|
|
"Column manipulations that can utilize `ColumnSelector` will be noted in their respective sections in this book."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.dataprep import auto_read_file\n",
|
|
"dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"All parameters to a `ColumnSelector` are shown here for completeness. We will use `keep_columns` in our example, which will keep only the columns in the dataflow that we tell it to keep.\n",
|
|
"\n",
|
|
"In the below example, we match all columns with the letter 'i'. Because we set `ignore_case` to false and `match_whole_word` to false, then any column that contains 'i' or 'I' will be selected."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.dataprep import ColumnSelector\n",
|
|
"column_selector = ColumnSelector(term=\"i\",\n",
|
|
" use_regex=False,\n",
|
|
" ignore_case=True,\n",
|
|
" match_whole_word=False,\n",
|
|
" invert=False)\n",
|
|
"dflow_selected = dflow.keep_columns(column_selector)\n",
|
|
"dflow_selected.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"If we set `invert` to true, we get the opposite of what we matched earlier."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"column_selector = ColumnSelector(term=\"i\",\n",
|
|
" use_regex=False,\n",
|
|
" ignore_case=True,\n",
|
|
" match_whole_word=False,\n",
|
|
" invert=True)\n",
|
|
"dflow_selected = dflow.keep_columns(column_selector)\n",
|
|
"dflow_selected.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"If we change the search term to 'I' and set case sensitivity to true, we get only the handful of columns that contain an upper case 'I'."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"column_selector = ColumnSelector(term=\"I\",\n",
|
|
" use_regex=False,\n",
|
|
" ignore_case=False,\n",
|
|
" match_whole_word=False,\n",
|
|
" invert=False)\n",
|
|
"dflow_selected = dflow.keep_columns(column_selector)\n",
|
|
"dflow_selected.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"And if we set `match_whole_word` to true, we get no results at all as there is no column called 'I'."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"column_selector = ColumnSelector(term=\"I\",\n",
|
|
" use_regex=False,\n",
|
|
" ignore_case=False,\n",
|
|
" match_whole_word=True,\n",
|
|
" invert=False)\n",
|
|
"dflow_selected = dflow.keep_columns(column_selector)\n",
|
|
"dflow_selected.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Finally, the `use_regex` flag dictates whether or not to treat the search term as a regex. It can be combined still with the other options.\n",
|
|
"\n",
|
|
"Here we define all columns that begin with the capital letter 'I'."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"column_selector = ColumnSelector(term=\"I.*\",\n",
|
|
" use_regex=True,\n",
|
|
" ignore_case=True,\n",
|
|
" match_whole_word=True,\n",
|
|
" invert=False)\n",
|
|
"dflow_selected = dflow.keep_columns(column_selector)\n",
|
|
"dflow_selected.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"add_column\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## add_column"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Please see [add-column-using-expression](add-column-using-expression.ipynb)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"append_columns\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## append_columns"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Please see [append-columns-and-rows](append-columns-and-rows.ipynb)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"drop_columns\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## drop_columns"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Data Prep supports dropping columns one or more columns in a single statement. Supports `ColumnSelector`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.dataprep import auto_read_file\n",
|
|
"dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Note that there are 22 columns to begin with. We will now drop the 'ID' column and observe that the resulting dataflow contains 21 columns."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_dropped = dflow.drop_columns('ID')\n",
|
|
"dflow_dropped.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We can also drop more than one column at once by passing a list of column names."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_dropped = dflow_dropped.drop_columns(['IUCR', 'Description'])\n",
|
|
"dflow_dropped.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"duplicate_column\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## duplicate_column"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Data Prep supports duplicating columns one or more columns in a single statement.\n",
|
|
"\n",
|
|
"Duplicated columns are placed to the immediate right of their source column."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.dataprep import auto_read_file\n",
|
|
"dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We decide which column(s) to duplicate and what the new column name(s) should be with a key value pairing (dictionary)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_dupe = dflow.duplicate_column({'ID': 'ID2', 'IUCR': 'IUCR_Clone'})\n",
|
|
"dflow_dupe.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"fuzzy_group_column\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## fuzzy_group_column"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Please see [fuzzy-group](fuzzy-group.ipynb)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"keep_columns\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## keep_columns"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Data Prep supports keeping one or more columns in a single statement. The resulting dataflow will contain only the column(s) specified; dropping all the other columns. Supports `ColumnSelector`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.dataprep import auto_read_file\n",
|
|
"dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_keep = dflow.keep_columns(['ID', 'Date', 'Description'])\n",
|
|
"dflow_keep.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Similar to `drop_columns`, we can pass a single column name or a list of them."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_keep = dflow_keep.keep_columns('ID')\n",
|
|
"dflow_keep.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"map_column\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## map_column"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Data Prep supports string mapping. For a column containing strings, we can provide specific mappings from an original value to a new value, and then produce a new column that contains the mapped values.\n",
|
|
"\n",
|
|
"The mapped columns are placed to the immediate right of their source column."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.dataprep import auto_read_file\n",
|
|
"dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.dataprep import ReplacementsValue\n",
|
|
"replacements = [ReplacementsValue('THEFT', 'THEFT2'), ReplacementsValue('BATTERY', 'BATTERY!!!')]\n",
|
|
"dflow_mapped = dflow.map_column(column='Primary Type', \n",
|
|
" new_column_id='Primary Type V2',\n",
|
|
" replacements=replacements)\n",
|
|
"dflow_mapped.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"new_script_column\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## new_script_column"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Please see [custom-python-transforms](custom-python-transforms.ipynb)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"rename_columns\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## rename_columns"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Data Prep supports renaming one or more columns in a single statement."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.dataprep import auto_read_file\n",
|
|
"dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We decide which column(s) to rename and what the new column name(s) should be with a key value pairing (dictionary)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_renamed = dflow.rename_columns({'ID': 'ID2', 'IUCR': 'IUCR_Clone'})\n",
|
|
"dflow_renamed.head(5)"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "sihhu"
|
|
}
|
|
],
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.4"
|
|
},
|
|
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
} |