diff --git a/work-with-data/dataprep/how-to-guides/column-manipulations.ipynb b/work-with-data/dataprep/how-to-guides/column-manipulations.ipynb deleted file mode 100644 index bf1836f9..00000000 --- a/work-with-data/dataprep/how-to-guides/column-manipulations.ipynb +++ /dev/null @@ -1,563 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/column-manipulations.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Column Manipulations\n", - "\n", - "Azure ML Data Prep has many methods for manipulating columns, including basic CUD operations and several other more complex manipulations.\n", - "\n", - "This notebook will focus primarily on data-agnostic operations. For all other column manipulation operations, we will link to their specific how-to guide.\n", - "\n", - "## Table of Contents\n", - "[ColumnSelector](#ColumnSelector)
\n", - "[add_column](#add_column)
\n", - "[append_columns](#append_columns)
\n", - "[drop_columns](#drop_columns)
\n", - "[duplicate_column](#duplicate_column)
\n", - "[fuzzy_group_column](#fuzzy_group_column)
\n", - "[keep_columns](#keep_columns)
\n", - "[map_column](#map_column)
\n", - "[new_script_column](#new_script_column)
\n", - "[rename_columns](#rename_columns)
\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## ColumnSelector\n", - "`ColumnSelector` is a Data Prep class that allows us to select columns by name. The idea is to be able to describe columns generally instead of explicitly, using a search term or regex expression, with various options.\n", - "\n", - "Note that a `ColumnSelector` does not represent the columns they match themselves, but the selector of the described columns. Therefore if we use the same `ColumnSelector` on two different dataflows, we may get different results depending on the columns of each dataflow.\n", - "\n", - "Column manipulations that can utilize `ColumnSelector` will be noted in their respective sections in this book." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azureml.dataprep import auto_read_file\n", - "dflow = auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "All parameters to a `ColumnSelector` are shown here for completeness. We will use `keep_columns` in our example, which will keep only the columns in the dataflow that we tell it to keep.\n", - "\n", - "In the below example, we match all columns with the letter 'i'. Because we set `ignore_case` to false and `match_whole_word` to false, then any column that contains 'i' or 'I' will be selected." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azureml.dataprep import ColumnSelector\n", - "column_selector = ColumnSelector(term=\"i\",\n", - " use_regex=False,\n", - " ignore_case=True,\n", - " match_whole_word=False,\n", - " invert=False)\n", - "dflow_selected = dflow.keep_columns(column_selector)\n", - "dflow_selected.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If we set `invert` to true, we get the opposite of what we matched earlier." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "column_selector = ColumnSelector(term=\"i\",\n", - " use_regex=False,\n", - " ignore_case=True,\n", - " match_whole_word=False,\n", - " invert=True)\n", - "dflow_selected = dflow.keep_columns(column_selector)\n", - "dflow_selected.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If we change the search term to 'I' and set case sensitivity to true, we get only the handful of columns that contain an upper case 'I'." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "column_selector = ColumnSelector(term=\"I\",\n", - " use_regex=False,\n", - " ignore_case=False,\n", - " match_whole_word=False,\n", - " invert=False)\n", - "dflow_selected = dflow.keep_columns(column_selector)\n", - "dflow_selected.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And if we set `match_whole_word` to true, we get no results at all as there is no column called 'I'." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "column_selector = ColumnSelector(term=\"I\",\n", - " use_regex=False,\n", - " ignore_case=False,\n", - " match_whole_word=True,\n", - " invert=False)\n", - "dflow_selected = dflow.keep_columns(column_selector)\n", - "dflow_selected.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, the `use_regex` flag dictates whether or not to treat the search term as a regex. It can be combined still with the other options.\n", - "\n", - "Here we define all columns that begin with the capital letter 'I'." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "column_selector = ColumnSelector(term=\"I.*\",\n", - " use_regex=True,\n", - " ignore_case=True,\n", - " match_whole_word=True,\n", - " invert=False)\n", - "dflow_selected = dflow.keep_columns(column_selector)\n", - "dflow_selected.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## add_column" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Please see [add-column-using-expression](add-column-using-expression.ipynb)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## append_columns" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Please see [append-columns-and-rows](append-columns-and-rows.ipynb)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## drop_columns" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Data Prep supports dropping columns one or more columns in a single statement. Supports `ColumnSelector`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azureml.dataprep import auto_read_file\n", - "dflow = auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that there are 22 columns to begin with. We will now drop the 'ID' column and observe that the resulting dataflow contains 21 columns." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_dropped = dflow.drop_columns('ID')\n", - "dflow_dropped.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can also drop more than one column at once by passing a list of column names." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_dropped = dflow_dropped.drop_columns(['IUCR', 'Description'])\n", - "dflow_dropped.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## duplicate_column" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Data Prep supports duplicating columns one or more columns in a single statement.\n", - "\n", - "Duplicated columns are placed to the immediate right of their source column." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azureml.dataprep import auto_read_file\n", - "dflow = auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We decide which column(s) to duplicate and what the new column name(s) should be with a key value pairing (dictionary)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_dupe = dflow.duplicate_column({'ID': 'ID2', 'IUCR': 'IUCR_Clone'})\n", - "dflow_dupe.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## fuzzy_group_column" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Please see [fuzzy-group](fuzzy-group.ipynb)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## keep_columns" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Data Prep supports keeping one or more columns in a single statement. The resulting dataflow will contain only the column(s) specified; dropping all the other columns. Supports `ColumnSelector`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azureml.dataprep import auto_read_file\n", - "dflow = auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow.head(5)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_keep = dflow.keep_columns(['ID', 'Date', 'Description'])\n", - "dflow_keep.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Similar to `drop_columns`, we can pass a single column name or a list of them." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_keep = dflow_keep.keep_columns('ID')\n", - "dflow_keep.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## map_column" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Data Prep supports string mapping. For a column containing strings, we can provide specific mappings from an original value to a new value, and then produce a new column that contains the mapped values.\n", - "\n", - "The mapped columns are placed to the immediate right of their source column." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azureml.dataprep import auto_read_file\n", - "dflow = auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow.head(5)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azureml.dataprep import ReplacementsValue\n", - "replacements = [ReplacementsValue('THEFT', 'THEFT2'), ReplacementsValue('BATTERY', 'BATTERY!!!')]\n", - "dflow_mapped = dflow.map_column(column='Primary Type', \n", - " new_column_id='Primary Type V2',\n", - " replacements=replacements)\n", - "dflow_mapped.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## new_script_column" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Please see [custom-python-transforms](custom-python-transforms.ipynb)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## rename_columns" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Data Prep supports renaming one or more columns in a single statement." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from azureml.dataprep import auto_read_file\n", - "dflow = auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We decide which column(s) to rename and what the new column name(s) should be with a key value pairing (dictionary)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_renamed = dflow.rename_columns({'ID': 'ID2', 'IUCR': 'IUCR_Clone'})\n", - "dflow_renamed.head(5)" - ] - } - ], - "metadata": { - "authors": [ - { - "name": "sihhu" - } - ], - "kernelspec": { - "display_name": "Python 3.6", - "language": "python", - "name": "python36" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.4" - }, - "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License." - }, - "nbformat": 4, - "nbformat_minor": 2 -} \ No newline at end of file