MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/column-manipulations.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/column-manipulations.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Column Manipulations\n",
        "\n",
        "Azure ML Data Prep has many methods for manipulating columns, including basic CUD operations and several other more complex manipulations.\n",
        "\n",
        "This notebook will focus primarily on data-agnostic operations. For all other column manipulation operations, we will link to their specific how-to guide.\n",
        "\n",
        "## Table of Contents\n",
        "[ColumnSelector](#ColumnSelector)<br>\n",
        "[add_column](#add_column)<br>\n",
        "[append_columns](#append_columns)<br>\n",
        "[drop_columns](#drop_columns)<br>\n",
        "[duplicate_column](#duplicate_column)<br>\n",
        "[fuzzy_group_column](#fuzzy_group_column)<br>\n",
        "[keep_columns](#keep_columns)<br>\n",
        "[map_column](#map_column)<br>\n",
        "[new_script_column](#new_script_column)<br>\n",
        "[rename_columns](#rename_columns)<br>\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"ColumnSelector\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## ColumnSelector\n",
        "`ColumnSelector` is a Data Prep class that allows us to select columns by name. The idea is to be able to describe columns generally instead of explicitly, using a search term or regex expression, with various options.\n",
        "\n",
        "Note that a `ColumnSelector` does not represent the columns they match themselves, but the selector of the described columns. Therefore if we use the same `ColumnSelector` on two different dataflows, we may get different results depending on the columns of each dataflow.\n",
        "\n",
        "Column manipulations that can utilize `ColumnSelector` will be noted in their respective sections in this book."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import auto_read_file\n",
        "dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "All parameters to a `ColumnSelector` are shown here for completeness. We will use `keep_columns` in our example, which will keep only the columns in the dataflow that we tell it to keep.\n",
        "\n",
        "In the below example, we match all columns with the letter 'i'. Because we set `ignore_case` to false and `match_whole_word` to false, then any column that contains 'i' or 'I' will be selected."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import ColumnSelector\n",
        "column_selector = ColumnSelector(term=\"i\",\n",
        "                                 use_regex=False,\n",
        "                                 ignore_case=True,\n",
        "                                 match_whole_word=False,\n",
        "                                 invert=False)\n",
        "dflow_selected = dflow.keep_columns(column_selector)\n",
        "dflow_selected.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If we set `invert` to true, we get the opposite of what we matched earlier."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "column_selector = ColumnSelector(term=\"i\",\n",
        "                                 use_regex=False,\n",
        "                                 ignore_case=True,\n",
        "                                 match_whole_word=False,\n",
        "                                 invert=True)\n",
        "dflow_selected = dflow.keep_columns(column_selector)\n",
        "dflow_selected.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If we change the search term to 'I' and set case sensitivity to true, we get only the handful of columns that contain an upper case 'I'."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "column_selector = ColumnSelector(term=\"I\",\n",
        "                                 use_regex=False,\n",
        "                                 ignore_case=False,\n",
        "                                 match_whole_word=False,\n",
        "                                 invert=False)\n",
        "dflow_selected = dflow.keep_columns(column_selector)\n",
        "dflow_selected.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "And if we set `match_whole_word` to true, we get no results at all as there is no column called 'I'."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "column_selector = ColumnSelector(term=\"I\",\n",
        "                                 use_regex=False,\n",
        "                                 ignore_case=False,\n",
        "                                 match_whole_word=True,\n",
        "                                 invert=False)\n",
        "dflow_selected = dflow.keep_columns(column_selector)\n",
        "dflow_selected.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, the `use_regex` flag dictates whether or not to treat the search term as a regex. It can be combined still with the other options.\n",
        "\n",
        "Here we define all columns that begin with the capital letter 'I'."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "column_selector = ColumnSelector(term=\"I.*\",\n",
        "                                 use_regex=True,\n",
        "                                 ignore_case=True,\n",
        "                                 match_whole_word=True,\n",
        "                                 invert=False)\n",
        "dflow_selected = dflow.keep_columns(column_selector)\n",
        "dflow_selected.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"add_column\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## add_column"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Please see [add-column-using-expression](add-column-using-expression.ipynb)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"append_columns\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## append_columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Please see [append-columns-and-rows](append-columns-and-rows.ipynb)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"drop_columns\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## drop_columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Data Prep supports dropping columns one or more columns in a single statement. Supports `ColumnSelector`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import auto_read_file\n",
        "dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Note that there are 22 columns to begin with. We will now drop the 'ID' column and observe that the resulting dataflow contains 21 columns."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow_dropped = dflow.drop_columns('ID')\n",
        "dflow_dropped.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can also drop more than one column at once by passing a list of column names."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow_dropped = dflow_dropped.drop_columns(['IUCR', 'Description'])\n",
        "dflow_dropped.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"duplicate_column\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## duplicate_column"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Data Prep supports duplicating columns one or more columns in a single statement.\n",
        "\n",
        "Duplicated columns are placed to the immediate right of their source column."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import auto_read_file\n",
        "dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We decide which column(s) to duplicate and what the new column name(s) should be with a key value pairing (dictionary)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow_dupe = dflow.duplicate_column({'ID': 'ID2', 'IUCR': 'IUCR_Clone'})\n",
        "dflow_dupe.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"fuzzy_group_column\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## fuzzy_group_column"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Please see [fuzzy-group](fuzzy-group.ipynb)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"keep_columns\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## keep_columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Data Prep supports keeping one or more columns in a single statement. The resulting dataflow will contain only the column(s) specified; dropping all the other columns. Supports `ColumnSelector`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import auto_read_file\n",
        "dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow_keep = dflow.keep_columns(['ID', 'Date', 'Description'])\n",
        "dflow_keep.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Similar to `drop_columns`, we can pass a single column name or a list of them."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow_keep = dflow_keep.keep_columns('ID')\n",
        "dflow_keep.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"map_column\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## map_column"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Data Prep supports string mapping. For a column containing strings, we can provide specific mappings from an original value to a new value, and then produce a new column that contains the mapped values.\n",
        "\n",
        "The mapped columns are placed to the immediate right of their source column."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import auto_read_file\n",
        "dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import ReplacementsValue\n",
        "replacements = [ReplacementsValue('THEFT', 'THEFT2'), ReplacementsValue('BATTERY', 'BATTERY!!!')]\n",
        "dflow_mapped = dflow.map_column(column='Primary Type', \n",
        "                                new_column_id='Primary Type V2',\n",
        "                                replacements=replacements)\n",
        "dflow_mapped.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"new_script_column\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## new_script_column"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Please see [custom-python-transforms](custom-python-transforms.ipynb)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"rename_columns\"></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## rename_columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Data Prep supports renaming one or more columns in a single statement."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import auto_read_file\n",
        "dflow = auto_read_file(path='../data/crime-dirty.csv')\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We decide which column(s) to rename and what the new column name(s) should be with a key value pairing (dictionary)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow_renamed = dflow.rename_columns({'ID': 'ID2', 'IUCR': 'IUCR_Clone'})\n",
        "dflow_renamed.head(5)"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "sihhu"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.4"
    },
    "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
  },
  "nbformat": 4,
  "nbformat_minor": 2
}