MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/split-column-by-example.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Split column by example\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "DataPrep offers an easy way to split a single column into multiple columns.\n",
        "The `SplitColumnByExampleBuilder` class generates a split program that works even in non-trivial cases, such as the example below."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import azureml.dataprep as dprep"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow = dprep.read_lines(path='../data/crime.txt')\n",
        "df = dflow.head(10)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "df['Line'].iloc[0]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As the line above shows, you can't simply split this file on the space character, because that would create too many columns.\n",
        "This is where `split_column_by_example` is useful."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "builder = dflow.builders.split_column_by_example('Line', keep_delimiters=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "builder.preview()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A couple of things to note here. No examples were given, and yet DataPrep was able to generate a quite reasonable split program.\n",
        "We passed `keep_delimiters=True` so that we can see all the data split into columns, delimiters included. In practice, though, delimiters are rarely useful, so let's exclude them."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "builder.keep_delimiters = False\n",
        "builder.preview()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This looks pretty good already, except that the case number is split into two columns. Taking the first row as an example, we want to keep the case number as \"HY329907\" instead of \"HY\" and \"329907\" separately.  \n",
        "If we ask the builder to generate suggested examples, we get a list of examples that require input."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "suggestions = builder.generate_suggested_examples()\n",
        "suggestions"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "suggestions.iloc[0]['Line']"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Having retrieved the source value, we can now provide an example of the desired split.\n",
        "Notice that we chose not to split the date and time, but rather to keep them together in one column."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "builder.add_example(example=(suggestions['Line'].iloc[0], ['10140490','HY329907','7/5/2015  23:50','050XX N NEWLAND AVE','820','THEFT']))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "builder.preview()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As we can see from the preview, some of the crime types (`Line_6`) do not show up as expected. Let's add one more example."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "builder.add_example(example=(df['Line'].iloc[1], ['10139776','HY329265','7/5/2015  23:30','011XX W MORSE AVE','460','BATTERY']))\n",
        "builder.preview()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This looks like just what we need. Let's get a dataflow with the split columns and drop the original column."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow = builder.to_dataflow()\n",
        "dflow = dflow.drop_columns(['Line'])\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we have successfully split the data into useful columns through examples."
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "sihhu"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.8"
    },
    "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
  },
  "nbformat": 4,
  "nbformat_minor": 2
}