From 7d552effb0fa6d2b599b1dead76d7a1023f86c04 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Shan=C3=A9=20Winner?= <43390034+swinner95@users.noreply.github.com>
Date: Wed, 21 Aug 2019 10:17:01 -0700
Subject: [PATCH] Delete split-column-by-example.ipynb

---
 .../split-column-by-example.ipynb | 220 ------------------
 1 file changed, 220 deletions(-)
 delete mode 100644 work-with-data/dataprep/how-to-guides/split-column-by-example.ipynb

diff --git a/work-with-data/dataprep/how-to-guides/split-column-by-example.ipynb b/work-with-data/dataprep/how-to-guides/split-column-by-example.ipynb
deleted file mode 100644
index 02c74746..00000000
--- a/work-with-data/dataprep/how-to-guides/split-column-by-example.ipynb
+++ /dev/null
@@ -1,220 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/split-column-by-example.png)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Split column by example\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "DataPrep also offers you a way to easily split a column into multiple columns.\n",
-    "The SplitColumnByExampleBuilder class lets you generate a proper split program that will work even when the cases are not trivial, like in example below."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import azureml.dataprep as dprep"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "dflow = dprep.read_lines(path='../data/crime.txt')\n",
-    "df = dflow.head(10)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df['Line'].iloc[0]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "As you can see above, you can't split this particular file by space character as it will create too many columns.\n",
-    "That's where split_column_by_example could be quite useful."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "builder = dflow.builders.split_column_by_example('Line', keep_delimiters=True)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "builder.preview()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Couple things to take note of here. No examples were given, and yet DataPrep was able to generate quite reasonable split program. \n",
-    "We have passed keep_delimiters=True so we can see all the data split into columns. In practice, though, delimiters are rarely useful, so let's exclude them."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "builder.keep_delimiters = False\n",
-    "builder.preview()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "This looks pretty good already, except that one case number is split into 2 columns. Taking the first row as an example, we want to keep case number as \"HY329907\" instead of \"HY\" and \"329907\" seperately. \n",
-    "If we request generation of suggested examples we will get a list of examples that require input."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "suggestions = builder.generate_suggested_examples()\n",
-    "suggestions"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "suggestions.iloc[0]['Line']"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Having retrieved source value we can now provide an example of desired split.\n",
-    "Notice that we chose not to split date and time but rather keep them together in one column."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "builder.add_example(example=(suggestions['Line'].iloc[0], ['10140490','HY329907','7/5/2015 23:50','050XX N NEWLAND AVE','820','THEFT']))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "builder.preview()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "As we can see from the preview, some of the crime types (`Line_6`) do not show up as expected. Let's try to add one more example. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "builder.add_example(example=(df['Line'].iloc[1],['10139776','HY329265','7/5/2015 23:30','011XX W MORSE AVE','460','BATTERY']))\n",
-    "builder.preview()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "This looks just like what we need. Let's get a dataflow with splited columns and drop original column."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "dflow = builder.to_dataflow()\n",
-    "dflow = dflow.drop_columns(['Line'])\n",
-    "dflow.head(5)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now we have successfully split the data into useful columns through examples."
-   ]
-  }
- ],
- "metadata": {
-  "authors": [
-   {
-    "name": "sihhu"
-   }
-  ],
-  "kernelspec": {
-   "display_name": "Python 3.6",
-   "language": "python",
-   "name": "python36"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.6.8"
-  },
-  "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
\ No newline at end of file