{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/derive-column-by-example.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Derive Column By Example\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the more advanced tools in Data Prep is the ability to derive columns by providing examples of desired results and letting Data Prep generate code to achieve the intended derivation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import azureml.dataprep as dprep" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dflow = dprep.read_csv(path = '../data/crime-spring.csv')\n", "df = dflow.head(5)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, this is a fairly simple file, but let's assume that we need to be able to join this with a dataset where date and time come in a format 'Apr 4, 2016 | 10PM-12AM'.\n", "\n", "Let's wrangle the data into the shape we need." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "builder = dflow.builders.derive_column_by_example(source_columns = ['Date'], new_column_name = 'date_timerange')\n", "builder.add_example(source_data = df.iloc[0], example_value = 'Apr 4, 2016 10PM-12AM')\n", "builder.preview() # will preview top 10 rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code above first creates a builder for the derived column by providing an array of source columns to consider ('DATE') and name for the new column to be added.\n", "\n", "Then, we provide the first example by passing in the first row (index 0) of the DataFrame printed above and giving an expected value for the derived column.\n", "\n", "Finally, we call `builder.preview()` and observe the derived column next to the source column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Everything looks good here. However, we just noticed that it's not quite what we wanted. We forgot to separate date and time range by '|' to generate the format we need.\n", "\n", "To fix that, we will add another example. This time, instead of passing in a row from the preview, we just construct a dictionary of column name to value for the source_data parameter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "builder.add_example(source_data = {'Date': '4/15/2016 10:00'}, example_value = 'Apr 15, 2016 | 10AM-12PM')\n", "builder.preview()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This clearly had negative effects, as now the only rows that have any values in derived column are the ones that match exactly with the examples we have provided.\n", "\n", "Let's look at the examples:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "examples = builder.list_examples()\n", "examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we can see that we have provided inconsistent examples. To fix the issue, we need to replace the first example with a correct one (including '|' between date and time).\n", "\n", "We can achieve this by deleting examples that are incorrect (by either passing in example_row from examples DataFrame, or by just passing in example_id value) and then adding new modified examples back." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "builder.delete_example(example_id = -1)\n", "builder.add_example(examples.iloc[0], 'Apr 4, 2016 | 10PM-12AM')\n", "builder.preview()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now this looks correct and we can finally call to_dataflow() on the builder, which would return a dataflow with the desired derived columns added." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dflow = builder.to_dataflow()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = dflow.to_pandas_dataframe()\n", "df" ] } ], "metadata": { "authors": [ { "name": "sihhu" } ], "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License." }, "nbformat": 4, "nbformat_minor": 2 }