{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial: Prepare data for regression modeling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, you learn how to prepare data for regression modeling by using the Azure Machine Learning Data Prep SDK. You run various transformations to filter and combine two different NYC taxi data sets.\n",
"\n",
"This tutorial is **part one of a two-part tutorial series**. After you complete the tutorial series, you can predict the cost of a taxi trip by training a model on data features. These features include the pickup day and time, the number of passengers, and the pickup location.\n",
"\n",
"In this tutorial, you:\n",
"\n",
"\n",
"> * Set up a Python environment and import packages\n",
"> * Load two datasets with different field names\n",
"> * Cleanse data to remove anomalies\n",
"> * Transform data using intelligent transforms to create new features\n",
"> * Save your dataflow object to use in a regression model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"To run the notebook, you need:\n",
"\n",
"* A Python 3.6 notebook server with the following installed:\n",
"    * The Azure Machine Learning Data Prep SDK for Python\n",
"* The tutorial notebook\n",
"\n",
"Navigate back to the [tutorial page](https://docs.microsoft.com/azure/machine-learning/service/tutorial-data-prep) for specific environment setup instructions.\n",
"\n",
"## <a name=\"start\"></a>Set up your development environment\n",
"\n",
"You can complete all the setup for your development work in a Python notebook. Setup includes the following actions:\n",
"\n",
"* Install the SDK\n",
"* Import Python packages\n",
"\n",
"### Install and import packages\n",
"\n",
"Use the following command to install the necessary packages if you don't already have them.\n",
"\n",
"```shell\n",
"pip install \"azureml-dataprep[pandas]>=1.1.2,<1.2.0\"\n",
"```\n",
"\n",
"Import the SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
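{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to confirm that the installed SDK falls inside the pinned version range, you can print its version. This is an optional check, and it assumes the package exposes the standard `__version__` attribute.\n",
"\n",
"```python\n",
"import azureml.dataprep as dprep\n",
"\n",
"# Optional sanity check: the printed version should satisfy >=1.1.2,<1.2.0.\n",
"print(dprep.__version__)\n",
"```"
]
},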
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data\n",
"Download two different NYC Taxi data sets into dataflow objects. These datasets contain slightly different fields. The method `auto_read_file()` automatically recognizes the input file type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import display\n",
"dataset_root = \"https://dprepdata.blob.core.windows.net/demo\"\n",
"\n",
"green_path = \"/\".join([dataset_root, \"green-small/*\"])\n",
"yellow_path = \"/\".join([dataset_root, \"yellow-small/*\"])\n",
"\n",
"green_df_raw = dprep.read_csv(path=green_path, header=dprep.PromoteHeadersMode.GROUPED)\n",
"# auto_read_file automatically identifies and parses the file type, which is useful when you don't know the file type.\n",
"yellow_df_raw = dprep.auto_read_file(path=yellow_path)\n",
"\n",
"display(green_df_raw.head(5))\n",
"display(yellow_df_raw.head(5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `Dataflow` object is similar to a dataframe; it represents a series of lazily evaluated, immutable operations on data. You add operations by invoking the available transformation and filtering methods, and the result is always a new `Dataflow` object."
]
},
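{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, because every operation returns a new `Dataflow` and nothing executes until you pull records, you can branch from the same raw dataflow without re-reading the source files. The following is a minimal sketch of that behavior; it reuses `green_df_raw` from the cell above and assumes the `take()` and `skip()` operations on `Dataflow` objects.\n",
"\n",
"```python\n",
"# Each call returns a new, independent Dataflow; green_df_raw itself is unchanged.\n",
"first_ten = green_df_raw.take(10)\n",
"all_but_first = green_df_raw.skip(1)\n",
"\n",
"# Nothing has been read or computed yet. Execution happens only when records are\n",
"# pulled, for example with head() or to_pandas_dataframe().\n",
"display(first_ten.head(5))\n",
"```"
]
},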
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cleanse data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now populate some variables with shortcut transforms to apply to all dataflows. The `drop_if_all_null` variable is used to delete records where all fields are null. The `useful_columns` variable holds an array of column names that are kept in each dataflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_columns = dprep.ColumnSelector(term=\".*\", use_regex=True)\n",
"drop_if_all_null = [all_columns, dprep.ColumnRelationship(dprep.ColumnRelationship.ALL)]\n",
"useful_columns = [\n",
" \"cost\", \"distance\", \"dropoff_datetime\", \"dropoff_latitude\", \"dropoff_longitude\",\n",
" \"passengers\", \"pickup_datetime\", \"pickup_latitude\", \"pickup_longitude\", \"store_forward\", \"vendor\"\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You first work with the green taxi data to get it into a valid shape that can be combined with the yellow taxi data. Call the `replace_na()`, `drop_nulls()`, and `keep_columns()` functions by using the shortcut transform variables you created. Additionally, rename all the columns in the dataflow to match the names in the `useful_columns` variable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"green_df = (green_df_raw\n",
" .replace_na(columns=all_columns)\n",
" .drop_nulls(*drop_if_all_null)\n",
" .rename_columns(column_pairs={\n",
" \"VendorID\": \"vendor\",\n",
" \"lpep_pickup_datetime\": \"pickup_datetime\",\n",
" \"Lpep_dropoff_datetime\": \"dropoff_datetime\",\n",
" \"lpep_dropoff_datetime\": \"dropoff_datetime\",\n",
" \"Store_and_fwd_flag\": \"store_forward\",\n",
" \"store_and_fwd_flag\": \"store_forward\",\n",
" \"Pickup_longitude\": \"pickup_longitude\",\n",
" \"Pickup_latitude\": \"pickup_latitude\",\n",
" \"Dropoff_longitude\": \"dropoff_longitude\",\n",
" \"Dropoff_latitude\": \"dropoff_latitude\",\n",
" \"Passenger_count\": \"passengers\",\n",
" \"Fare_amount\": \"cost\",\n",
" \"Trip_distance\": \"distance\"\n",
" })\n",
" .keep_columns(columns=useful_columns))\n",
"green_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the same transformation steps on the yellow taxi data. These functions ensure that null data is removed from the data set, which helps increase machine learning model accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"yellow_df = (yellow_df_raw\n",
" .replace_na(columns=all_columns)\n",
" .drop_nulls(*drop_if_all_null)\n",
" .rename_columns(column_pairs={\n",
" \"vendor_name\": \"vendor\",\n",
" \"VendorID\": \"vendor\",\n",
" \"vendor_id\": \"vendor\",\n",
" \"Trip_Pickup_DateTime\": \"pickup_datetime\",\n",
" \"tpep_pickup_datetime\": \"pickup_datetime\",\n",
" \"Trip_Dropoff_DateTime\": \"dropoff_datetime\",\n",
" \"tpep_dropoff_datetime\": \"dropoff_datetime\",\n",
" \"store_and_forward\": \"store_forward\",\n",
" \"store_and_fwd_flag\": \"store_forward\",\n",
" \"Start_Lon\": \"pickup_longitude\",\n",
" \"Start_Lat\": \"pickup_latitude\",\n",
" \"End_Lon\": \"dropoff_longitude\",\n",
" \"End_Lat\": \"dropoff_latitude\",\n",
" \"Passenger_Count\": \"passengers\",\n",
" \"passenger_count\": \"passengers\",\n",
" \"Fare_Amt\": \"cost\",\n",
" \"fare_amount\": \"cost\",\n",
" \"Trip_Distance\": \"distance\",\n",
" \"trip_distance\": \"distance\"\n",
" })\n",
" .keep_columns(columns=useful_columns))\n",
"yellow_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Call the `append_rows()` function on the green taxi data to append the yellow taxi data. A new combined dataflow is created."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"combined_df = green_df.append_rows([yellow_df])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Convert types and filter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examine the summary statistics of the pickup and drop-off coordinates to see how the data is distributed. First, define a `TypeConverter` object to change the latitude and longitude fields to decimal type. Next, call the `keep_columns()` function to restrict output to only the latitude and longitude fields, and then call the `get_profile()` function. These function calls create a condensed view of the dataflow that shows only the latitude/longitude fields, which makes it easier to evaluate missing or out-of-scope coordinates."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"decimal_type = dprep.TypeConverter(data_type=dprep.FieldType.DECIMAL)\n",
"combined_df = combined_df.set_column_types(type_conversions={\n",
" \"pickup_longitude\": decimal_type,\n",
" \"pickup_latitude\": decimal_type,\n",
" \"dropoff_longitude\": decimal_type,\n",
" \"dropoff_latitude\": decimal_type\n",
"})\n",
"combined_df.keep_columns(columns=[\n",
" \"pickup_longitude\", \"pickup_latitude\",\n",
" \"dropoff_longitude\", \"dropoff_latitude\"\n",
"]).get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the summary statistics output, you see there are missing coordinates and coordinates that aren't in New York City (this is determined from subjective analysis). Filter out coordinates for locations that are outside the city border. Chain the column filter commands within the `filter()` function and define the minimum and maximum bounds for each field. Then call the `get_profile()` function again to verify the transformation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"latlong_filtered_df = (combined_df\n",
" .drop_nulls(\n",
" columns=[\"pickup_longitude\", \"pickup_latitude\", \"dropoff_longitude\", \"dropoff_latitude\"],\n",
" column_relationship=dprep.ColumnRelationship(dprep.ColumnRelationship.ANY)\n",
" )\n",
" .filter(dprep.f_and(\n",
" dprep.col(\"pickup_longitude\") <= -73.72,\n",
" dprep.col(\"pickup_longitude\") >= -74.09,\n",
" dprep.col(\"pickup_latitude\") <= 40.88,\n",
" dprep.col(\"pickup_latitude\") >= 40.53,\n",
" dprep.col(\"dropoff_longitude\") <= -73.72,\n",
" dprep.col(\"dropoff_longitude\") >= -74.09,\n",
" dprep.col(\"dropoff_latitude\") <= 40.88,\n",
" dprep.col(\"dropoff_latitude\") >= 40.53\n",
" )))\n",
"latlong_filtered_df.keep_columns(columns=[\n",
" \"pickup_longitude\", \"pickup_latitude\",\n",
" \"dropoff_longitude\", \"dropoff_latitude\"\n",
"]).get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Split and rename columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at the data profile for the `store_forward` column. This field is a Boolean flag that is `Y` when the taxi didn't have a connection to the server after the trip and had to store the trip data in memory, then forward it to the server once connected."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"latlong_filtered_df.keep_columns(columns='store_forward').get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the data profile output for the `store_forward` column shows that the data is inconsistent and that there are missing or null values. Use the `replace()` and `fill_nulls()` functions to replace these values with the string \"N\":"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"replaced_stfor_vals_df = latlong_filtered_df.replace(columns=\"store_forward\", find=\"0\", replace_with=\"N\").fill_nulls(\"store_forward\", \"N\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute the `replace()` function on the `distance` field. The function reformats distance values that are incorrectly labeled as `.00` and fills any nulls with zeros. Then convert the `distance` field to numeric format. These incorrect data points are likely anomalies in the data collection system on the taxi cabs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"replaced_distance_vals_df = replaced_stfor_vals_df.replace(columns=\"distance\", find=\".00\", replace_with=0).fill_nulls(\"distance\", 0)\n",
"replaced_distance_vals_df = replaced_distance_vals_df.to_number([\"distance\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the pickup and dropoff datetime values into the respective date and time columns. Use the `split_column_by_example()` function to make the split. In this case, the optional `example` parameter of the `split_column_by_example()` function is omitted. Therefore, the function automatically determines where to split based on the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time_split_df = (replaced_distance_vals_df\n",
" .split_column_by_example(source_column=\"pickup_datetime\")\n",
" .split_column_by_example(source_column=\"dropoff_datetime\"))\n",
"time_split_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rename the columns generated by `split_column_by_example()` to meaningful names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"renamed_col_df = (time_split_df\n",
" .rename_columns(column_pairs={\n",
" \"pickup_datetime_1\": \"pickup_date\",\n",
" \"pickup_datetime_2\": \"pickup_time\",\n",
" \"dropoff_datetime_1\": \"dropoff_date\",\n",
" \"dropoff_datetime_2\": \"dropoff_time\"\n",
" }))\n",
"renamed_col_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Call the `get_profile()` function to see the full summary statistics after all cleansing steps."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"renamed_col_df.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transform data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the pickup and dropoff date further into the day of the week, day of the month, and month values. To get the day of the week value, use the `derive_column_by_example()` function. The function takes an array parameter of example objects that define the input data and the preferred output. The function automatically determines your preferred transformation. For the pickup and dropoff time columns, split the time into the hour, minute, and second by using the `split_column_by_example()` function with no example parameter.\n",
"\n",
"After you generate the new features, use the `drop_columns()` function to delete the original fields because the newly generated features are preferred. Rename the rest of the fields to meaningful descriptions.\n",
"\n",
"Transforming the data in this way to create new time-based features improves machine learning model accuracy. For example, generating a new feature for the weekday helps establish a relationship between the day of the week and the taxi fare price, which is often more expensive on certain days of the week due to high demand."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"transformed_features_df = (renamed_col_df\n",
" .derive_column_by_example(\n",
" source_columns=\"pickup_date\",\n",
" new_column_name=\"pickup_weekday\",\n",
" example_data=[(\"2009-01-04\", \"Sunday\"), (\"2013-08-22\", \"Thursday\")]\n",
" )\n",
" .derive_column_by_example(\n",
" source_columns=\"dropoff_date\",\n",
" new_column_name=\"dropoff_weekday\",\n",
" example_data=[(\"2013-08-22\", \"Thursday\"), (\"2013-11-03\", \"Sunday\")]\n",
" )\n",
"\n",
" .split_column_by_example(source_column=\"pickup_time\")\n",
" .split_column_by_example(source_column=\"dropoff_time\")\n",
" # The following two calls to split_column_by_example reference the column names generated from the previous two calls.\n",
" .split_column_by_example(source_column=\"pickup_time_1\")\n",
" .split_column_by_example(source_column=\"dropoff_time_1\")\n",
" .drop_columns(columns=[\n",
" \"pickup_date\", \"pickup_time\", \"dropoff_date\", \"dropoff_time\",\n",
" \"pickup_date_1\", \"dropoff_date_1\", \"pickup_time_1\", \"dropoff_time_1\"\n",
" ])\n",
"\n",
" .rename_columns(column_pairs={\n",
" \"pickup_date_2\": \"pickup_month\",\n",
" \"pickup_date_3\": \"pickup_monthday\",\n",
" \"pickup_time_1_1\": \"pickup_hour\",\n",
" \"pickup_time_1_2\": \"pickup_minute\",\n",
" \"pickup_time_2\": \"pickup_second\",\n",
" \"dropoff_date_2\": \"dropoff_month\",\n",
" \"dropoff_date_3\": \"dropoff_monthday\",\n",
" \"dropoff_time_1_1\": \"dropoff_hour\",\n",
" \"dropoff_time_1_2\": \"dropoff_minute\",\n",
" \"dropoff_time_2\": \"dropoff_second\"\n",
" }))\n",
"\n",
"transformed_features_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data shows that the pickup and dropoff date and time components produced by the derived transformations are correct. Drop the `pickup_datetime` and `dropoff_datetime` columns because they're no longer needed (granular time features like hour, minute, and second are more useful for model training)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"processed_df = transformed_features_df.drop_columns(columns=[\"pickup_datetime\", \"dropoff_datetime\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the type inference functionality to automatically check the data type of each field and display the inference results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type_infer = processed_df.builders.set_column_types()\n",
"type_infer.learn()\n",
"type_infer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The inference results look correct based on the data. Now apply the type conversions to the dataflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type_converted_df = type_infer.to_dataflow()\n",
"type_converted_df.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before you package the dataflow, run two final filters on the data set. To eliminate incorrectly captured data points, filter the dataflow to records where both the `cost` and `distance` values are greater than zero. This step significantly improves machine learning model accuracy, because data points with a zero cost or distance represent major outliers that throw off prediction accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"final_df = type_converted_df.filter(dprep.col(\"distance\") > 0)\n",
"final_df = final_df.filter(dprep.col(\"cost\") > 0)"
]
},
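{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the filters, you can count the records that remain. This sketch assumes the `row_count` property on `Dataflow`; it pulls the data, so it can take a moment on larger data sets.\n",
"\n",
"```python\n",
"# Pull the data and count the rows that survive the cost and distance filters\n",
"# (assumes the Dataflow.row_count property is available in your SDK version).\n",
"print(final_df.row_count)\n",
"```"
]
},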
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You now have a fully transformed and prepared dataflow object to use in a machine learning model. The SDK includes object serialization functionality; use it as shown in the following code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"file_path = os.path.join(os.getcwd(), \"dflows.dprep\")\n",
"\n",
"final_df.save(file_path)"
]
},
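{
"cell_type": "markdown",
"metadata": {},
"source": [
"Part two of the series opens this file again from the same path. As a quick round-trip check, you can reload the serialized dataflow right away; this sketch assumes the `Dataflow.open()` method for reading a saved `.dprep` package.\n",
"\n",
"```python\n",
"# Reload the serialized dataflow to confirm the save round-trips (assumes Dataflow.open()).\n",
"reloaded_df = dprep.Dataflow.open(file_path)\n",
"reloaded_df.head(5)\n",
"```"
]
},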
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean up resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To continue with part two of the tutorial, you need the **dflows.dprep** file in the current directory.\n",
"\n",
"If you don't plan to continue to part two, delete the **dflows.dprep** file in your current directory. Delete this file whether you're running locally or in [Azure Notebooks](https://notebooks.azure.com/)."
]
},
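{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you're not continuing, a minimal cleanup sketch like the following removes the file from the current directory using only the standard library (adjust the path if you saved the file elsewhere).\n",
"\n",
"```python\n",
"import os\n",
"\n",
"# Delete the serialized dataflow created in this tutorial, if it exists.\n",
"file_path = os.path.join(os.getcwd(), \"dflows.dprep\")\n",
"if os.path.exists(file_path):\n",
"    os.remove(file_path)\n",
"```"
]
},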
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this Azure Machine Learning Data Prep SDK tutorial, you:\n",
"\n",
"> * Set up your development environment\n",
"> * Loaded and cleansed data sets\n",
"> * Used smart transforms that infer your logic from example data\n",
"> * Merged and packaged datasets for machine learning training\n",
"\n",
"You are ready to use this training data in the next part of the tutorial series:\n",
"\n",
"\n",
"> [Tutorial #2: Train regression model](regression-part2-automated-ml.ipynb)"
]
}
],
"metadata": {
"authors": [
{
"name": "cforbe"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"msauthor": "trbye"
},
"nbformat": 4,
"nbformat_minor": 2
} |