{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
    "\n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial #1: Prepare data for regression modeling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial, you learn how to prepare data for regression modeling using the Azure Machine Learning Data Prep SDK. You perform various transformations to filter and combine two different NYC Taxi data sets. The end goal of this tutorial set is to predict the cost of a taxi trip by training a model on data features including pickup hour, day of week, number of passengers, and coordinates. This tutorial is part one of a two-part tutorial series.\n",
    "\n",
    "In this tutorial, you:\n",
    "\n",
    "> * Set up a Python environment and import packages\n",
    "> * Load two datasets with different field names\n",
    "> * Cleanse data to remove anomalies\n",
    "> * Transform data using intelligent transforms to create new features\n",
    "> * Save your dataflow object to use in a regression model\n",
    "\n",
    "You can prepare your data in Python using the [Azure Machine Learning Data Prep SDK](https://aka.ms/data-prep-sdk)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import packages\n",
    "Begin by importing the SDK."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import azureml.dataprep as dprep"
   ]
  },
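  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the import fails because the SDK is not installed in your environment, the next cell is an optional, minimal sketch of installing it from PyPI (assuming a pip-based kernel; restart the kernel after the install completes)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: install the Data Prep SDK from PyPI (skip if already installed).\n",
    "# Assumes a pip-based environment; restart the kernel after installing.\n",
    "!pip install azureml-dataprep"
   ]
  },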
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load data\n",
    "Download two different NYC Taxi data sets into dataflow objects. These datasets contain slightly different fields. The method `auto_read_file()` automatically recognizes the input file type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_root = \"https://dprepdata.blob.core.windows.net/demo\"\n",
    "\n",
    "green_path = \"/\".join([dataset_root, \"green-small/*\"])\n",
    "yellow_path = \"/\".join([dataset_root, \"yellow-small/*\"])\n",
    "\n",
    "green_df = dprep.read_csv(path=green_path, header=dprep.PromoteHeadersMode.GROUPED)\n",
    "# auto_read_file will automatically identify and parse the file type, and is useful if you don't know the file type\n",
    "yellow_df = dprep.auto_read_file(path=yellow_path)\n",
    "\n",
    "green_df.head(5)\n",
    "yellow_df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cleanse data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you populate some variables with shortcut transforms that will apply to all dataflows. The variable `drop_if_all_null` will be used to delete records where all fields are null. The variable `useful_columns` holds an array of column descriptions that are retained in each dataflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_columns = dprep.ColumnSelector(term=\".*\", use_regex=True)\n",
    "drop_if_all_null = [all_columns, dprep.ColumnRelationship(dprep.ColumnRelationship.ALL)]\n",
    "useful_columns = [\n",
    " \"cost\", \"distance\", \"dropoff_datetime\", \"dropoff_latitude\", \"dropoff_longitude\",\n",
    " \"passengers\", \"pickup_datetime\", \"pickup_latitude\", \"pickup_longitude\", \"store_forward\", \"vendor\"\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You first work with the green taxi data and get it into a valid shape that can be combined with the yellow taxi data. Create a temporary dataflow `tmp_df`, and call the `replace_na()`, `drop_nulls()`, and `keep_columns()` functions using the shortcut transform variables you created. Additionally, rename all the columns in the dataflow to match the names in `useful_columns`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tmp_df = (green_df\n",
    " .replace_na(columns=all_columns)\n",
    " .drop_nulls(*drop_if_all_null)\n",
    " .rename_columns(column_pairs={\n",
    " \"VendorID\": \"vendor\",\n",
    " \"lpep_pickup_datetime\": \"pickup_datetime\",\n",
    " \"Lpep_dropoff_datetime\": \"dropoff_datetime\",\n",
    " \"lpep_dropoff_datetime\": \"dropoff_datetime\",\n",
    " \"Store_and_fwd_flag\": \"store_forward\",\n",
    " \"store_and_fwd_flag\": \"store_forward\",\n",
    " \"Pickup_longitude\": \"pickup_longitude\",\n",
    " \"Pickup_latitude\": \"pickup_latitude\",\n",
    " \"Dropoff_longitude\": \"dropoff_longitude\",\n",
    " \"Dropoff_latitude\": \"dropoff_latitude\",\n",
    " \"Passenger_count\": \"passengers\",\n",
    " \"Fare_amount\": \"cost\",\n",
    " \"Trip_distance\": \"distance\"\n",
    " })\n",
    " .keep_columns(columns=useful_columns))\n",
    "tmp_df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Overwrite the `green_df` variable with the transforms performed on `tmp_df` in the previous step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "green_df = tmp_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Perform the same transformation steps to the yellow taxi data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tmp_df = (yellow_df\n",
    " .replace_na(columns=all_columns)\n",
    " .drop_nulls(*drop_if_all_null)\n",
    " .rename_columns(column_pairs={\n",
    " \"vendor_name\": \"vendor\",\n",
    " \"VendorID\": \"vendor\",\n",
    " \"vendor_id\": \"vendor\",\n",
    " \"Trip_Pickup_DateTime\": \"pickup_datetime\",\n",
    " \"tpep_pickup_datetime\": \"pickup_datetime\",\n",
    " \"Trip_Dropoff_DateTime\": \"dropoff_datetime\",\n",
    " \"tpep_dropoff_datetime\": \"dropoff_datetime\",\n",
    " \"store_and_forward\": \"store_forward\",\n",
    " \"store_and_fwd_flag\": \"store_forward\",\n",
    " \"Start_Lon\": \"pickup_longitude\",\n",
    " \"Start_Lat\": \"pickup_latitude\",\n",
    " \"End_Lon\": \"dropoff_longitude\",\n",
    " \"End_Lat\": \"dropoff_latitude\",\n",
    " \"Passenger_Count\": \"passengers\",\n",
    " \"passenger_count\": \"passengers\",\n",
    " \"Fare_Amt\": \"cost\",\n",
    " \"fare_amount\": \"cost\",\n",
    " \"Trip_Distance\": \"distance\",\n",
    " \"trip_distance\": \"distance\"\n",
    " })\n",
    " .keep_columns(columns=useful_columns))\n",
    "tmp_df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, overwrite `yellow_df` with `tmp_df`, and then call the `append_rows()` function on the green taxi data to append the yellow taxi data, creating a new combined dataflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "yellow_df = tmp_df\n",
    "combined_df = green_df.append_rows([yellow_df])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert types and filter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Examine the pickup and drop-off coordinates summary statistics to see how the data is distributed. First define a `TypeConverter` object to change the lat/long fields to decimal type. Next, call the `keep_columns()` function to restrict output to only the lat/long fields, and then call `get_profile()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "decimal_type = dprep.TypeConverter(data_type=dprep.FieldType.DECIMAL)\n",
    "combined_df = combined_df.set_column_types(type_conversions={\n",
    " \"pickup_longitude\": decimal_type,\n",
    " \"pickup_latitude\": decimal_type,\n",
    " \"dropoff_longitude\": decimal_type,\n",
    " \"dropoff_latitude\": decimal_type\n",
    "})\n",
    "combined_df.keep_columns(columns=[\n",
    " \"pickup_longitude\", \"pickup_latitude\", \n",
    " \"dropoff_longitude\", \"dropoff_latitude\"\n",
    "]).get_profile()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the summary statistics output, you see that some coordinates are missing and some are not in New York City. Filter out coordinates for locations outside the city limits by chaining column filter commands within the `filter()` function and defining minimum and maximum bounds for each field. Then call `get_profile()` again to verify the transformation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tmp_df = (combined_df\n",
    " .drop_nulls(\n",
    " columns=[\"pickup_longitude\", \"pickup_latitude\", \"dropoff_longitude\", \"dropoff_latitude\"],\n",
    " column_relationship=dprep.ColumnRelationship(dprep.ColumnRelationship.ANY)\n",
    " ) \n",
    " .filter(dprep.f_and(\n",
    " dprep.col(\"pickup_longitude\") <= -73.72,\n",
    " dprep.col(\"pickup_longitude\") >= -74.09,\n",
    " dprep.col(\"pickup_latitude\") <= 40.88,\n",
    " dprep.col(\"pickup_latitude\") >= 40.53,\n",
    " dprep.col(\"dropoff_longitude\") <= -73.72,\n",
    " dprep.col(\"dropoff_longitude\") >= -74.09,\n",
    " dprep.col(\"dropoff_latitude\") <= 40.88,\n",
    " dprep.col(\"dropoff_latitude\") >= 40.53\n",
    " )))\n",
    "tmp_df.keep_columns(columns=[\n",
    " \"pickup_longitude\", \"pickup_latitude\", \n",
    " \"dropoff_longitude\", \"dropoff_latitude\"\n",
    "]).get_profile()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Overwrite `combined_df` with the transformations you made to `tmp_df`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "combined_df = tmp_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Split and rename columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look at the data profile for the `store_forward` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "combined_df.keep_columns(columns='store_forward').get_profile()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the data profile output of `store_forward`, you see that the data is inconsistent and there are missing/null values. Replace these values using the `replace()` and `fill_nulls()` functions, in both cases changing them to the string \"N\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "combined_df = combined_df.replace(columns=\"store_forward\", find=\"0\", replace_with=\"N\").fill_nulls(\"store_forward\", \"N\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Split the pickup and drop-off datetimes into separate date and time columns. Use `split_column_by_example()` to perform the split. In this case, the optional `example` parameter of `split_column_by_example()` is omitted, so the function automatically determines where to split based on the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tmp_df = (combined_df\n",
    " .split_column_by_example(source_column=\"pickup_datetime\")\n",
    " .split_column_by_example(source_column=\"dropoff_datetime\"))\n",
    "tmp_df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rename the columns generated by `split_column_by_example()` to meaningful names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tmp_df_renamed = (tmp_df\n",
    " .rename_columns(column_pairs={\n",
    " \"pickup_datetime_1\": \"pickup_date\",\n",
    " \"pickup_datetime_2\": \"pickup_time\",\n",
    " \"dropoff_datetime_1\": \"dropoff_date\",\n",
    " \"dropoff_datetime_2\": \"dropoff_time\"\n",
    " }))\n",
    "tmp_df_renamed.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Overwrite `combined_df` with the executed transformations, and then call `get_profile()` to see full summary statistics after all transformations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "combined_df = tmp_df_renamed\n",
    "combined_df.get_profile()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transform data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Split the pickup and drop-off dates further into day of the week, day of the month, and month. To get the day of the week, use the `derive_column_by_example()` function. This function takes an array of example objects as a parameter, where each example defines an input value and the desired output, and then automatically determines your desired transformation. For the pickup and drop-off time columns, split them into hour, minute, and second by using the `split_column_by_example()` function with no example parameter.\n",
    "\n",
    "Once you have generated these new features, delete the original fields in favor of the newly generated features using `drop_columns()`. Rename all remaining fields to accurate descriptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tmp_df = (combined_df\n",
    " .derive_column_by_example(\n",
    " source_columns=\"pickup_date\", \n",
    " new_column_name=\"pickup_weekday\", \n",
    " example_data=[(\"2009-01-04\", \"Sunday\"), (\"2013-08-22\", \"Thursday\")]\n",
    " )\n",
    " .derive_column_by_example(\n",
    " source_columns=\"dropoff_date\",\n",
    " new_column_name=\"dropoff_weekday\",\n",
    " example_data=[(\"2013-08-22\", \"Thursday\"), (\"2013-11-03\", \"Sunday\")]\n",
    " )\n",
    " \n",
    " .split_column_by_example(source_column=\"pickup_time\")\n",
    " .split_column_by_example(source_column=\"dropoff_time\")\n",
    " # the following two split_column_by_example calls reference the generated column names from the above two calls\n",
    " .split_column_by_example(source_column=\"pickup_time_1\")\n",
    " .split_column_by_example(source_column=\"dropoff_time_1\")\n",
    " .drop_columns(columns=[\n",
    " \"pickup_date\", \"pickup_time\", \"dropoff_date\", \"dropoff_time\", \n",
    " \"pickup_date_1\", \"dropoff_date_1\", \"pickup_time_1\", \"dropoff_time_1\"\n",
    " ])\n",
    " \n",
    " .rename_columns(column_pairs={\n",
    " \"pickup_date_2\": \"pickup_month\",\n",
    " \"pickup_date_3\": \"pickup_monthday\",\n",
    " \"pickup_time_1_1\": \"pickup_hour\",\n",
    " \"pickup_time_1_2\": \"pickup_minute\",\n",
    " \"pickup_time_2\": \"pickup_second\",\n",
    " \"dropoff_date_2\": \"dropoff_month\",\n",
    " \"dropoff_date_3\": \"dropoff_monthday\",\n",
    " \"dropoff_time_1_1\": \"dropoff_hour\",\n",
    " \"dropoff_time_1_2\": \"dropoff_minute\",\n",
    " \"dropoff_time_2\": \"dropoff_second\"\n",
    " }))\n",
    "\n",
    "tmp_df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the data above, you see that the pickup and drop-off date and time components produced from the derived transformations are correct. Drop the `pickup_datetime` and `dropoff_datetime` columns as they are no longer needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tmp_df = tmp_df.drop_columns(columns=[\"pickup_datetime\", \"dropoff_datetime\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the type inference functionality to automatically check the data type of each field, and display the inference results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type_infer = tmp_df.builders.set_column_types()\n",
    "type_infer.learn()\n",
    "type_infer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The inference results look correct based on the data. Now apply the type conversions to the dataflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tmp_df = type_infer.to_dataflow()\n",
    "tmp_df.get_profile()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this point, you have a fully transformed and prepared dataflow object to use in a machine learning model. The Data Prep SDK includes object serialization functionality, which is used as follows to save your dataflow to disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dflow_prepared = tmp_df\n",
    "package = dprep.Package([dflow_prepared])\n",
    "package.save(\"./dflows\")"
   ]
  },
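  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optionally, you can verify that the package round-trips by reopening it from disk. The next cell is a minimal sketch that assumes the `dprep.Package.open()` method and `dataflows` property behave as they do in part two of this tutorial series, and that `package.save()` wrote the file to the current working directory (possibly with a `.dprep` extension appended)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Optional round-trip check (sketch): reopen the saved package and grab the\n",
    "# prepared dataflow. Assumes dprep.Package.open() and the .dataflows list, as\n",
    "# used in part two of this tutorial series. package.save() above may have\n",
    "# written the file as \"dflows\" or \"dflows.dprep\", so use whichever exists.\n",
    "saved_path = \"dflows.dprep\" if os.path.isfile(\"dflows.dprep\") else \"dflows\"\n",
    "package_saved = dprep.Package.open(saved_path)\n",
    "dflow_reloaded = package_saved.dataflows[0]\n",
    "dflow_reloaded.head(5)"
   ]
  },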
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clean up resources"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you do not wish to continue with part two of the tutorial, delete the `dflows` file in your current directory (whether you are running locally or in Azure Notebooks). If you continue on to part two, you will need the `dflows` file in the current directory."
   ]
  },
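  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next cell is an optional sketch of that clean-up step. It assumes the saved package file lives in the current working directory and that its name starts with `dflows`; only run it if you are not continuing to part two."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob\n",
    "import os\n",
    "\n",
    "# Optional clean-up (sketch): remove the saved dataflow package file(s) created\n",
    "# by package.save() above. Assumes the file name starts with \"dflows\" and that\n",
    "# it was written to the current working directory. Skip this cell if you plan\n",
    "# to run part two of the tutorial.\n",
    "for saved_file in glob.glob(\"dflows*\"):\n",
    "    os.remove(saved_file)\n",
    "    print(\"Removed\", saved_file)"
   ]
  },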
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this Azure Machine Learning Data Prep SDK tutorial, you:\n",
    "\n",
    "> * Set up your development environment\n",
    "> * Loaded and cleansed data sets\n",
    "> * Used intelligent transforms that infer your logic from examples\n",
    "> * Merged and packaged datasets for machine learning training\n",
    "\n",
    "You are ready to use this training data in the next part of the tutorial series:\n",
    "\n",
    "> [Tutorial #2: Train regression model](regression-part2-automated-ml.ipynb)"
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "cforbe"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  },
  "msauthor": "trbye"
 },
 "nbformat": 4,
 "nbformat_minor": 2
}