tutorial update

Roope Astala
2019-02-11 16:07:10 -05:00
parent 90aaeea113
commit 82bb9fcac3
4 changed files with 246 additions and 242 deletions


@@ -13,14 +13,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial (part 1): Prepare data for regression modeling"
"# Tutorial: Prepare data for regression modeling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, you learn how to prep data for regression modeling using the Azure Machine Learning Data Prep SDK. Perform various transformations to filter and combine two different NYC Taxi data sets. The end goal of this tutorial set is to predict the cost of a taxi trip by training a model on data features including pickup hour, day of week, number of passengers, and coordinates. This tutorial is part one of a two-part tutorial series.\n",
"In this tutorial, you learn how to prepare data for regression modeling by using the Azure Machine Learning Data Prep SDK. You run various transformations to filter and combine two different NYC taxi data sets.\n",
"\n",
"This tutorial is **part one of a two-part tutorial series**. After you complete the tutorial series, you can predict the cost of a taxi trip by training a model on data features. These features include the pickup day and time, the number of passengers, and the pickup location.\n",
"\n",
"In this tutorial, you:\n",
"\n",
@@ -29,17 +31,39 @@
"> * Load two datasets with different field names\n",
"> * Cleanse data to remove anomalies\n",
"> * Transform data using intelligent transforms to create new features\n",
"> * Save your dataflow object to use in a regression model\n",
"\n",
"You can prepare your data in Python using the [Azure Machine Learning Data Prep SDK](https://aka.ms/data-prep-sdk)."
"> * Save your dataflow object to use in a regression model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import packages\n",
"Begin by importing the SDK."
"## Prerequisites\n",
"\n",
"To run the notebook you will need:\n",
"\n",
"* A Python 3.6 notebook server with the following installed:\n",
" * The Azure Machine Learning Data Prep SDK for Python\n",
"* The tutorial notebook\n",
"\n",
"Navigate back to the [tutorial page](https://docs.microsoft.com/azure/machine-learning/service/tutorial-data-prep) for specific environment setup instructions.\n",
"\n",
"## <a name=\"start\"></a>Set up your development environment\n",
"\n",
"All the setup for your development work can be accomplished in a Python notebook. Setup includes the following actions:\n",
"\n",
"* Install the SDK\n",
"* Import Python packages\n",
"\n",
"### Install and import packages\n",
"\n",
"Use the following to install necessary packages if you don't already have them.\n",
"\n",
"```shell\n",
"pip install azureml-dataprep\n",
"```\n",
"\n",
"Import the SDK."
]
},
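{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch, the import cell that follows assumes the standard module path for the SDK and the `dprep` alias used throughout this notebook:\n",
"\n",
"```python\n",
"# Alias assumed by the rest of the notebook (for example, dprep.read_csv).\n",
"import azureml.dataprep as dprep\n",
"```"
]
},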
{
@@ -65,17 +89,25 @@
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import display\n",
"dataset_root = \"https://dprepdata.blob.core.windows.net/demo\"\n",
"\n",
"green_path = \"/\".join([dataset_root, \"green-small/*\"])\n",
"yellow_path = \"/\".join([dataset_root, \"yellow-small/*\"])\n",
"\n",
"green_df = dprep.read_csv(path=green_path, header=dprep.PromoteHeadersMode.GROUPED)\n",
"# auto_read_file will automatically identify and parse the file type, and is useful if you don't know the file type\n",
"yellow_df = dprep.auto_read_file(path=yellow_path)\n",
"green_df_raw = dprep.read_csv(path=green_path, header=dprep.PromoteHeadersMode.GROUPED)\n",
"# auto_read_file automatically identifies and parses the file type, which is useful when you don't know the file type.\n",
"yellow_df_raw = dprep.auto_read_file(path=yellow_path)\n",
"\n",
"green_df.head(5)\n",
"yellow_df.head(5)"
"display(green_df_raw.head(5))\n",
"display(yellow_df_raw.head(5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `Dataflow` object is similar to a dataframe, and represents a series of lazily-evaluated, immutable operations on data. Operations can be added by invoking the different transformation and filtering methods available. The result of adding an operation to a `Dataflow` is always a new `Dataflow` object."
]
},
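{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, here's a minimal sketch (using the `green_df_raw` dataflow loaded above and its original `Trip_distance` column) that shows how each method call returns a new `Dataflow` while leaving the original unchanged, and how nothing is read until records are requested:\n",
"\n",
"```python\n",
"# keep_columns returns a new Dataflow; green_df_raw itself is not modified.\n",
"distance_only_df = green_df_raw.keep_columns(columns=[\"Trip_distance\"])\n",
"\n",
"# Execution is deferred until records are actually pulled, for example with head().\n",
"display(distance_only_df.head(5))  # only the Trip_distance column\n",
"display(green_df_raw.head(5))      # still shows every original column\n",
"```"
]
},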
{
@@ -89,7 +121,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you populate some variables with shortcut transforms that will apply to all dataflows. The variable `drop_if_all_null` will be used to delete records where all fields are null. The variable `useful_columns` holds an array of column descriptions that are retained in each dataflow."
"Now you populate some variables with shortcut transforms to apply to all dataflows. The `drop_if_all_null` variable is used to delete records where all fields are null. The `useful_columns` variable holds an array of column descriptions that are kept in each dataflow."
]
},
{
@@ -110,7 +142,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You first work with the green taxi data and get it into a valid shape that can be combined with the yellow taxi data. Create a temporary dataflow `tmp_df`, and call the `replace_na()`, `drop_nulls()`, and `keep_columns()` functions using the shortcut transform variables you created. Additionally, rename all the columns in the dataframe to match the names in `useful_columns`."
"You first work with the green taxi data to get it into a valid shape that can be combined with the yellow taxi data. Call the `replace_na()`, `drop_nulls()`, and `keep_columns()` functions by using the shortcut transform variables you created. Additionally, rename all the columns in the dataframe to match the names in the `useful_columns` variable."
]
},
{
@@ -119,7 +151,7 @@
"metadata": {},
"outputs": [],
"source": [
"tmp_df = (green_df\n",
"green_df = (green_df_raw\n",
" .replace_na(columns=all_columns)\n",
" .drop_nulls(*drop_if_all_null)\n",
" .rename_columns(column_pairs={\n",
@@ -138,14 +170,14 @@
" \"Trip_distance\": \"distance\"\n",
" })\n",
" .keep_columns(columns=useful_columns))\n",
"tmp_df.head(5)"
"green_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Overwrite the `green_df` variable with the transforms performed on `tmp_df` in the previous step."
"Run the same transformation steps on the yellow taxi data. These functions ensure that null data is removed from the data set, which will help increase machine learning model accuracy."
]
},
{
@@ -154,23 +186,7 @@
"metadata": {},
"outputs": [],
"source": [
"green_df = tmp_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perform the same transformation steps to the yellow taxi data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = (yellow_df\n",
"yellow_df = (yellow_df_raw\n",
" .replace_na(columns=all_columns)\n",
" .drop_nulls(*drop_if_all_null)\n",
" .rename_columns(column_pairs={\n",
@@ -195,14 +211,14 @@
" \"trip_distance\": \"distance\"\n",
" })\n",
" .keep_columns(columns=useful_columns))\n",
"tmp_df.head(5)"
"yellow_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, overwrite `yellow_df` with `tmp_df`, and then call the `append_rows()` function on the green taxi data to append the yellow taxi data, creating a new combined dataframe."
"Call the `append_rows()` function on the green taxi data to append the yellow taxi data. A new combined dataframe is created."
]
},
{
@@ -211,7 +227,6 @@
"metadata": {},
"outputs": [],
"source": [
"yellow_df = tmp_df\n",
"combined_df = green_df.append_rows([yellow_df])"
]
},
@@ -226,7 +241,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Examine the pickup and drop-off coordinates summary statistics to see how the data is distributed. First define a `TypeConverter` object to change the lat/long fields to decimal type. Next, call the `keep_columns()` function to restrict output to only the lat/long fields, and then call `get_profile()`."
"Examine the pickup and drop-off coordinates summary statistics to see how the data is distributed. First, define a `TypeConverter` object to change the latitude and longitude fields to decimal type. Next, call the `keep_columns()` function to restrict output to only the latitude and longitude fields, and then call the `get_profile()` function. These function calls create a condensed view of the dataflow to just show the lat/long fields, which makes it easier to evaluate missing or out-of-scope coordinates."
]
},
{
@@ -243,7 +258,7 @@
" \"dropoff_latitude\": decimal_type\n",
"})\n",
"combined_df.keep_columns(columns=[\n",
" \"pickup_longitude\", \"pickup_latitude\", \n",
" \"pickup_longitude\", \"pickup_latitude\",\n",
" \"dropoff_longitude\", \"dropoff_latitude\"\n",
"]).get_profile()"
]
@@ -252,7 +267,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"From the summary statistics output, you see that there are coordinates that are missing, and coordinates that are not in New York City. Filter out coordinates not in the city border by chaining column filter commands within the `filter()` function, and defining minimum and maximum bounds for each field. Then call `get_profile()` again to verify the transformation."
"From the summary statistics output, you see there are missing coordinates and coordinates that aren't in New York City (this is determined from subjective analysis). Filter out coordinates for locations that are outside the city border. Chain the column filter commands within the `filter()` function and define the minimum and maximum bounds for each field. Then call the `get_profile()` function again to verify the transformation."
]
},
{
@@ -261,11 +276,11 @@
"metadata": {},
"outputs": [],
"source": [
"tmp_df = (combined_df\n",
"latlong_filtered_df = (combined_df\n",
" .drop_nulls(\n",
" columns=[\"pickup_longitude\", \"pickup_latitude\", \"dropoff_longitude\", \"dropoff_latitude\"],\n",
" column_relationship=dprep.ColumnRelationship(dprep.ColumnRelationship.ANY)\n",
" ) \n",
" )\n",
" .filter(dprep.f_and(\n",
" dprep.col(\"pickup_longitude\") <= -73.72,\n",
" dprep.col(\"pickup_longitude\") >= -74.09,\n",
@@ -276,28 +291,12 @@
" dprep.col(\"dropoff_latitude\") <= 40.88,\n",
" dprep.col(\"dropoff_latitude\") >= 40.53\n",
" )))\n",
"tmp_df.keep_columns(columns=[\n",
" \"pickup_longitude\", \"pickup_latitude\", \n",
"latlong_filtered_df.keep_columns(columns=[\n",
" \"pickup_longitude\", \"pickup_latitude\",\n",
" \"dropoff_longitude\", \"dropoff_latitude\"\n",
"]).get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Overwrite `combined_df` with the transformations you made to `tmp_df`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"combined_df = tmp_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -309,7 +308,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at the data profile for the `store_forward` column."
"Look at the data profile for the `store_forward` column. This field is a boolean flag that is `Y` when the taxi did not have a connection to the server after the trip, and thus had to store the trip data in memory, and later forward it to the server when connected."
]
},
{
@@ -318,14 +317,14 @@
"metadata": {},
"outputs": [],
"source": [
"combined_df.keep_columns(columns='store_forward').get_profile()"
"latlong_filtered_df.keep_columns(columns='store_forward').get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the data profile output of `store_forward`, you see that the data is inconsistent and there are missing/null values. Replace these values using the `replace()` and `fill_nulls()` functions, and in both cases change to the string \"N\"."
"Notice that the data profile output in the `store_forward` column shows that the data is inconsistent and there are missing or null values. Use the `replace()` and `fill_nulls()` functions to replace these values with the string \"N\":"
]
},
{
@@ -334,14 +333,14 @@
"metadata": {},
"outputs": [],
"source": [
"combined_df = combined_df.replace(columns=\"store_forward\", find=\"0\", replace_with=\"N\").fill_nulls(\"store_forward\", \"N\")"
"replaced_stfor_vals_df = latlong_filtered_df.replace(columns=\"store_forward\", find=\"0\", replace_with=\"N\").fill_nulls(\"store_forward\", \"N\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute another `replace` function, this time on the `distance` field. This reformats distance values that are incorrectly labeled as `.00`, and fills any nulls with zeros. Convert the `distance` field to numerical format."
"Execute the `replace` function on the `distance` field. The function reformats distance values that are incorrectly labeled as `.00`, and fills any nulls with zeros. Convert the `distance` field to numerical format. These incorrect data points are likely anomolies in the data collection system on the taxi cabs."
]
},
{
@@ -350,15 +349,15 @@
"metadata": {},
"outputs": [],
"source": [
"combined_df = combined_df.replace(columns=\"distance\", find=\".00\", replace_with=0).fill_nulls(\"distance\", 0)\n",
"combined_df = combined_df.to_number([\"distance\"])"
"replaced_distance_vals_df = replaced_stfor_vals_df.replace(columns=\"distance\", find=\".00\", replace_with=0).fill_nulls(\"distance\", 0)\n",
"replaced_distance_vals_df = replaced_distance_vals_df.to_number([\"distance\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the pick up and drop off datetimes into respective date and time columns. Use `split_column_by_example()` to perform the split. In this case, the optional `example` parameter of `split_column_by_example()` is omitted. Therefore the function will automatically determine where to split based on the data."
"Split the pickup and dropoff datetime values into the respective date and time columns. Use the `split_column_by_example()` function to make the split. In this case, the optional `example` parameter of the `split_column_by_example()` function is omitted. Therefore, the function automatically determines where to split based on the data."
]
},
{
@@ -367,10 +366,10 @@
"metadata": {},
"outputs": [],
"source": [
"tmp_df = (combined_df\n",
"time_split_df = (replaced_distance_vals_df\n",
" .split_column_by_example(source_column=\"pickup_datetime\")\n",
" .split_column_by_example(source_column=\"dropoff_datetime\"))\n",
"tmp_df.head(5)"
"time_split_df.head(5)"
]
},
{
@@ -386,21 +385,21 @@
"metadata": {},
"outputs": [],
"source": [
"tmp_df_renamed = (tmp_df\n",
"renamed_col_df = (time_split_df\n",
" .rename_columns(column_pairs={\n",
" \"pickup_datetime_1\": \"pickup_date\",\n",
" \"pickup_datetime_2\": \"pickup_time\",\n",
" \"dropoff_datetime_1\": \"dropoff_date\",\n",
" \"dropoff_datetime_2\": \"dropoff_time\"\n",
" }))\n",
"tmp_df_renamed.head(5)"
"renamed_col_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Overwrite `combined_df` with the executed transformations, and then call `get_profile()` to see full summary statistics after all transformations."
"Call the `get_profile()` function to see the full summary statistics after all cleansing steps."
]
},
{
@@ -409,8 +408,7 @@
"metadata": {},
"outputs": [],
"source": [
"combined_df = tmp_df_renamed\n",
"combined_df.get_profile()"
"renamed_col_df.get_profile()"
]
},
{
@@ -424,9 +422,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the pickup and drop-off date further into day of week, day of month, and month. To get day of week, use the `derive_column_by_example()` function. This function takes as a parameter an array of example objects that define the input data, and the desired output. The function then automatically determines your desired transformation. For pickup and drop-off time columns, split into hour, minute, and second using the `split_column_by_example()` function with no example parameter.\n",
"Split the pickup and dropoff date further into the day of the week, day of the month, and month values. To get the day of the week value, use the `derive_column_by_example()` function. The function takes an array parameter of example objects that define the input data, and the preferred output. The function automatically determines your preferred transformation. For the pickup and dropoff time columns, split the time into the hour, minute, and second by using the `split_column_by_example()` function with no example parameter.\n",
"\n",
"Once you have generated these new features, delete the original fields in favor of the newly generated features using `drop_columns()`. Rename all remaining fields to accurate descriptions."
"After you generate the new features, use the `drop_columns()` function to delete the original fields as the newly generated features are preferred. Rename the rest of the fields to use meaningful descriptions.\n",
"\n",
"Transforming the data in this way to create new time-based features will improve machine learning model accuracy. For example, generating a new feature for the weekday will help establish a relationship between the day of the week and the taxi fare price, which is often more expensive on certain days of the week due to high demand."
]
},
{
@@ -435,10 +435,10 @@
"metadata": {},
"outputs": [],
"source": [
"tmp_df = (combined_df\n",
"transformed_features_df = (renamed_col_df\n",
" .derive_column_by_example(\n",
" source_columns=\"pickup_date\", \n",
" new_column_name=\"pickup_weekday\", \n",
" source_columns=\"pickup_date\",\n",
" new_column_name=\"pickup_weekday\",\n",
" example_data=[(\"2009-01-04\", \"Sunday\"), (\"2013-08-22\", \"Thursday\")]\n",
" )\n",
" .derive_column_by_example(\n",
@@ -446,17 +446,17 @@
" new_column_name=\"dropoff_weekday\",\n",
" example_data=[(\"2013-08-22\", \"Thursday\"), (\"2013-11-03\", \"Sunday\")]\n",
" )\n",
" \n",
"\n",
" .split_column_by_example(source_column=\"pickup_time\")\n",
" .split_column_by_example(source_column=\"dropoff_time\")\n",
" # the following two split_column_by_example calls reference the generated column names from the above two calls\n",
" # The following two calls to split_column_by_example reference the column names generated from the previous two calls.\n",
" .split_column_by_example(source_column=\"pickup_time_1\")\n",
" .split_column_by_example(source_column=\"dropoff_time_1\")\n",
" .drop_columns(columns=[\n",
" \"pickup_date\", \"pickup_time\", \"dropoff_date\", \"dropoff_time\", \n",
" \"pickup_date\", \"pickup_time\", \"dropoff_date\", \"dropoff_time\",\n",
" \"pickup_date_1\", \"dropoff_date_1\", \"pickup_time_1\", \"dropoff_time_1\"\n",
" ])\n",
" \n",
"\n",
" .rename_columns(column_pairs={\n",
" \"pickup_date_2\": \"pickup_month\",\n",
" \"pickup_date_3\": \"pickup_monthday\",\n",
@@ -470,14 +470,14 @@
" \"dropoff_time_2\": \"dropoff_second\"\n",
" }))\n",
"\n",
"tmp_df.head(5)"
"transformed_features_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the data above, you see that the pickup and drop-off date and time components produced from the derived transformations are correct. Drop the `pickup_datetime` and `dropoff_datetime` columns as they are no longer needed."
"Notice that the data shows that the pickup and dropoff date and time components produced from the derived transformations are correct. Drop the `pickup_datetime` and `dropoff_datetime` columns because they're no longer needed (granular time features like hour, minute and second are more useful for model training)."
]
},
{
@@ -486,7 +486,7 @@
"metadata": {},
"outputs": [],
"source": [
"tmp_df = tmp_df.drop_columns(columns=[\"pickup_datetime\", \"dropoff_datetime\"])"
"processed_df = transformed_features_df.drop_columns(columns=[\"pickup_datetime\", \"dropoff_datetime\"])"
]
},
{
@@ -502,7 +502,7 @@
"metadata": {},
"outputs": [],
"source": [
"type_infer = tmp_df.builders.set_column_types()\n",
"type_infer = processed_df.builders.set_column_types()\n",
"type_infer.learn()\n",
"type_infer"
]
@@ -511,7 +511,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The inference results look correct based on the data, now apply the type conversions to the dataflow."
"The inference results look correct based on the data. Now apply the type conversions to the dataflow."
]
},
{
@@ -520,15 +520,15 @@
"metadata": {},
"outputs": [],
"source": [
"tmp_df = type_infer.to_dataflow()\n",
"tmp_df.get_profile()"
"type_converted_df = type_infer.to_dataflow()\n",
"type_converted_df.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before packaging the dataflow, perform two final filters on the data set. To eliminate incorrect data points, filter the dataflow on records where both the `cost` and `distance` are greater than zero."
"Before you package the dataflow, run two final filters on the data set. To eliminate incorrectly captured data points, filter the dataflow on records where both the `cost` and `distance` variable values are greater than zero. This step will significantly improve machine learning model accuracy, because data points with a zero cost or distance represent major outliers that throw off prediction accuracy."
]
},
{
@@ -537,15 +537,15 @@
"metadata": {},
"outputs": [],
"source": [
"tmp_df = tmp_df.filter(dprep.col(\"distance\") > 0)\n",
"tmp_df = tmp_df.filter(dprep.col(\"cost\") > 0)"
"final_df = type_converted_df.filter(dprep.col(\"distance\") > 0)\n",
"final_df = final_df.filter(dprep.col(\"cost\") > 0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point, you have a fully transformed and prepared dataflow object to use in a machine learning model. The DataPrep SDK includes object serialization functionality, which is used as follows."
"You now have a fully transformed and prepared dataflow object to use in a machine learning model. The SDK includes object serialization functionality, which is used as shown in the following code."
]
},
{
@@ -557,8 +557,7 @@
"import os\n",
"file_path = os.path.join(os.getcwd(), \"dflows.dprep\")\n",
"\n",
"dflow_prepared = tmp_df\n",
"package = dprep.Package([dflow_prepared])\n",
"package = dprep.Package([final_df])\n",
"package.save(file_path)"
]
},
@@ -573,7 +572,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Delete the file `dflows.dprep` (whether you are running locally or in Azure Notebooks) in your current directory if you do not wish to continue with part two of the tutorial. If you continue on to part two, you will need the `dflows.dprep` file in the current directory."
"To continue with part two of the tutorial, you need the **dflows.dprep** file in the current directory.\n",
"\n",
"If you don't plan to continue to part two, delete the **dflows.dprep** file in your current directory. Delete this file whether you're running the execution locally or in [Azure Notebooks](https://notebooks.azure.com/)."
]
},
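{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd rather remove the file from within the notebook, here's a minimal sketch (assuming `file_path` still points at **dflows.dprep**, as defined in the save step above):\n",
"\n",
"```python\n",
"import os\n",
"\n",
"# Only delete the serialized dataflow package if you won't continue to part two.\n",
"if os.path.exists(file_path):\n",
"    os.remove(file_path)\n",
"```"
]
},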
{