MachineLearningNotebooks/dataprep/getting-started.ipynb
rastala d10b1fa796 Revert "Updated notebook folders"
This reverts commit 06728004b6.
2018-11-20 10:39:48 -05:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting started with Azure ML Data Prep SDK\n",
"Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Note: Some features in this Notebook will _not_ work with the Private Preview version of the SDK; it assumes the Public Preview version.\n",
"\n",
"Wonder how you can make the most of the Azure ML Data Prep SDK? In this \"Getting Started\" guide, we'll showcase a few highlights that make this SDK shine for big datasets where `pandas` and `dplyr` can fall short. Using the [Ford GoBike dataset](https://www.fordgobike.com/system-data) as an example, we'll cover how to build Dataflows that allow you to:\n",
"\n",
"* [Read in data](#Read-in-data)\n",
"* [Get a profile of your data](#Get-data-profile)\n",
"* [Apply smart transforms by Microsoft Research](#Derive-by-example)\n",
"* [Filter quickly](#Filter-our-data)\n",
"* [Apply common data science transforms](#Transform-our-data)\n",
"* [Easily handle errors and assertions](#Assert-on-invalid-data)\n",
"* [Prepare your dataset for export and machine learning](#Export-for-machine-learning)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import display\n",
"from os import path\n",
"from tempfile import mkdtemp\n",
"\n",
"import pandas as pd\n",
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read in data\n",
"\n",
"Azure ML Data Prep supports many different file reading formats (i.e. CSV, Excel, Parquet), and also offers the ability to infer column types automatically. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration_sec</th>\n",
" <th>start_time</th>\n",
" <th>end_time</th>\n",
" <th>start_station_id</th>\n",
" <th>start_station_name</th>\n",
" <th>start_station_latitude</th>\n",
" <th>start_station_longitude</th>\n",
" <th>end_station_id</th>\n",
" <th>end_station_name</th>\n",
" <th>end_station_latitude</th>\n",
" <th>end_station_longitude</th>\n",
" <th>bike_id</th>\n",
" <th>user_type</th>\n",
" <th>member_birth_year</th>\n",
" <th>member_gender</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>80110.0</td>\n",
" <td>2017-12-31 16:57:39.654000+00:00</td>\n",
" <td>2018-01-01 15:12:50.245000+00:00</td>\n",
" <td>74.0</td>\n",
" <td>Laguna St at Hayes St</td>\n",
" <td>37.776435</td>\n",
" <td>-122.426244</td>\n",
" <td>43.0</td>\n",
" <td>San Francisco Public Library (Grove St at Hyde...</td>\n",
" <td>37.778768</td>\n",
" <td>-122.415929</td>\n",
" <td>96.0</td>\n",
" <td>Customer</td>\n",
" <td>1987.0</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>78800.0</td>\n",
" <td>2017-12-31 15:56:34.842000+00:00</td>\n",
" <td>2018-01-01 13:49:55.617000+00:00</td>\n",
" <td>284.0</td>\n",
" <td>Yerba Buena Center for the Arts (Howard St at ...</td>\n",
" <td>37.784872</td>\n",
" <td>-122.400876</td>\n",
" <td>96.0</td>\n",
" <td>Dolores St at 15th St</td>\n",
" <td>37.766210</td>\n",
" <td>-122.426614</td>\n",
" <td>88.0</td>\n",
" <td>Customer</td>\n",
" <td>1965.0</td>\n",
" <td>Female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>45768.0</td>\n",
" <td>2017-12-31 22:45:48.411000+00:00</td>\n",
" <td>2018-01-01 11:28:36.883000+00:00</td>\n",
" <td>245.0</td>\n",
" <td>Downtown Berkeley BART</td>\n",
" <td>37.870348</td>\n",
" <td>-122.267764</td>\n",
" <td>245.0</td>\n",
" <td>Downtown Berkeley BART</td>\n",
" <td>37.870348</td>\n",
" <td>-122.267764</td>\n",
" <td>1094.0</td>\n",
" <td>Customer</td>\n",
" <td>NaN</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>62172.0</td>\n",
" <td>2017-12-31 17:31:10.636000+00:00</td>\n",
" <td>2018-01-01 10:47:23.531000+00:00</td>\n",
" <td>60.0</td>\n",
" <td>8th St at Ringold St</td>\n",
" <td>37.774520</td>\n",
" <td>-122.409449</td>\n",
" <td>5.0</td>\n",
" <td>Powell St BART Station (Market St at 5th St)</td>\n",
" <td>37.783899</td>\n",
" <td>-122.408445</td>\n",
" <td>2831.0</td>\n",
" <td>Customer</td>\n",
" <td>NaN</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>43603.0</td>\n",
" <td>2017-12-31 14:23:14.001000+00:00</td>\n",
" <td>2018-01-01 02:29:57.571000+00:00</td>\n",
" <td>239.0</td>\n",
" <td>Bancroft Way at Telegraph Ave</td>\n",
" <td>37.868813</td>\n",
" <td>-122.258764</td>\n",
" <td>247.0</td>\n",
" <td>Fulton St at Bancroft Way</td>\n",
" <td>37.867789</td>\n",
" <td>-122.265896</td>\n",
" <td>3167.0</td>\n",
" <td>Subscriber</td>\n",
" <td>1997.0</td>\n",
" <td>Female</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" duration_sec start_time \\\n",
"0 80110.0 2017-12-31 16:57:39.654000+00:00 \n",
"1 78800.0 2017-12-31 15:56:34.842000+00:00 \n",
"2 45768.0 2017-12-31 22:45:48.411000+00:00 \n",
"3 62172.0 2017-12-31 17:31:10.636000+00:00 \n",
"4 43603.0 2017-12-31 14:23:14.001000+00:00 \n",
"\n",
" end_time start_station_id \\\n",
"0 2018-01-01 15:12:50.245000+00:00 74.0 \n",
"1 2018-01-01 13:49:55.617000+00:00 284.0 \n",
"2 2018-01-01 11:28:36.883000+00:00 245.0 \n",
"3 2018-01-01 10:47:23.531000+00:00 60.0 \n",
"4 2018-01-01 02:29:57.571000+00:00 239.0 \n",
"\n",
" start_station_name start_station_latitude \\\n",
"0 Laguna St at Hayes St 37.776435 \n",
"1 Yerba Buena Center for the Arts (Howard St at ... 37.784872 \n",
"2 Downtown Berkeley BART 37.870348 \n",
"3 8th St at Ringold St 37.774520 \n",
"4 Bancroft Way at Telegraph Ave 37.868813 \n",
"\n",
" start_station_longitude end_station_id \\\n",
"0 -122.426244 43.0 \n",
"1 -122.400876 96.0 \n",
"2 -122.267764 245.0 \n",
"3 -122.409449 5.0 \n",
"4 -122.258764 247.0 \n",
"\n",
" end_station_name end_station_latitude \\\n",
"0 San Francisco Public Library (Grove St at Hyde... 37.778768 \n",
"1 Dolores St at 15th St 37.766210 \n",
"2 Downtown Berkeley BART 37.870348 \n",
"3 Powell St BART Station (Market St at 5th St) 37.783899 \n",
"4 Fulton St at Bancroft Way 37.867789 \n",
"\n",
" end_station_longitude bike_id user_type member_birth_year member_gender \n",
"0 -122.415929 96.0 Customer 1987.0 Male \n",
"1 -122.426614 88.0 Customer 1965.0 Female \n",
"2 -122.267764 1094.0 Customer NaN \n",
"3 -122.408445 2831.0 Customer NaN \n",
"4 -122.265896 3167.0 Subscriber 1997.0 Female "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gobike = dprep\\\n",
" .read_csv(\n",
" path='https://dprepdata.blob.core.windows.net/demo/ford_gobike/2017-fordgobike-tripdata.csv',\n",
" inference_arguments=dprep.InferenceArguments.current_culture()\n",
" )\n",
"gobike.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to iterate more quickly, we can take a sample of our data. Later, we can then apply the same transformations to the entire dataset."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"sampled_gobike = gobike.take_sample(probability=0.1, seed=5)"
]
},
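{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough mental model of probability-based sampling, the sketch below keeps each row independently with probability 0.1, using plain `pandas` and `numpy` on made-up data. This is an assumption about the semantics, not the SDK's implementation, and `toy_trips` and `toy_sample` are hypothetical names. Under this model the sample size is close to, but usually not exactly, 10% of the rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in data; this is not the GoBike dataset.\n",
"toy_trips = pd.DataFrame({'duration_sec': np.arange(10000)})\n",
"\n",
"# Keep each row independently with probability 0.1.\n",
"# A fixed seed makes the selection reproducible.\n",
"rng = np.random.default_rng(5)\n",
"keep_mask = rng.random(len(toy_trips)) < 0.1\n",
"toy_sample = toy_trips[keep_mask]\n",
"\n",
"# Compare the sample size with the full row count.\n",
"print(len(toy_sample), len(toy_trips))"
]
},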
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get data profile\n",
"\n",
"Let's understand what our data looks like. Azure ML Data Prep facilitates this process by offering data profiles that help us glimpse into column types and column summary statistics."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Type</th>\n",
" <th>Min</th>\n",
" <th>Max</th>\n",
" <th>Count</th>\n",
" <th>Missing Count</th>\n",
" <th>Error Count</th>\n",
" <th>Lower Quartile</th>\n",
" <th>Upper Quartile</th>\n",
" <th>Standard Deviation</th>\n",
" <th>Mean</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>duration_sec</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>61</td>\n",
" <td>86369</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>381.842</td>\n",
" <td>938.574</td>\n",
" <td>3444.15</td>\n",
" <td>1099.01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_time</th>\n",
" <td>FieldType.DATE</td>\n",
" <td>2017-06-28 09:47:36.347000+00:00</td>\n",
" <td>2017-12-31 23:59:01.261000+00:00</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_time</th>\n",
" <td>FieldType.DATE</td>\n",
" <td>2017-06-28 09:52:55.338000+00:00</td>\n",
" <td>2018-01-01 15:12:50.245000+00:00</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_station_id</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>3</td>\n",
" <td>340</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>23.8481</td>\n",
" <td>139.424</td>\n",
" <td>86.0831</td>\n",
" <td>95.0342</td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_station_name</th>\n",
" <td>FieldType.STRING</td>\n",
" <td>10th Ave at E 15th St</td>\n",
" <td>Yerba Buena Center for the Arts (Howard St at ...</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_station_latitude</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>37.3173</td>\n",
" <td>37.8802</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>37.7736</td>\n",
" <td>37.7953</td>\n",
" <td>0.086305</td>\n",
" <td>37.7717</td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_station_longitude</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>-122.444</td>\n",
" <td>-121.874</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-122.412</td>\n",
" <td>-122.391</td>\n",
" <td>0.105573</td>\n",
" <td>-122.364</td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_station_id</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>3</td>\n",
" <td>340</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>22.7024</td>\n",
" <td>134.22</td>\n",
" <td>84.9695</td>\n",
" <td>92.184</td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_station_name</th>\n",
" <td>FieldType.STRING</td>\n",
" <td>10th Ave at E 15th St</td>\n",
" <td>Yerba Buena Center for the Arts (Howard St at ...</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_station_latitude</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>37.3173</td>\n",
" <td>37.8802</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>37.7742</td>\n",
" <td>37.7956</td>\n",
" <td>0.0862238</td>\n",
" <td>37.7718</td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_station_longitude</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>-122.444</td>\n",
" <td>-121.874</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-122.41</td>\n",
" <td>-122.391</td>\n",
" <td>0.105122</td>\n",
" <td>-122.363</td>\n",
" </tr>\n",
" <tr>\n",
" <th>bike_id</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>10</td>\n",
" <td>3733</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>788.679</td>\n",
" <td>2519.96</td>\n",
" <td>971.357</td>\n",
" <td>1672.53</td>\n",
" </tr>\n",
" <tr>\n",
" <th>user_type</th>\n",
" <td>FieldType.STRING</td>\n",
" <td>Customer</td>\n",
" <td>Subscriber</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>member_birth_year</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>1886</td>\n",
" <td>1999</td>\n",
" <td>519700.0</td>\n",
" <td>66541.0</td>\n",
" <td>0.0</td>\n",
" <td>1974.33</td>\n",
" <td>1987.99</td>\n",
" <td>10.5135</td>\n",
" <td>1980.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>member_gender</th>\n",
" <td>FieldType.STRING</td>\n",
" <td></td>\n",
" <td>Other</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"ColumnProfile\n",
" name: duration_sec\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 61.0\n",
" max: 86369.0\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 381.8421435321134\n",
" median: 595.9837506906349\n",
" upper_quartile: 938.5741138032683\n",
" std: 3444.146451247386\n",
" mean: 1099.009520877422\n",
"\n",
"ColumnProfile\n",
" name: start_time\n",
" type: FieldType.DATE\n",
"\n",
" min: 2017-06-28 09:47:36.347000+00:00\n",
" max: 2017-12-31 23:59:01.261000+00:00\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: end_time\n",
" type: FieldType.DATE\n",
"\n",
" min: 2017-06-28 09:52:55.338000+00:00\n",
" max: 2018-01-01 15:12:50.245000+00:00\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: start_station_id\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 3.0\n",
" max: 340.0\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 23.848148131600635\n",
" median: 67.18427817406452\n",
" upper_quartile: 139.42430180307275\n",
" std: 86.08307797095921\n",
" mean: 95.03424475658852\n",
"\n",
"ColumnProfile\n",
" name: start_station_name\n",
" type: FieldType.STRING\n",
"\n",
" min: 10th Ave at E 15th St\n",
" max: Yerba Buena Center for the Arts (Howard St at 3rd St)\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: start_station_latitude\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 37.3172979\n",
" max: 37.88022244590679\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 37.7735913559721\n",
" median: 37.783211475877295\n",
" upper_quartile: 37.79531236950411\n",
" std: 0.08630496061661774\n",
" mean: 37.771652603110894\n",
"\n",
"ColumnProfile\n",
" name: start_station_longitude\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: -122.44429260492325\n",
" max: -121.8741186\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: -122.41170653070694\n",
" median: -122.39875282257843\n",
" upper_quartile: -122.39103429266093\n",
" std: 0.10557344899193394\n",
" mean: -122.36392726512949\n",
"\n",
"ColumnProfile\n",
" name: end_station_id\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 3.0\n",
" max: 340.0\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 22.702361193995444\n",
" median: 65.22613324081779\n",
" upper_quartile: 134.21987129021295\n",
" std: 84.9694914863546\n",
" mean: 92.18404079276426\n",
"\n",
"ColumnProfile\n",
" name: end_station_name\n",
" type: FieldType.STRING\n",
"\n",
" min: 10th Ave at E 15th St\n",
" max: Yerba Buena Center for the Arts (Howard St at 3rd St)\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: end_station_latitude\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 37.3172979\n",
" max: 37.88022244590679\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 37.774232065528906\n",
" median: 37.78329810124021\n",
" upper_quartile: 37.79557128475191\n",
" std: 0.08622383487119635\n",
" mean: 37.771843749644646\n",
"\n",
"ColumnProfile\n",
" name: end_station_longitude\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: -122.44429260492325\n",
" max: -121.8741186\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: -122.41012752595213\n",
" median: -122.39855511689811\n",
" upper_quartile: -122.39096192032446\n",
" std: 0.10512220222934929\n",
" mean: -122.36323553679931\n",
"\n",
"ColumnProfile\n",
" name: bike_id\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 10.0\n",
" max: 3733.0\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 788.6785454424829\n",
" median: 1726.652793720984\n",
" upper_quartile: 2519.963581272433\n",
" std: 971.3569593530214\n",
" mean: 1672.533078699254\n",
"\n",
"ColumnProfile\n",
" name: user_type\n",
" type: FieldType.STRING\n",
"\n",
" min: Customer\n",
" max: Subscriber\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: member_birth_year\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 1886.0\n",
" max: 1999.0\n",
" count: 519700.0\n",
" missing_count: 66541.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 1974.3341624985283\n",
" median: 1982.8007516297655\n",
" upper_quartile: 1987.9916166785322\n",
" std: 10.51348753990893\n",
" mean: 1980.4047872821984\n",
"\n",
"ColumnProfile\n",
" name: member_gender\n",
" type: FieldType.STRING\n",
"\n",
" min: \n",
" max: Other\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gobike.get_profile()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Type</th>\n",
" <th>Min</th>\n",
" <th>Max</th>\n",
" <th>Count</th>\n",
" <th>Missing Count</th>\n",
" <th>Error Count</th>\n",
" <th>Lower Quartile</th>\n",
" <th>Upper Quartile</th>\n",
" <th>Standard Deviation</th>\n",
" <th>Mean</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>duration_sec</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>61</td>\n",
" <td>85864</td>\n",
" <td>51853.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>381.017</td>\n",
" <td>936.399</td>\n",
" <td>3527.18</td>\n",
" <td>1102.23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_time</th>\n",
" <td>FieldType.DATE</td>\n",
" <td>2017-06-28 10:51:23.182000+00:00</td>\n",
" <td>2017-12-31 23:55:09.686000+00:00</td>\n",
" <td>51853.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_time</th>\n",
" <td>FieldType.DATE</td>\n",
" <td>2017-06-28 11:01:39.557000+00:00</td>\n",
" <td>2018-01-01 15:12:50.245000+00:00</td>\n",
" <td>51853.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_station_id</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>3</td>\n",
" <td>340</td>\n",
" <td>51853.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>23.823</td>\n",
" <td>139.679</td>\n",
" <td>86.0923</td>\n",
" <td>94.8785</td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_station_name</th>\n",
" <td>FieldType.STRING</td>\n",
" <td>10th Ave at E 15th St</td>\n",
" <td>Yerba Buena Center for the Arts (Howard St at ...</td>\n",
" <td>51853.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_station_latitude</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>37.3173</td>\n",
" <td>37.8802</td>\n",
" <td>51853.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>37.7736</td>\n",
" <td>37.7954</td>\n",
" <td>0.0862637</td>\n",
" <td>37.7717</td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_station_longitude</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>-122.444</td>\n",
" <td>-121.874</td>\n",
" <td>51853.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-122.412</td>\n",
" <td>-122.391</td>\n",
" <td>0.105593</td>\n",
" <td>-122.364</td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_station_id</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>3</td>\n",
" <td>338</td>\n",
" <td>51853.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>22.3474</td>\n",
" <td>135.081</td>\n",
" <td>85.0916</td>\n",
" <td>91.9201</td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_station_name</th>\n",
" <td>FieldType.STRING</td>\n",
" <td>10th Ave at E 15th St</td>\n",
" <td>Yerba Buena Center for the Arts (Howard St at ...</td>\n",
" <td>51853.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_station_latitude</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>37.3184</td>\n",
" <td>37.8802</td>\n",
" <td>51853.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>37.7745</td>\n",
" <td>37.7956</td>\n",
" <td>0.0861915</td>\n",
" <td>37.7719</td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_station_longitude</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>-122.444</td>\n",
" <td>-121.874</td>\n",
" <td>51853.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-122.41</td>\n",
" <td>-122.391</td>\n",
" <td>0.105075</td>\n",
" <td>-122.363</td>\n",
" </tr>\n",
" <tr>\n",
" <th>bike_id</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>10</td>\n",
" <td>3733</td>\n",
" <td>51853.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>795.89</td>\n",
" <td>2524.9</td>\n",
" <td>970.506</td>\n",
" <td>1674.51</td>\n",
" </tr>\n",
" <tr>\n",
" <th>user_type</th>\n",
" <td>FieldType.STRING</td>\n",
" <td>Customer</td>\n",
" <td>Subscriber</td>\n",
" <td>51853.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>member_birth_year</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>1900</td>\n",
" <td>1999</td>\n",
" <td>51853.0</td>\n",
" <td>6577.0</td>\n",
" <td>0.0</td>\n",
" <td>1974.29</td>\n",
" <td>1988.01</td>\n",
" <td>10.4148</td>\n",
" <td>1980.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>member_gender</th>\n",
" <td>FieldType.STRING</td>\n",
" <td></td>\n",
" <td>Other</td>\n",
" <td>51853.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"ColumnProfile\n",
" name: duration_sec\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 61.0\n",
" max: 85864.0\n",
" count: 51853.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 381.0173265588649\n",
" median: 596.0824682091602\n",
" upper_quartile: 936.3990413401431\n",
" std: 3527.1849383367376\n",
" mean: 1102.2291284978571\n",
"\n",
"ColumnProfile\n",
" name: start_time\n",
" type: FieldType.DATE\n",
"\n",
" min: 2017-06-28 10:51:23.182000+00:00\n",
" max: 2017-12-31 23:55:09.686000+00:00\n",
" count: 51853.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: end_time\n",
" type: FieldType.DATE\n",
"\n",
" min: 2017-06-28 11:01:39.557000+00:00\n",
" max: 2018-01-01 15:12:50.245000+00:00\n",
" count: 51853.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: start_station_id\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 3.0\n",
" max: 340.0\n",
" count: 51853.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 23.82299260050619\n",
" median: 66.81449005522046\n",
" upper_quartile: 139.6790865298709\n",
" std: 86.09232732608726\n",
" mean: 94.87848340501073\n",
"\n",
"ColumnProfile\n",
" name: start_station_name\n",
" type: FieldType.STRING\n",
"\n",
" min: 10th Ave at E 15th St\n",
" max: Yerba Buena Center for the Arts (Howard St at 3rd St)\n",
" count: 51853.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: start_station_latitude\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 37.3172979\n",
" max: 37.88022244590679\n",
" count: 51853.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 37.773594346717786\n",
" median: 37.78325255020885\n",
" upper_quartile: 37.795362857566715\n",
" std: 0.08626372544371842\n",
" mean: 37.771708918993944\n",
"\n",
"ColumnProfile\n",
" name: start_station_longitude\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: -122.44429260492325\n",
" max: -121.8741186\n",
" count: 51853.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: -122.41157512442906\n",
" median: -122.39882719487981\n",
" upper_quartile: -122.39096385593315\n",
" std: 0.10559301820942323\n",
" mean: -122.36375576045955\n",
"\n",
"ColumnProfile\n",
" name: end_station_id\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 3.0\n",
" max: 338.0\n",
" count: 51853.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 22.34742112221029\n",
" median: 65.60893574407544\n",
" upper_quartile: 135.08124174966116\n",
" std: 85.09162990442911\n",
" mean: 91.9201396254798\n",
"\n",
"ColumnProfile\n",
" name: end_station_name\n",
" type: FieldType.STRING\n",
"\n",
" min: 10th Ave at E 15th St\n",
" max: Yerba Buena Center for the Arts (Howard St at 3rd St)\n",
" count: 51853.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: end_station_latitude\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 37.3184498\n",
" max: 37.88022244590679\n",
" count: 51853.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 37.77450364194883\n",
" median: 37.78358862499172\n",
" upper_quartile: 37.79555394254664\n",
" std: 0.08619152451969307\n",
" mean: 37.77190111029278\n",
"\n",
"ColumnProfile\n",
" name: end_station_longitude\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: -122.44429260492325\n",
" max: -121.8741186\n",
" count: 51853.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: -122.40967464398858\n",
" median: -122.39857551157675\n",
" upper_quartile: -122.39085540596203\n",
" std: 0.10507512085392584\n",
" mean: -122.3629776153239\n",
"\n",
"ColumnProfile\n",
" name: bike_id\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 10.0\n",
" max: 3733.0\n",
" count: 51853.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 795.8904240211187\n",
" median: 1723.039443196501\n",
" upper_quartile: 2524.901114053501\n",
" std: 970.5058870359009\n",
" mean: 1674.5133936319962\n",
"\n",
"ColumnProfile\n",
" name: user_type\n",
" type: FieldType.STRING\n",
"\n",
" min: Customer\n",
" max: Subscriber\n",
" count: 51853.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: member_birth_year\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 1900.0\n",
" max: 1999.0\n",
" count: 51853.0\n",
" missing_count: 6577.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 1974.2949238618335\n",
" median: 1982.7223690704195\n",
" upper_quartile: 1988.012942765942\n",
" std: 10.414847623452637\n",
" mean: 1980.4024648820382\n",
"\n",
"ColumnProfile\n",
" name: member_gender\n",
" type: FieldType.STRING\n",
"\n",
" min: \n",
" max: Other\n",
" count: 51853.0\n",
" missing_count: 0.0\n",
" error_count: 0.0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sampled_gobike.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It appears that we have quite a few missing values in `member_birth_year`. We also immediately see that we have some empty strings in our `member_gender` column. With the data profiler, we can quickly do a sanity check on our dataset and see where we might need to start data cleaning."
]
},
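{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same two checks can also be written by hand in `pandas`. The cell below is a minimal sketch on made-up rows that mirror the first five rows shown earlier; `toy_members` is a hypothetical name. Missing birth years are counted with `isna()`, while empty gender strings need a separate comparison because they are not null values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical rows mimicking the two problem columns.\n",
"toy_members = pd.DataFrame({\n",
"    'member_birth_year': [1987.0, 1965.0, np.nan, np.nan, 1997.0],\n",
"    'member_gender': ['Male', 'Female', '', '', 'Female'],\n",
"})\n",
"\n",
"# NaN values count as missing; empty strings do not.\n",
"missing_birth_years = int(toy_members['member_birth_year'].isna().sum())\n",
"empty_genders = int((toy_members['member_gender'] == '').sum())\n",
"missing_genders = int(toy_members['member_gender'].isna().sum())\n",
"\n",
"print(missing_birth_years, empty_genders, missing_genders)"
]
},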
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Derive by example\n",
"\n",
"Azure ML Data Prep comes with additional \"smart\" transforms created by Microsoft Research. Here, we'll look at how you can derive a new column by providing examples of input-output pairs. Rather than explicitly using regular expressions to extract dates or hours from datetimes, we can provide examples for Azure ML Data Prep to learn what the pattern is. In fact, these smart transformations can also handle more complex derivations like inferring the day of the week from datetimes."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"sgb_derived = sampled_gobike\\\n",
" .to_string(\n",
" columns=['start_time', 'end_time']\n",
" )\\\n",
" .derive_column_by_example(\n",
" source_columns='start_time',\n",
" new_column_name='date',\n",
" example_data=[('2017-12-31 16:57:39.6540', '2017-12-31'), ('2017-12-31 16:57:39', '2017-12-31')]\n",
" )\\\n",
" .derive_column_by_example(\n",
" source_columns='start_time',\n",
" new_column_name='hour',\n",
" example_data=[('2017-12-31 16:57:39.6540', '16')]\n",
" )\\\n",
" .derive_column_by_example(\n",
" source_columns='start_time',\n",
" new_column_name='wday',\n",
" example_data=[('2017-12-31 16:57:39.6540', 'Sunday')]\n",
" )"
]
},
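{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the three derived columns can also be computed explicitly. The cell below is a minimal `pandas` sketch, assuming timestamps formatted like `start_time`. The data and the name `toy_times` are made up for illustration, and this is not how derive-by-example works internally. It can serve as an independent reference when spot-checking the derived values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical timestamps in the same layout as start_time.\n",
"toy_times = pd.DataFrame({\n",
"    'start_time': ['2017-12-31 16:57:39.654', '2017-12-30 23:46:13.358'],\n",
"})\n",
"parsed = pd.to_datetime(toy_times['start_time'])\n",
"\n",
"# Explicit equivalents of the date, hour and wday columns.\n",
"toy_times['date'] = parsed.dt.strftime('%Y-%m-%d')\n",
"toy_times['hour'] = parsed.dt.strftime('%H')\n",
"toy_times['wday'] = parsed.dt.day_name()\n",
"\n",
"print(toy_times)"
]
},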
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filter our data\n",
"\n",
"Let's verify that our derivations are correct by doing a bit of spot-checking."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration_sec</th>\n",
" <th>start_time</th>\n",
" <th>wday</th>\n",
" <th>hour</th>\n",
" <th>date</th>\n",
" <th>end_time</th>\n",
" <th>start_station_id</th>\n",
" <th>start_station_name</th>\n",
" <th>start_station_latitude</th>\n",
" <th>start_station_longitude</th>\n",
" <th>end_station_id</th>\n",
" <th>end_station_name</th>\n",
" <th>end_station_latitude</th>\n",
" <th>end_station_longitude</th>\n",
" <th>bike_id</th>\n",
" <th>user_type</th>\n",
" <th>member_birth_year</th>\n",
" <th>member_gender</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3456.0</td>\n",
" <td>2017-12-30 23:46:13.358000</td>\n",
" <td>Saturday</td>\n",
" <td>23</td>\n",
" <td>2017-12-30</td>\n",
" <td>2017-12-31 00:43:49.469000</td>\n",
" <td>75.0</td>\n",
" <td>Market St at Franklin St</td>\n",
" <td>37.773793</td>\n",
" <td>-122.421239</td>\n",
" <td>75.0</td>\n",
" <td>Market St at Franklin St</td>\n",
" <td>37.773793</td>\n",
" <td>-122.421239</td>\n",
" <td>1642.0</td>\n",
" <td>Subscriber</td>\n",
" <td>1972.0</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>204.0</td>\n",
" <td>2017-12-30 23:31:38.904000</td>\n",
" <td>Saturday</td>\n",
" <td>23</td>\n",
" <td>2017-12-30</td>\n",
" <td>2017-12-30 23:35:03.121000</td>\n",
" <td>84.0</td>\n",
" <td>Duboce Park</td>\n",
" <td>37.769200</td>\n",
" <td>-122.433812</td>\n",
" <td>107.0</td>\n",
" <td>17th St at Dolores St</td>\n",
" <td>37.763015</td>\n",
" <td>-122.426497</td>\n",
" <td>2201.0</td>\n",
" <td>Subscriber</td>\n",
" <td>1965.0</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>743.0</td>\n",
" <td>2017-12-30 22:35:13.114000</td>\n",
" <td>Saturday</td>\n",
" <td>22</td>\n",
" <td>2017-12-30</td>\n",
" <td>2017-12-30 22:47:36.356000</td>\n",
" <td>285.0</td>\n",
" <td>Webster St at O'Farrell St</td>\n",
" <td>37.783521</td>\n",
" <td>-122.431158</td>\n",
" <td>97.0</td>\n",
" <td>14th St at Mission St</td>\n",
" <td>37.768265</td>\n",
" <td>-122.420110</td>\n",
" <td>1628.0</td>\n",
" <td>Subscriber</td>\n",
" <td>1993.0</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>328.0</td>\n",
" <td>2017-12-30 22:19:28.760000</td>\n",
" <td>Saturday</td>\n",
" <td>22</td>\n",
" <td>2017-12-30</td>\n",
" <td>2017-12-30 22:24:57.489000</td>\n",
" <td>5.0</td>\n",
" <td>Powell St BART Station (Market St at 5th St)</td>\n",
" <td>37.783899</td>\n",
" <td>-122.408445</td>\n",
" <td>64.0</td>\n",
" <td>5th St at Brannan St</td>\n",
" <td>37.776754</td>\n",
" <td>-122.399018</td>\n",
" <td>2806.0</td>\n",
" <td>Subscriber</td>\n",
" <td>1986.0</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>260.0</td>\n",
" <td>2017-12-30 21:22:40.116000</td>\n",
" <td>Saturday</td>\n",
" <td>21</td>\n",
" <td>2017-12-30</td>\n",
" <td>2017-12-30 21:27:00.885000</td>\n",
" <td>277.0</td>\n",
" <td>Morrison Ave at Julian St</td>\n",
" <td>37.333658</td>\n",
" <td>-121.908586</td>\n",
" <td>278.0</td>\n",
" <td>The Alameda at Bush St</td>\n",
" <td>37.331932</td>\n",
" <td>-121.904888</td>\n",
" <td>465.0</td>\n",
" <td>Subscriber</td>\n",
" <td>1991.0</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" duration_sec start_time wday hour date \\\n",
"0 3456.0 2017-12-30 23:46:13.358000 Saturday 23 2017-12-30 \n",
"1 204.0 2017-12-30 23:31:38.904000 Saturday 23 2017-12-30 \n",
"2 743.0 2017-12-30 22:35:13.114000 Saturday 22 2017-12-30 \n",
"3 328.0 2017-12-30 22:19:28.760000 Saturday 22 2017-12-30 \n",
"4 260.0 2017-12-30 21:22:40.116000 Saturday 21 2017-12-30 \n",
"\n",
" end_time start_station_id \\\n",
"0 2017-12-31 00:43:49.469000 75.0 \n",
"1 2017-12-30 23:35:03.121000 84.0 \n",
"2 2017-12-30 22:47:36.356000 285.0 \n",
"3 2017-12-30 22:24:57.489000 5.0 \n",
"4 2017-12-30 21:27:00.885000 277.0 \n",
"\n",
" start_station_name start_station_latitude \\\n",
"0 Market St at Franklin St 37.773793 \n",
"1 Duboce Park 37.769200 \n",
"2 Webster St at O'Farrell St 37.783521 \n",
"3 Powell St BART Station (Market St at 5th St) 37.783899 \n",
"4 Morrison Ave at Julian St 37.333658 \n",
"\n",
" start_station_longitude end_station_id end_station_name \\\n",
"0 -122.421239 75.0 Market St at Franklin St \n",
"1 -122.433812 107.0 17th St at Dolores St \n",
"2 -122.431158 97.0 14th St at Mission St \n",
"3 -122.408445 64.0 5th St at Brannan St \n",
"4 -121.908586 278.0 The Alameda at Bush St \n",
"\n",
" end_station_latitude end_station_longitude bike_id user_type \\\n",
"0 37.773793 -122.421239 1642.0 Subscriber \n",
"1 37.763015 -122.426497 2201.0 Subscriber \n",
"2 37.768265 -122.420110 1628.0 Subscriber \n",
"3 37.776754 -122.399018 2806.0 Subscriber \n",
"4 37.331932 -121.904888 465.0 Subscriber \n",
"\n",
" member_birth_year member_gender \n",
"0 1972.0 Male \n",
"1 1965.0 Male \n",
"2 1993.0 Male \n",
"3 1986.0 Male \n",
"4 1991.0 Male "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sgb_derived.filter(dprep.col('wday') != 'Sunday').head(5)"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also filter on other column types. `duration_sec` is numeric and measured in seconds, so let's take a peek at rides that lasted over 5 hours (`60 * 60 * 5` seconds)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration_sec</th>\n",
" <th>start_time</th>\n",
" <th>wday</th>\n",
" <th>hour</th>\n",
" <th>date</th>\n",
" <th>end_time</th>\n",
" <th>start_station_id</th>\n",
" <th>start_station_name</th>\n",
" <th>start_station_latitude</th>\n",
" <th>start_station_longitude</th>\n",
" <th>end_station_id</th>\n",
" <th>end_station_name</th>\n",
" <th>end_station_latitude</th>\n",
" <th>end_station_longitude</th>\n",
" <th>bike_id</th>\n",
" <th>user_type</th>\n",
" <th>member_birth_year</th>\n",
" <th>member_gender</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>80110.0</td>\n",
" <td>2017-12-31 16:57:39.654000</td>\n",
" <td>Sunday</td>\n",
" <td>16</td>\n",
" <td>2017-12-31</td>\n",
" <td>2018-01-01 15:12:50.245000</td>\n",
" <td>74.0</td>\n",
" <td>Laguna St at Hayes St</td>\n",
" <td>37.776435</td>\n",
" <td>-122.426244</td>\n",
" <td>43.0</td>\n",
" <td>San Francisco Public Library (Grove St at Hyde...</td>\n",
" <td>37.778768</td>\n",
" <td>-122.415929</td>\n",
" <td>96.0</td>\n",
" <td>Customer</td>\n",
" <td>1987</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>22587.0</td>\n",
" <td>2017-12-31 13:51:04.538000</td>\n",
" <td>Sunday</td>\n",
" <td>13</td>\n",
" <td>2017-12-31</td>\n",
" <td>2017-12-31 20:07:32.139000</td>\n",
" <td>307.0</td>\n",
" <td>SAP Center</td>\n",
" <td>37.332692</td>\n",
" <td>-121.900084</td>\n",
" <td>307.0</td>\n",
" <td>SAP Center</td>\n",
" <td>37.332692</td>\n",
" <td>-121.900084</td>\n",
" <td>1443.0</td>\n",
" <td>Customer</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>18067.0</td>\n",
" <td>2017-12-30 04:20:13.938000</td>\n",
" <td>Saturday</td>\n",
" <td>04</td>\n",
" <td>2017-12-30</td>\n",
" <td>2017-12-30 09:21:21.628000</td>\n",
" <td>70.0</td>\n",
" <td>Central Ave at Fell St</td>\n",
" <td>37.773311</td>\n",
" <td>-122.444293</td>\n",
" <td>43.0</td>\n",
" <td>San Francisco Public Library (Grove St at Hyde...</td>\n",
" <td>37.778768</td>\n",
" <td>-122.415929</td>\n",
" <td>1928.0</td>\n",
" <td>Customer</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>54550.0</td>\n",
" <td>2017-12-29 10:02:38.086000</td>\n",
" <td>Friday</td>\n",
" <td>10</td>\n",
" <td>2017-12-29</td>\n",
" <td>2017-12-30 01:11:48.539000</td>\n",
" <td>21.0</td>\n",
" <td>Montgomery St BART Station (Market St at 2nd St)</td>\n",
" <td>37.789625</td>\n",
" <td>-122.400811</td>\n",
" <td>84.0</td>\n",
" <td>Duboce Park</td>\n",
" <td>37.769200</td>\n",
" <td>-122.433812</td>\n",
" <td>209.0</td>\n",
" <td>Customer</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>63627.0</td>\n",
" <td>2017-12-27 19:12:42.794000</td>\n",
" <td>Wednesday</td>\n",
" <td>19</td>\n",
" <td>2017-12-27</td>\n",
" <td>2017-12-28 12:53:10.649000</td>\n",
" <td>249.0</td>\n",
" <td>Russell St at College Ave</td>\n",
" <td>37.858473</td>\n",
" <td>-122.253253</td>\n",
" <td>244.0</td>\n",
" <td>Shattuck Ave at Hearst Ave</td>\n",
" <td>37.873792</td>\n",
" <td>-122.268618</td>\n",
" <td>1804.0</td>\n",
" <td>Customer</td>\n",
" <td>1988</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" duration_sec start_time wday hour date \\\n",
"0 80110.0 2017-12-31 16:57:39.654000 Sunday 16 2017-12-31 \n",
"1 22587.0 2017-12-31 13:51:04.538000 Sunday 13 2017-12-31 \n",
"2 18067.0 2017-12-30 04:20:13.938000 Saturday 04 2017-12-30 \n",
"3 54550.0 2017-12-29 10:02:38.086000 Friday 10 2017-12-29 \n",
"4 63627.0 2017-12-27 19:12:42.794000 Wednesday 19 2017-12-27 \n",
"\n",
" end_time start_station_id \\\n",
"0 2018-01-01 15:12:50.245000 74.0 \n",
"1 2017-12-31 20:07:32.139000 307.0 \n",
"2 2017-12-30 09:21:21.628000 70.0 \n",
"3 2017-12-30 01:11:48.539000 21.0 \n",
"4 2017-12-28 12:53:10.649000 249.0 \n",
"\n",
" start_station_name start_station_latitude \\\n",
"0 Laguna St at Hayes St 37.776435 \n",
"1 SAP Center 37.332692 \n",
"2 Central Ave at Fell St 37.773311 \n",
"3 Montgomery St BART Station (Market St at 2nd St) 37.789625 \n",
"4 Russell St at College Ave 37.858473 \n",
"\n",
" start_station_longitude end_station_id \\\n",
"0 -122.426244 43.0 \n",
"1 -121.900084 307.0 \n",
"2 -122.444293 43.0 \n",
"3 -122.400811 84.0 \n",
"4 -122.253253 244.0 \n",
"\n",
" end_station_name end_station_latitude \\\n",
"0 San Francisco Public Library (Grove St at Hyde... 37.778768 \n",
"1 SAP Center 37.332692 \n",
"2 San Francisco Public Library (Grove St at Hyde... 37.778768 \n",
"3 Duboce Park 37.769200 \n",
"4 Shattuck Ave at Hearst Ave 37.873792 \n",
"\n",
" end_station_longitude bike_id user_type member_birth_year member_gender \n",
"0 -122.415929 96.0 Customer 1987 Male \n",
"1 -121.900084 1443.0 Customer None \n",
"2 -122.415929 1928.0 Customer None \n",
"3 -122.433812 209.0 Customer None \n",
"4 -122.268618 1804.0 Customer 1988 Male "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sgb_derived.filter(dprep.col('duration_sec') > (60 * 60 * 5)).head(5)"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transform our data\n",
    "\n",
    "In addition to \"smart\" transformations, Azure ML Data Prep supports many common data science transforms that will be familiar from other industry-standard data science libraries. Here, we'll explore how to `summarize` data and replace values. We'll also get to use `join` when we handle assertions.\n",
    "\n",
    "#### Summarize\n",
    "\n",
    "`summarize` aggregates columns, optionally within groups. Here, we compute the mean ride duration (`duration_sec`) for each `date`."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>date</th>\n",
" <th>duration_sec_mean</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2017-12-31</td>\n",
" <td>1982.801418</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2017-12-30</td>\n",
" <td>1203.766423</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2017-12-29</td>\n",
" <td>1287.324841</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2017-12-28</td>\n",
" <td>835.146465</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2017-12-27</td>\n",
" <td>1658.735955</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" date duration_sec_mean\n",
"0 2017-12-31 1982.801418\n",
"1 2017-12-30 1203.766423\n",
"2 2017-12-29 1287.324841\n",
"3 2017-12-28 835.146465\n",
"4 2017-12-27 1658.735955"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
   ],
   "source": [
    "# Mean ride duration (in seconds) for each date\n",
    "sgb_summary = sgb_derived.summarize(\n",
    "    summary_columns=[\n",
    "        dprep.SummaryColumnsValue(\n",
    "            column_id='duration_sec',\n",
    "            summary_column_name='duration_sec_mean',\n",
    "            summary_function=dprep.SummaryFunction.MEAN\n",
    "        )\n",
    "    ],\n",
    "    group_by_columns=['date']\n",
    ")\n",
    "sgb_summary.head(5)"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Azure ML Data Prep also makes it easy to append the output of `summarize` to the original table. Passing `join_back=True` adds each group's summary value to every row that shares the grouping column (here, `date`)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration_sec</th>\n",
" <th>start_time</th>\n",
" <th>wday</th>\n",
" <th>hour</th>\n",
" <th>date</th>\n",
" <th>end_time</th>\n",
" <th>start_station_id</th>\n",
" <th>start_station_name</th>\n",
" <th>start_station_latitude</th>\n",
" <th>start_station_longitude</th>\n",
" <th>end_station_id</th>\n",
" <th>end_station_name</th>\n",
" <th>end_station_latitude</th>\n",
" <th>end_station_longitude</th>\n",
" <th>bike_id</th>\n",
" <th>user_type</th>\n",
" <th>member_birth_year</th>\n",
" <th>member_gender</th>\n",
" <th>duration_sec_mean</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>80110.0</td>\n",
" <td>2017-12-31 16:57:39.654000</td>\n",
" <td>Sunday</td>\n",
" <td>16</td>\n",
" <td>2017-12-31</td>\n",
" <td>2018-01-01 15:12:50.245000</td>\n",
" <td>74.0</td>\n",
" <td>Laguna St at Hayes St</td>\n",
" <td>37.776435</td>\n",
" <td>-122.426244</td>\n",
" <td>43.0</td>\n",
" <td>San Francisco Public Library (Grove St at Hyde...</td>\n",
" <td>37.778768</td>\n",
" <td>-122.415929</td>\n",
" <td>96.0</td>\n",
" <td>Customer</td>\n",
" <td>1987</td>\n",
" <td>Male</td>\n",
" <td>1982.801418</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3292.0</td>\n",
" <td>2017-12-31 23:46:32.403000</td>\n",
" <td>Sunday</td>\n",
" <td>23</td>\n",
" <td>2017-12-31</td>\n",
" <td>2018-01-01 00:41:24.605000</td>\n",
" <td>284.0</td>\n",
" <td>Yerba Buena Center for the Arts (Howard St at ...</td>\n",
" <td>37.784872</td>\n",
" <td>-122.400876</td>\n",
" <td>22.0</td>\n",
" <td>Howard St at Beale St</td>\n",
" <td>37.789756</td>\n",
" <td>-122.394643</td>\n",
" <td>3058.0</td>\n",
" <td>Customer</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>1982.801418</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1397.0</td>\n",
" <td>2017-12-31 23:55:09.686000</td>\n",
" <td>Sunday</td>\n",
" <td>23</td>\n",
" <td>2017-12-31</td>\n",
" <td>2018-01-01 00:18:26.721000</td>\n",
" <td>78.0</td>\n",
" <td>Folsom St at 9th St</td>\n",
" <td>37.773717</td>\n",
" <td>-122.411647</td>\n",
" <td>15.0</td>\n",
" <td>San Francisco Ferry Building (Harry Bridges Pl...</td>\n",
" <td>37.795392</td>\n",
" <td>-122.394203</td>\n",
" <td>1667.0</td>\n",
" <td>Customer</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>1982.801418</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>422.0</td>\n",
" <td>2017-12-31 23:54:25.337000</td>\n",
" <td>Sunday</td>\n",
" <td>23</td>\n",
" <td>2017-12-31</td>\n",
" <td>2018-01-01 00:01:27.354000</td>\n",
" <td>139.0</td>\n",
" <td>Garfield Square (25th St at Harrison St)</td>\n",
" <td>37.751017</td>\n",
" <td>-122.411901</td>\n",
" <td>99.0</td>\n",
" <td>Folsom St at 15th St</td>\n",
" <td>37.767037</td>\n",
" <td>-122.415442</td>\n",
" <td>2415.0</td>\n",
" <td>Subscriber</td>\n",
" <td>1985</td>\n",
" <td>Male</td>\n",
" <td>1982.801418</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1130.0</td>\n",
" <td>2017-12-31 23:36:16.069000</td>\n",
" <td>Sunday</td>\n",
" <td>23</td>\n",
" <td>2017-12-31</td>\n",
" <td>2017-12-31 23:55:06.096000</td>\n",
" <td>66.0</td>\n",
" <td>3rd St at Townsend St</td>\n",
" <td>37.778742</td>\n",
" <td>-122.392741</td>\n",
" <td>23.0</td>\n",
" <td>The Embarcadero at Steuart St</td>\n",
" <td>37.791464</td>\n",
" <td>-122.391034</td>\n",
" <td>2721.0</td>\n",
" <td>Customer</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>1982.801418</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" duration_sec start_time wday hour date \\\n",
"0 80110.0 2017-12-31 16:57:39.654000 Sunday 16 2017-12-31 \n",
"1 3292.0 2017-12-31 23:46:32.403000 Sunday 23 2017-12-31 \n",
"2 1397.0 2017-12-31 23:55:09.686000 Sunday 23 2017-12-31 \n",
"3 422.0 2017-12-31 23:54:25.337000 Sunday 23 2017-12-31 \n",
"4 1130.0 2017-12-31 23:36:16.069000 Sunday 23 2017-12-31 \n",
"\n",
" end_time start_station_id \\\n",
"0 2018-01-01 15:12:50.245000 74.0 \n",
"1 2018-01-01 00:41:24.605000 284.0 \n",
"2 2018-01-01 00:18:26.721000 78.0 \n",
"3 2018-01-01 00:01:27.354000 139.0 \n",
"4 2017-12-31 23:55:06.096000 66.0 \n",
"\n",
" start_station_name start_station_latitude \\\n",
"0 Laguna St at Hayes St 37.776435 \n",
"1 Yerba Buena Center for the Arts (Howard St at ... 37.784872 \n",
"2 Folsom St at 9th St 37.773717 \n",
"3 Garfield Square (25th St at Harrison St) 37.751017 \n",
"4 3rd St at Townsend St 37.778742 \n",
"\n",
" start_station_longitude end_station_id \\\n",
"0 -122.426244 43.0 \n",
"1 -122.400876 22.0 \n",
"2 -122.411647 15.0 \n",
"3 -122.411901 99.0 \n",
"4 -122.392741 23.0 \n",
"\n",
" end_station_name end_station_latitude \\\n",
"0 San Francisco Public Library (Grove St at Hyde... 37.778768 \n",
"1 Howard St at Beale St 37.789756 \n",
"2 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 \n",
"3 Folsom St at 15th St 37.767037 \n",
"4 The Embarcadero at Steuart St 37.791464 \n",
"\n",
" end_station_longitude bike_id user_type member_birth_year member_gender \\\n",
"0 -122.415929 96.0 Customer 1987 Male \n",
"1 -122.394643 3058.0 Customer None \n",
"2 -122.394203 1667.0 Customer None \n",
"3 -122.415442 2415.0 Subscriber 1985 Male \n",
"4 -122.391034 2721.0 Customer None \n",
"\n",
" duration_sec_mean \n",
"0 1982.801418 \n",
"1 1982.801418 \n",
"2 1982.801418 \n",
"3 1982.801418 \n",
"4 1982.801418 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
   ],
   "source": [
    "# Same summary as before, but join_back=True appends each date's mean to every original row\n",
    "sgb_appended = sgb_derived.summarize(\n",
    "    summary_columns=[\n",
    "        dprep.SummaryColumnsValue(\n",
    "            column_id='duration_sec',\n",
    "            summary_column_name='duration_sec_mean',\n",
    "            summary_function=dprep.SummaryFunction.MEAN\n",
    "        )\n",
    "    ],\n",
    "    group_by_columns=['date'],\n",
    "    join_back=True\n",
    ")\n",
    "sgb_appended.head(5)"
]
},
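  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Replace\n",
    "\n",
    "Recall that our `member_gender` column had empty strings that stood in place of `None`. Let's use the `replace_na` function to properly recode them as `None`s. Note that we apply it to `sampled_gobike`, so the derived columns (`wday`, `hour`, `date`) do not appear in the output."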
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration_sec</th>\n",
" <th>start_time</th>\n",
" <th>end_time</th>\n",
" <th>start_station_id</th>\n",
" <th>start_station_name</th>\n",
" <th>start_station_latitude</th>\n",
" <th>start_station_longitude</th>\n",
" <th>end_station_id</th>\n",
" <th>end_station_name</th>\n",
" <th>end_station_latitude</th>\n",
" <th>end_station_longitude</th>\n",
" <th>bike_id</th>\n",
" <th>user_type</th>\n",
" <th>member_birth_year</th>\n",
" <th>member_gender</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>80110.0</td>\n",
" <td>2017-12-31 16:57:39.654000+00:00</td>\n",
" <td>2018-01-01 15:12:50.245000+00:00</td>\n",
" <td>74.0</td>\n",
" <td>Laguna St at Hayes St</td>\n",
" <td>37.776435</td>\n",
" <td>-122.426244</td>\n",
" <td>43.0</td>\n",
" <td>San Francisco Public Library (Grove St at Hyde...</td>\n",
" <td>37.778768</td>\n",
" <td>-122.415929</td>\n",
" <td>96.0</td>\n",
" <td>Customer</td>\n",
" <td>1987</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3292.0</td>\n",
" <td>2017-12-31 23:46:32.403000+00:00</td>\n",
" <td>2018-01-01 00:41:24.605000+00:00</td>\n",
" <td>284.0</td>\n",
" <td>Yerba Buena Center for the Arts (Howard St at ...</td>\n",
" <td>37.784872</td>\n",
" <td>-122.400876</td>\n",
" <td>22.0</td>\n",
" <td>Howard St at Beale St</td>\n",
" <td>37.789756</td>\n",
" <td>-122.394643</td>\n",
" <td>3058.0</td>\n",
" <td>Customer</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1397.0</td>\n",
" <td>2017-12-31 23:55:09.686000+00:00</td>\n",
" <td>2018-01-01 00:18:26.721000+00:00</td>\n",
" <td>78.0</td>\n",
" <td>Folsom St at 9th St</td>\n",
" <td>37.773717</td>\n",
" <td>-122.411647</td>\n",
" <td>15.0</td>\n",
" <td>San Francisco Ferry Building (Harry Bridges Pl...</td>\n",
" <td>37.795392</td>\n",
" <td>-122.394203</td>\n",
" <td>1667.0</td>\n",
" <td>Customer</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>422.0</td>\n",
" <td>2017-12-31 23:54:25.337000+00:00</td>\n",
" <td>2018-01-01 00:01:27.354000+00:00</td>\n",
" <td>139.0</td>\n",
" <td>Garfield Square (25th St at Harrison St)</td>\n",
" <td>37.751017</td>\n",
" <td>-122.411901</td>\n",
" <td>99.0</td>\n",
" <td>Folsom St at 15th St</td>\n",
" <td>37.767037</td>\n",
" <td>-122.415442</td>\n",
" <td>2415.0</td>\n",
" <td>Subscriber</td>\n",
" <td>1985</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1130.0</td>\n",
" <td>2017-12-31 23:36:16.069000+00:00</td>\n",
" <td>2017-12-31 23:55:06.096000+00:00</td>\n",
" <td>66.0</td>\n",
" <td>3rd St at Townsend St</td>\n",
" <td>37.778742</td>\n",
" <td>-122.392741</td>\n",
" <td>23.0</td>\n",
" <td>The Embarcadero at Steuart St</td>\n",
" <td>37.791464</td>\n",
" <td>-122.391034</td>\n",
" <td>2721.0</td>\n",
" <td>Customer</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" duration_sec start_time \\\n",
"0 80110.0 2017-12-31 16:57:39.654000+00:00 \n",
"1 3292.0 2017-12-31 23:46:32.403000+00:00 \n",
"2 1397.0 2017-12-31 23:55:09.686000+00:00 \n",
"3 422.0 2017-12-31 23:54:25.337000+00:00 \n",
"4 1130.0 2017-12-31 23:36:16.069000+00:00 \n",
"\n",
" end_time start_station_id \\\n",
"0 2018-01-01 15:12:50.245000+00:00 74.0 \n",
"1 2018-01-01 00:41:24.605000+00:00 284.0 \n",
"2 2018-01-01 00:18:26.721000+00:00 78.0 \n",
"3 2018-01-01 00:01:27.354000+00:00 139.0 \n",
"4 2017-12-31 23:55:06.096000+00:00 66.0 \n",
"\n",
" start_station_name start_station_latitude \\\n",
"0 Laguna St at Hayes St 37.776435 \n",
"1 Yerba Buena Center for the Arts (Howard St at ... 37.784872 \n",
"2 Folsom St at 9th St 37.773717 \n",
"3 Garfield Square (25th St at Harrison St) 37.751017 \n",
"4 3rd St at Townsend St 37.778742 \n",
"\n",
" start_station_longitude end_station_id \\\n",
"0 -122.426244 43.0 \n",
"1 -122.400876 22.0 \n",
"2 -122.411647 15.0 \n",
"3 -122.411901 99.0 \n",
"4 -122.392741 23.0 \n",
"\n",
" end_station_name end_station_latitude \\\n",
"0 San Francisco Public Library (Grove St at Hyde... 37.778768 \n",
"1 Howard St at Beale St 37.789756 \n",
"2 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 \n",
"3 Folsom St at 15th St 37.767037 \n",
"4 The Embarcadero at Steuart St 37.791464 \n",
"\n",
" end_station_longitude bike_id user_type member_birth_year member_gender \n",
"0 -122.415929 96.0 Customer 1987 Male \n",
"1 -122.394643 3058.0 Customer None None \n",
"2 -122.394203 1667.0 Customer None None \n",
"3 -122.415442 2415.0 Subscriber 1985 Male \n",
"4 -122.391034 2721.0 Customer None None "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
   ],
   "source": [
    "# Recode the empty strings in member_gender as None\n",
    "sgb_replaced = sampled_gobike.replace_na(columns=['member_gender'])\n",
    "sgb_replaced.head(5)"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Assert on invalid data \n",
    "\n",
    "Azure ML Data Prep helps prevent broken pipelines and safeguards against bad data by supporting assertions. In our case, we'll create an assertion to handle potentially erroneous `member_birth_year` values. Nobody on record has lived longer than 130 years, so a birth year listed as before 1900 is wrong. Our `sampled_gobike` dataset doesn't have any such values, but the full `gobike` dataset does, so a pipeline that assumed every birth year was valid would fail on it. Azure ML Data Prep allows us to handle these cases gracefully with assertions."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Type</th>\n",
" <th>Min</th>\n",
" <th>Max</th>\n",
" <th>Count</th>\n",
" <th>Missing Count</th>\n",
" <th>Error Count</th>\n",
" <th>Lower Quartile</th>\n",
" <th>Upper Quartile</th>\n",
" <th>Standard Deviation</th>\n",
" <th>Mean</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>duration_sec</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>61</td>\n",
" <td>86369</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>381.842</td>\n",
" <td>938.574</td>\n",
" <td>3444.15</td>\n",
" <td>1099.01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_time</th>\n",
" <td>FieldType.DATE</td>\n",
" <td>2017-06-28 09:47:36.347000+00:00</td>\n",
" <td>2017-12-31 23:59:01.261000+00:00</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_time</th>\n",
" <td>FieldType.DATE</td>\n",
" <td>2017-06-28 09:52:55.338000+00:00</td>\n",
" <td>2018-01-01 15:12:50.245000+00:00</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_station_id</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>3</td>\n",
" <td>340</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>23.8481</td>\n",
" <td>139.424</td>\n",
" <td>86.0831</td>\n",
" <td>95.0342</td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_station_name</th>\n",
" <td>FieldType.STRING</td>\n",
" <td>10th Ave at E 15th St</td>\n",
" <td>Yerba Buena Center for the Arts (Howard St at ...</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_station_latitude</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>37.3173</td>\n",
" <td>37.8802</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>37.7736</td>\n",
" <td>37.7953</td>\n",
" <td>0.086305</td>\n",
" <td>37.7717</td>\n",
" </tr>\n",
" <tr>\n",
" <th>start_station_longitude</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>-122.444</td>\n",
" <td>-121.874</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-122.412</td>\n",
" <td>-122.391</td>\n",
" <td>0.105573</td>\n",
" <td>-122.364</td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_station_id</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>3</td>\n",
" <td>340</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>22.7024</td>\n",
" <td>134.22</td>\n",
" <td>84.9695</td>\n",
" <td>92.184</td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_station_name</th>\n",
" <td>FieldType.STRING</td>\n",
" <td>10th Ave at E 15th St</td>\n",
" <td>Yerba Buena Center for the Arts (Howard St at ...</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_station_latitude</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>37.3173</td>\n",
" <td>37.8802</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>37.7742</td>\n",
" <td>37.7956</td>\n",
" <td>0.0862238</td>\n",
" <td>37.7718</td>\n",
" </tr>\n",
" <tr>\n",
" <th>end_station_longitude</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>-122.444</td>\n",
" <td>-121.874</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>-122.41</td>\n",
" <td>-122.391</td>\n",
" <td>0.105122</td>\n",
" <td>-122.363</td>\n",
" </tr>\n",
" <tr>\n",
" <th>bike_id</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>10</td>\n",
" <td>3733</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>788.679</td>\n",
" <td>2519.96</td>\n",
" <td>971.357</td>\n",
" <td>1672.53</td>\n",
" </tr>\n",
" <tr>\n",
" <th>user_type</th>\n",
" <td>FieldType.STRING</td>\n",
" <td>Customer</td>\n",
" <td>Subscriber</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>member_birth_year</th>\n",
" <td>FieldType.DECIMAL</td>\n",
" <td>1900</td>\n",
" <td>1999</td>\n",
" <td>519700.0</td>\n",
" <td>66541.0</td>\n",
" <td>2.0</td>\n",
" <td>1974.33</td>\n",
" <td>1987.99</td>\n",
" <td>10.5116</td>\n",
" <td>1980.41</td>\n",
" </tr>\n",
" <tr>\n",
" <th>member_gender</th>\n",
" <td>FieldType.STRING</td>\n",
" <td></td>\n",
" <td>Other</td>\n",
" <td>519700.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
],
"text/plain": [
"ColumnProfile\n",
" name: duration_sec\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 61.0\n",
" max: 86369.0\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 381.8421435321134\n",
" median: 595.9837506906349\n",
" upper_quartile: 938.5741138032683\n",
" std: 3444.146451247386\n",
" mean: 1099.009520877422\n",
"\n",
"ColumnProfile\n",
" name: start_time\n",
" type: FieldType.DATE\n",
"\n",
" min: 2017-06-28 09:47:36.347000+00:00\n",
" max: 2017-12-31 23:59:01.261000+00:00\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: end_time\n",
" type: FieldType.DATE\n",
"\n",
" min: 2017-06-28 09:52:55.338000+00:00\n",
" max: 2018-01-01 15:12:50.245000+00:00\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: start_station_id\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 3.0\n",
" max: 340.0\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 23.848148131600635\n",
" median: 67.18427817406452\n",
" upper_quartile: 139.42430180307275\n",
" std: 86.08307797095921\n",
" mean: 95.03424475658852\n",
"\n",
"ColumnProfile\n",
" name: start_station_name\n",
" type: FieldType.STRING\n",
"\n",
" min: 10th Ave at E 15th St\n",
" max: Yerba Buena Center for the Arts (Howard St at 3rd St)\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: start_station_latitude\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 37.3172979\n",
" max: 37.88022244590679\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 37.7735913559721\n",
" median: 37.783211475877295\n",
" upper_quartile: 37.79531236950411\n",
" std: 0.08630496061661774\n",
" mean: 37.771652603110894\n",
"\n",
"ColumnProfile\n",
" name: start_station_longitude\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: -122.44429260492325\n",
" max: -121.8741186\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: -122.41170653070694\n",
" median: -122.39875282257843\n",
" upper_quartile: -122.39103429266093\n",
" std: 0.10557344899193394\n",
" mean: -122.36392726512949\n",
"\n",
"ColumnProfile\n",
" name: end_station_id\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 3.0\n",
" max: 340.0\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 22.702361193995444\n",
" median: 65.22613324081779\n",
" upper_quartile: 134.21987129021295\n",
" std: 84.9694914863546\n",
" mean: 92.18404079276426\n",
"\n",
"ColumnProfile\n",
" name: end_station_name\n",
" type: FieldType.STRING\n",
"\n",
" min: 10th Ave at E 15th St\n",
" max: Yerba Buena Center for the Arts (Howard St at 3rd St)\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: end_station_latitude\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 37.3172979\n",
" max: 37.88022244590679\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 37.774232065528906\n",
" median: 37.78329810124021\n",
" upper_quartile: 37.79557128475191\n",
" std: 0.08622383487119635\n",
" mean: 37.771843749644646\n",
"\n",
"ColumnProfile\n",
" name: end_station_longitude\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: -122.44429260492325\n",
" max: -121.8741186\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: -122.41012752595213\n",
" median: -122.39855511689811\n",
" upper_quartile: -122.39096192032446\n",
" std: 0.10512220222934929\n",
" mean: -122.36323553679931\n",
"\n",
"ColumnProfile\n",
" name: bike_id\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 10.0\n",
" max: 3733.0\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
" lower_quartile: 788.6785454424829\n",
" median: 1726.652793720984\n",
" upper_quartile: 2519.963581272433\n",
" std: 971.3569593530214\n",
" mean: 1672.533078699254\n",
"\n",
"ColumnProfile\n",
" name: user_type\n",
" type: FieldType.STRING\n",
"\n",
" min: Customer\n",
" max: Subscriber\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0\n",
"\n",
"ColumnProfile\n",
" name: member_birth_year\n",
" type: FieldType.DECIMAL\n",
"\n",
" min: 1900.0\n",
" max: 1999.0\n",
" count: 519700.0\n",
" missing_count: 66541.0\n",
" error_count: 2.0\n",
"\n",
" lower_quartile: 1974.3343079021402\n",
" median: 1982.8008012973817\n",
" upper_quartile: 1987.991638371539\n",
" std: 10.511639915765766\n",
" mean: 1980.4052039359563\n",
"\n",
"ColumnProfile\n",
" name: member_gender\n",
" type: FieldType.STRING\n",
"\n",
" min: \n",
" max: Other\n",
" count: 519700.0\n",
" missing_count: 0.0\n",
" error_count: 0.0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gb_asserted = gobike\\\n",
" .assert_value(\n",
" columns='member_birth_year', \n",
" expression=dprep.f_or(dprep.value.is_null(), dprep.value >= 1900),\n",
" error_code='InvalidDate'\n",
" )\n",
"gb_asserted.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can filter to see what caused the 2 errors above:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration_sec</th>\n",
" <th>start_time</th>\n",
" <th>end_time</th>\n",
" <th>start_station_id</th>\n",
" <th>start_station_name</th>\n",
" <th>start_station_latitude</th>\n",
" <th>start_station_longitude</th>\n",
" <th>end_station_id</th>\n",
" <th>end_station_name</th>\n",
" <th>end_station_latitude</th>\n",
" <th>end_station_longitude</th>\n",
" <th>bike_id</th>\n",
" <th>user_type</th>\n",
" <th>member_birth_year</th>\n",
" <th>member_gender</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2546.0</td>\n",
" <td>2017-08-19 17:47:32.110000+00:00</td>\n",
" <td>2017-08-19 18:29:58.825000+00:00</td>\n",
" <td>197.0</td>\n",
" <td>El Embarcadero at Grand Ave</td>\n",
" <td>37.808848</td>\n",
" <td>-122.24968</td>\n",
" <td>172.0</td>\n",
" <td>College Ave at Taft Ave</td>\n",
" <td>37.841800</td>\n",
" <td>-122.251535</td>\n",
" <td>1448.0</td>\n",
" <td>Customer</td>\n",
" <td>azureml.dataprep.native.DataPrepError(\"'Invali...</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1767.0</td>\n",
" <td>2017-08-19 13:20:02.170000+00:00</td>\n",
" <td>2017-08-19 13:49:29.735000+00:00</td>\n",
" <td>235.0</td>\n",
" <td>Union St at 10th St</td>\n",
" <td>37.807239</td>\n",
" <td>-122.28937</td>\n",
" <td>197.0</td>\n",
" <td>El Embarcadero at Grand Ave</td>\n",
" <td>37.808848</td>\n",
" <td>-122.249680</td>\n",
" <td>91.0</td>\n",
" <td>Customer</td>\n",
" <td>azureml.dataprep.native.DataPrepError(\"'Invali...</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" duration_sec start_time \\\n",
"0 2546.0 2017-08-19 17:47:32.110000+00:00 \n",
"1 1767.0 2017-08-19 13:20:02.170000+00:00 \n",
"\n",
" end_time start_station_id \\\n",
"0 2017-08-19 18:29:58.825000+00:00 197.0 \n",
"1 2017-08-19 13:49:29.735000+00:00 235.0 \n",
"\n",
" start_station_name start_station_latitude \\\n",
"0 El Embarcadero at Grand Ave 37.808848 \n",
"1 Union St at 10th St 37.807239 \n",
"\n",
" start_station_longitude end_station_id end_station_name \\\n",
"0 -122.24968 172.0 College Ave at Taft Ave \n",
"1 -122.28937 197.0 El Embarcadero at Grand Ave \n",
"\n",
" end_station_latitude end_station_longitude bike_id user_type \\\n",
"0 37.841800 -122.251535 1448.0 Customer \n",
"1 37.808848 -122.249680 91.0 Customer \n",
"\n",
" member_birth_year member_gender \n",
"0 azureml.dataprep.native.DataPrepError(\"'Invali... Male \n",
"1 azureml.dataprep.native.DataPrepError(\"'Invali... Male "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Keep only the rows where the assert replaced member_birth_year with an error value\n",
"gb_errors = gb_asserted.filter(dprep.col('member_birth_year').is_error())\n",
"gb_errors.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Join\n",
"But what were the original values? Let's use `join` to match the error rows back to the original Dataflow and see which values caused our assert to fail."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>l_duration_sec</th>\n",
" <th>l_start_time</th>\n",
" <th>l_end_time</th>\n",
" <th>l_start_station_id</th>\n",
" <th>l_start_station_name</th>\n",
" <th>l_start_station_latitude</th>\n",
" <th>l_start_station_longitude</th>\n",
" <th>l_end_station_id</th>\n",
" <th>l_end_station_name</th>\n",
" <th>l_end_station_latitude</th>\n",
" <th>...</th>\n",
" <th>r_start_station_latitude</th>\n",
" <th>r_start_station_longitude</th>\n",
" <th>r_end_station_id</th>\n",
" <th>r_end_station_name</th>\n",
" <th>r_end_station_latitude</th>\n",
" <th>r_end_station_longitude</th>\n",
" <th>r_bike_id</th>\n",
" <th>r_user_type</th>\n",
" <th>r_member_birth_year</th>\n",
" <th>r_member_gender</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2546.0</td>\n",
" <td>2017-08-19 17:47:32.110000+00:00</td>\n",
" <td>2017-08-19 18:29:58.825000+00:00</td>\n",
" <td>197.0</td>\n",
" <td>El Embarcadero at Grand Ave</td>\n",
" <td>37.808848</td>\n",
" <td>-122.24968</td>\n",
" <td>172.0</td>\n",
" <td>College Ave at Taft Ave</td>\n",
" <td>37.841800</td>\n",
" <td>...</td>\n",
" <td>37.808848</td>\n",
" <td>-122.24968</td>\n",
" <td>172.0</td>\n",
" <td>College Ave at Taft Ave</td>\n",
" <td>37.841800</td>\n",
" <td>-122.251535</td>\n",
" <td>1448.0</td>\n",
" <td>Customer</td>\n",
" <td>1886.0</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1767.0</td>\n",
" <td>2017-08-19 13:20:02.170000+00:00</td>\n",
" <td>2017-08-19 13:49:29.735000+00:00</td>\n",
" <td>235.0</td>\n",
" <td>Union St at 10th St</td>\n",
" <td>37.807239</td>\n",
" <td>-122.28937</td>\n",
" <td>197.0</td>\n",
" <td>El Embarcadero at Grand Ave</td>\n",
" <td>37.808848</td>\n",
" <td>...</td>\n",
" <td>37.807239</td>\n",
" <td>-122.28937</td>\n",
" <td>197.0</td>\n",
" <td>El Embarcadero at Grand Ave</td>\n",
" <td>37.808848</td>\n",
" <td>-122.249680</td>\n",
" <td>91.0</td>\n",
" <td>Customer</td>\n",
" <td>1886.0</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2 rows × 30 columns</p>\n",
"</div>"
],
"text/plain": [
" l_duration_sec l_start_time \\\n",
"0 2546.0 2017-08-19 17:47:32.110000+00:00 \n",
"1 1767.0 2017-08-19 13:20:02.170000+00:00 \n",
"\n",
" l_end_time l_start_station_id \\\n",
"0 2017-08-19 18:29:58.825000+00:00 197.0 \n",
"1 2017-08-19 13:49:29.735000+00:00 235.0 \n",
"\n",
" l_start_station_name l_start_station_latitude \\\n",
"0 El Embarcadero at Grand Ave 37.808848 \n",
"1 Union St at 10th St 37.807239 \n",
"\n",
" l_start_station_longitude l_end_station_id l_end_station_name \\\n",
"0 -122.24968 172.0 College Ave at Taft Ave \n",
"1 -122.28937 197.0 El Embarcadero at Grand Ave \n",
"\n",
" l_end_station_latitude ... r_start_station_latitude \\\n",
"0 37.841800 ... 37.808848 \n",
"1 37.808848 ... 37.807239 \n",
"\n",
" r_start_station_longitude r_end_station_id r_end_station_name \\\n",
"0 -122.24968 172.0 College Ave at Taft Ave \n",
"1 -122.28937 197.0 El Embarcadero at Grand Ave \n",
"\n",
" r_end_station_latitude r_end_station_longitude r_bike_id r_user_type \\\n",
"0 37.841800 -122.251535 1448.0 Customer \n",
"1 37.808848 -122.249680 91.0 Customer \n",
"\n",
" r_member_birth_year r_member_gender \n",
"0 1886.0 Male \n",
"1 1886.0 Male \n",
"\n",
"[2 rows x 30 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gb_errors.join(\n",
" left_dataflow=gb_errors,\n",
" right_dataflow=gobike,\n",
" join_key_pairs=[\n",
" ('duration_sec', 'duration_sec'),\n",
" ('start_station_id', 'start_station_id'),\n",
" ('bike_id', 'bike_id')\n",
" ]\n",
").head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we look at `r_member_birth_year`, we see that these people were listed as being born in 1886. That's impossible! Now that we've identified outliers and anomalies, we can appropriately clean our data however we like."
]
},
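{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, here is a minimal `pandas` sketch of the same assert-then-join pattern. It runs on a tiny, made-up table: the rows and the names `trips`, `is_valid`, and `asserted` are illustrative assumptions, not part of the GoBike data or the Data Prep API. Unlike Data Prep, `pandas` has no error values, so failing entries are blanked out to `NaN` and tracked with a separate boolean mask."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for the GoBike trips.\n",
"trips = pd.DataFrame({\n",
"    'bike_id': [1448, 91, 240],\n",
"    'duration_sec': [2546, 1767, 900],\n",
"    'member_birth_year': [1886.0, 1886.0, 1985.0],\n",
"})\n",
"\n",
"# Same rule as the assert above: a value passes if it is null or >= 1900.\n",
"birth_year = trips['member_birth_year']\n",
"is_valid = birth_year.isnull() | (birth_year >= 1900)\n",
"\n",
"# Blank out the failing values (a rough stand-in for Data Prep's error values).\n",
"asserted = trips.assign(member_birth_year=birth_year.where(is_valid))\n",
"\n",
"# Join the failing rows back to the original table to recover the raw values.\n",
"asserted[~is_valid].merge(trips, on=['bike_id', 'duration_sec'], suffixes=('_l', '_r'))"
]
},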
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Export for machine learning\n",
"\n",
"One of the beautiful features of Azure ML Data Prep is that you only need to write your code once and choose whether to scale up or out; it takes care of figuring out how. To do so, you can export the `.dprep` file you've written and tested on a smaller dataset, then run it with your larger dataset. Here, we show how you can export your new package. For a more detailed example of how to execute it on Spark, check out our [New York Taxicab scenario](https://github.com/Microsoft/PendletonDocs/blob/master/Scenarios/NYTaxiCab/01.new_york_taxi.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gobike = gobike.set_name(name=\"gobike\")\n",
"package_path = path.join(mkdtemp(), \"gobike.dprep\")\n",
"\n",
"print(\"Saving package to: {}\".format(package_path))\n",
"package = dprep.Package(arg=gobike)\n",
"package.save(file_path=package_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Want more information?\n",
"\n",
"Congratulations on finishing your introduction to the Azure ML Data Prep SDK! If you'd like more detailed tutorials on how to construct machine learning datasets or dive deeper into all of its functionality, you can find more information in our detailed notebooks [here](https://github.com/Microsoft/PendletonDocs). There, we cover topics including how to:\n",
"\n",
"* Cache your Dataflow to speed up your iterations\n",
"* Add your custom Python transforms\n",
"* Impute missing values\n",
"* Sample your data\n",
"* Reference and link between Dataflows\n",
"* Apply your Dataflow to a new, larger data source"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}