{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting started with Azure ML Data Prep SDK\n", "Copyright (c) Microsoft Corporation. All rights reserved.
\n", "Licensed under the MIT License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### Note: Some features in this Notebook will _not_ work with the Private Preview version of the SDK; it assumes the Public Preview version.\n", "\n", "Wonder how you can make the most of the Azure ML Data Prep SDK? In this \"Getting Started\" guide, we'll showcase a few highlights that make this SDK shine for big datasets where `pandas` and `dplyr` can fall short. Using the [Ford GoBike dataset](https://www.fordgobike.com/system-data) as an example, we'll cover how to build Dataflows that allow you to:\n", "\n", "* [Read in data](#Read-in-data)\n", "* [Get a profile of your data](#Get-data-profile)\n", "* [Apply smart transforms by Microsoft Research](#Derive-by-example)\n", "* [Filter quickly](#Filter-our-data)\n", "* [Apply common data science transforms](#Transform-our-data)\n", "* [Easily handle errors and assertions](#Assert-on-invalid-data)\n", "* [Prepare your dataset for export and machine learning](#Export-for-machine-learning)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from IPython.display import display\n", "from os import path\n", "from tempfile import mkdtemp\n", "\n", "import pandas as pd\n", "import azureml.dataprep as dprep" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read in data\n", "\n", "Azure ML Data Prep supports many file formats (e.g., CSV, Excel, Parquet) and can also infer column types automatically." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " duration_sec start_time \\\n", "0 80110.0 2017-12-31 16:57:39.654000+00:00 \n", "1 78800.0 2017-12-31 15:56:34.842000+00:00 \n", "2 45768.0 2017-12-31 22:45:48.411000+00:00 \n", "3 62172.0 2017-12-31 17:31:10.636000+00:00 \n", "4 43603.0 2017-12-31 14:23:14.001000+00:00 \n", "\n", " end_time start_station_id \\\n", "0 2018-01-01 15:12:50.245000+00:00 74.0 \n", "1 2018-01-01 13:49:55.617000+00:00 284.0 \n", "2 2018-01-01 11:28:36.883000+00:00 245.0 \n", "3 2018-01-01 10:47:23.531000+00:00 60.0 \n", "4 2018-01-01 02:29:57.571000+00:00 239.0 \n", "\n", " start_station_name start_station_latitude \\\n", "0 Laguna St at Hayes St 37.776435 \n", "1 Yerba Buena Center for the Arts (Howard St at ... 37.784872 \n", "2 Downtown Berkeley BART 37.870348 \n", "3 8th St at Ringold St 37.774520 \n", "4 Bancroft Way at Telegraph Ave 37.868813 \n", "\n", " start_station_longitude end_station_id \\\n", "0 -122.426244 43.0 \n", "1 -122.400876 96.0 \n", "2 -122.267764 245.0 \n", "3 -122.409449 5.0 \n", "4 -122.258764 247.0 \n", "\n", " end_station_name end_station_latitude \\\n", "0 San Francisco Public Library (Grove St at Hyde... 
37.778768 \n", "1 Dolores St at 15th St 37.766210 \n", "2 Downtown Berkeley BART 37.870348 \n", "3 Powell St BART Station (Market St at 5th St) 37.783899 \n", "4 Fulton St at Bancroft Way 37.867789 \n", "\n", " end_station_longitude bike_id user_type member_birth_year member_gender \n", "0 -122.415929 96.0 Customer 1987.0 Male \n", "1 -122.426614 88.0 Customer 1965.0 Female \n", "2 -122.267764 1094.0 Customer NaN \n", "3 -122.408445 2831.0 Customer NaN \n", "4 -122.265896 3167.0 Subscriber 1997.0 Female " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gobike = dprep\\\n", " .read_csv(\n", " path='https://dprepdata.blob.core.windows.net/demo/ford_gobike/2017-fordgobike-tripdata.csv',\n", " inference_arguments=dprep.InferenceArguments.current_culture()\n", " )\n", "gobike.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to iterate more quickly, we can take a sample of our data. Later, we can then apply the same transformations to the entire dataset." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "sampled_gobike = gobike.take_sample(probability=0.1, seed=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get data profile\n", "\n", "Let's understand what our data looks like. Azure ML Data Prep facilitates this process by offering data profiles that help us glimpse into column types and column summary statistics." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "ColumnProfile\n", " name: duration_sec\n", " type: FieldType.DECIMAL\n", "\n", " min: 61.0\n", " max: 86369.0\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 381.8421435321134\n", " median: 595.9837506906349\n", " upper_quartile: 938.5741138032683\n", " std: 3444.146451247386\n", " mean: 1099.009520877422\n", "\n", "ColumnProfile\n", " name: start_time\n", " type: FieldType.DATE\n", "\n", " min: 2017-06-28 09:47:36.347000+00:00\n", " max: 2017-12-31 23:59:01.261000+00:00\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: end_time\n", " type: FieldType.DATE\n", "\n", " min: 2017-06-28 09:52:55.338000+00:00\n", " max: 2018-01-01 15:12:50.245000+00:00\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: start_station_id\n", " type: FieldType.DECIMAL\n", "\n", " min: 3.0\n", " max: 340.0\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 23.848148131600635\n", " median: 67.18427817406452\n", " upper_quartile: 139.42430180307275\n", " std: 86.08307797095921\n", " mean: 95.03424475658852\n", "\n", "ColumnProfile\n", " name: start_station_name\n", " type: FieldType.STRING\n", "\n", " min: 10th Ave at E 15th St\n", " max: Yerba Buena Center for the Arts (Howard St at 3rd St)\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: start_station_latitude\n", " type: FieldType.DECIMAL\n", "\n", " min: 37.3172979\n", " max: 37.88022244590679\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 37.7735913559721\n", " median: 37.783211475877295\n", " upper_quartile: 37.79531236950411\n", " std: 0.08630496061661774\n", " mean: 37.771652603110894\n", "\n", "ColumnProfile\n", " name: start_station_longitude\n", " type: FieldType.DECIMAL\n", "\n", " min: 
-122.44429260492325\n", " max: -121.8741186\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: -122.41170653070694\n", " median: -122.39875282257843\n", " upper_quartile: -122.39103429266093\n", " std: 0.10557344899193394\n", " mean: -122.36392726512949\n", "\n", "ColumnProfile\n", " name: end_station_id\n", " type: FieldType.DECIMAL\n", "\n", " min: 3.0\n", " max: 340.0\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 22.702361193995444\n", " median: 65.22613324081779\n", " upper_quartile: 134.21987129021295\n", " std: 84.9694914863546\n", " mean: 92.18404079276426\n", "\n", "ColumnProfile\n", " name: end_station_name\n", " type: FieldType.STRING\n", "\n", " min: 10th Ave at E 15th St\n", " max: Yerba Buena Center for the Arts (Howard St at 3rd St)\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: end_station_latitude\n", " type: FieldType.DECIMAL\n", "\n", " min: 37.3172979\n", " max: 37.88022244590679\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 37.774232065528906\n", " median: 37.78329810124021\n", " upper_quartile: 37.79557128475191\n", " std: 0.08622383487119635\n", " mean: 37.771843749644646\n", "\n", "ColumnProfile\n", " name: end_station_longitude\n", " type: FieldType.DECIMAL\n", "\n", " min: -122.44429260492325\n", " max: -121.8741186\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: -122.41012752595213\n", " median: -122.39855511689811\n", " upper_quartile: -122.39096192032446\n", " std: 0.10512220222934929\n", " mean: -122.36323553679931\n", "\n", "ColumnProfile\n", " name: bike_id\n", " type: FieldType.DECIMAL\n", "\n", " min: 10.0\n", " max: 3733.0\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 788.6785454424829\n", " median: 1726.652793720984\n", " 
upper_quartile: 2519.963581272433\n", " std: 971.3569593530214\n", " mean: 1672.533078699254\n", "\n", "ColumnProfile\n", " name: user_type\n", " type: FieldType.STRING\n", "\n", " min: Customer\n", " max: Subscriber\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: member_birth_year\n", " type: FieldType.DECIMAL\n", "\n", " min: 1886.0\n", " max: 1999.0\n", " count: 519700.0\n", " missing_count: 66541.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 1974.3341624985283\n", " median: 1982.8007516297655\n", " upper_quartile: 1987.9916166785322\n", " std: 10.51348753990893\n", " mean: 1980.4047872821984\n", "\n", "ColumnProfile\n", " name: member_gender\n", " type: FieldType.STRING\n", "\n", " min: \n", " max: Other\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gobike.get_profile()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "ColumnProfile\n", " name: duration_sec\n", " type: FieldType.DECIMAL\n", "\n", " min: 61.0\n", " max: 85864.0\n", " count: 51853.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 381.0173265588649\n", " median: 596.0824682091602\n", " upper_quartile: 936.3990413401431\n", " std: 3527.1849383367376\n", " mean: 1102.2291284978571\n", "\n", "ColumnProfile\n", " name: start_time\n", " type: FieldType.DATE\n", "\n", " min: 2017-06-28 10:51:23.182000+00:00\n", " max: 2017-12-31 23:55:09.686000+00:00\n", " count: 51853.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: end_time\n", " type: FieldType.DATE\n", "\n", " min: 2017-06-28 11:01:39.557000+00:00\n", " max: 2018-01-01 15:12:50.245000+00:00\n", " count: 51853.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: start_station_id\n", " type: FieldType.DECIMAL\n", "\n", " min: 3.0\n", " max: 340.0\n", " count: 51853.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 23.82299260050619\n", " median: 66.81449005522046\n", " upper_quartile: 139.6790865298709\n", " std: 86.09232732608726\n", " mean: 94.87848340501073\n", "\n", "ColumnProfile\n", " name: start_station_name\n", " type: FieldType.STRING\n", "\n", " min: 10th Ave at E 15th St\n", " max: Yerba Buena Center for the Arts (Howard St at 3rd St)\n", " count: 51853.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: start_station_latitude\n", " type: FieldType.DECIMAL\n", "\n", " min: 37.3172979\n", " max: 37.88022244590679\n", " count: 51853.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 37.773594346717786\n", " median: 37.78325255020885\n", " upper_quartile: 37.795362857566715\n", " std: 0.08626372544371842\n", " mean: 37.771708918993944\n", "\n", "ColumnProfile\n", " name: start_station_longitude\n", " type: FieldType.DECIMAL\n", "\n", " min: 
-122.44429260492325\n", " max: -121.8741186\n", " count: 51853.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: -122.41157512442906\n", " median: -122.39882719487981\n", " upper_quartile: -122.39096385593315\n", " std: 0.10559301820942323\n", " mean: -122.36375576045955\n", "\n", "ColumnProfile\n", " name: end_station_id\n", " type: FieldType.DECIMAL\n", "\n", " min: 3.0\n", " max: 338.0\n", " count: 51853.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 22.34742112221029\n", " median: 65.60893574407544\n", " upper_quartile: 135.08124174966116\n", " std: 85.09162990442911\n", " mean: 91.9201396254798\n", "\n", "ColumnProfile\n", " name: end_station_name\n", " type: FieldType.STRING\n", "\n", " min: 10th Ave at E 15th St\n", " max: Yerba Buena Center for the Arts (Howard St at 3rd St)\n", " count: 51853.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: end_station_latitude\n", " type: FieldType.DECIMAL\n", "\n", " min: 37.3184498\n", " max: 37.88022244590679\n", " count: 51853.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 37.77450364194883\n", " median: 37.78358862499172\n", " upper_quartile: 37.79555394254664\n", " std: 0.08619152451969307\n", " mean: 37.77190111029278\n", "\n", "ColumnProfile\n", " name: end_station_longitude\n", " type: FieldType.DECIMAL\n", "\n", " min: -122.44429260492325\n", " max: -121.8741186\n", " count: 51853.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: -122.40967464398858\n", " median: -122.39857551157675\n", " upper_quartile: -122.39085540596203\n", " std: 0.10507512085392584\n", " mean: -122.3629776153239\n", "\n", "ColumnProfile\n", " name: bike_id\n", " type: FieldType.DECIMAL\n", "\n", " min: 10.0\n", " max: 3733.0\n", " count: 51853.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 795.8904240211187\n", " median: 1723.039443196501\n", " 
upper_quartile: 2524.901114053501\n", " std: 970.5058870359009\n", " mean: 1674.5133936319962\n", "\n", "ColumnProfile\n", " name: user_type\n", " type: FieldType.STRING\n", "\n", " min: Customer\n", " max: Subscriber\n", " count: 51853.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: member_birth_year\n", " type: FieldType.DECIMAL\n", "\n", " min: 1900.0\n", " max: 1999.0\n", " count: 51853.0\n", " missing_count: 6577.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 1974.2949238618335\n", " median: 1982.7223690704195\n", " upper_quartile: 1988.012942765942\n", " std: 10.414847623452637\n", " mean: 1980.4024648820382\n", "\n", "ColumnProfile\n", " name: member_gender\n", " type: FieldType.STRING\n", "\n", " min: \n", " max: Other\n", " count: 51853.0\n", " missing_count: 0.0\n", " error_count: 0.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sampled_gobike.get_profile()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It appears that we have quite a few missing values in `member_birth_year`. We also immediately see that we have some empty strings in our `member_gender` column. With the data profiler, we can quickly do a sanity check on our dataset and see where we might need to start data cleaning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Derive by example\n", "\n", "Azure ML Data Prep comes with additional \"smart\" transforms created by Microsoft Research. Here, we'll look at how you can derive a new column by providing examples of input-output pairs. Rather than explicitly using regular expressions to extract dates or hours from datetimes, we can provide examples for Azure ML Data Prep to learn what the pattern is. In fact, these smart transformations can also handle more complex derivations like inferring the day of the week from datetimes." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "sgb_derived = sampled_gobike\\\n", " .to_string(\n", " columns=['start_time', 'end_time']\n", " )\\\n", " .derive_column_by_example(\n", " source_columns='start_time',\n", " new_column_name='date',\n", " example_data=[('2017-12-31 16:57:39.6540', '2017-12-31'), ('2017-12-31 16:57:39', '2017-12-31')]\n", " )\\\n", " .derive_column_by_example(\n", " source_columns='start_time',\n", " new_column_name='hour',\n", " example_data=[('2017-12-31 16:57:39.6540', '16')]\n", " )\\\n", " .derive_column_by_example(\n", " source_columns='start_time',\n", " new_column_name='wday',\n", " example_data=[('2017-12-31 16:57:39.6540', 'Sunday')]\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filter our data\n", "\n", "Let's verify that our derivations are correct by doing a bit of spot-checking." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " duration_sec start_time wday hour date \\\n", "0 3456.0 2017-12-30 23:46:13.358000 Saturday 23 2017-12-30 \n", "1 204.0 2017-12-30 23:31:38.904000 Saturday 23 2017-12-30 \n", "2 743.0 2017-12-30 22:35:13.114000 Saturday 22 2017-12-30 \n", "3 328.0 2017-12-30 22:19:28.760000 Saturday 22 2017-12-30 \n", "4 260.0 2017-12-30 21:22:40.116000 Saturday 21 2017-12-30 \n", "\n", " end_time start_station_id \\\n", "0 2017-12-31 00:43:49.469000 75.0 \n", "1 2017-12-30 23:35:03.121000 84.0 \n", "2 2017-12-30 22:47:36.356000 285.0 \n", "3 2017-12-30 22:24:57.489000 5.0 \n", "4 2017-12-30 21:27:00.885000 277.0 \n", "\n", " start_station_name start_station_latitude \\\n", "0 Market St at Franklin St 37.773793 \n", "1 Duboce Park 37.769200 \n", "2 Webster St at O'Farrell St 37.783521 \n", "3 Powell St BART Station (Market St at 5th St) 37.783899 \n", "4 Morrison Ave at Julian St 37.333658 \n", "\n", " start_station_longitude end_station_id end_station_name \\\n", "0 -122.421239 75.0 Market St at Franklin St \n", "1 -122.433812 107.0 17th St at Dolores St \n", "2 -122.431158 97.0 14th St at Mission St \n", "3 -122.408445 64.0 5th St at Brannan St \n", "4 -121.908586 278.0 The Alameda at Bush St \n", "\n", " end_station_latitude end_station_longitude bike_id user_type \\\n", "0 37.773793 -122.421239 1642.0 Subscriber \n", "1 37.763015 -122.426497 2201.0 Subscriber \n", "2 37.768265 -122.420110 1628.0 Subscriber \n", "3 37.776754 -122.399018 2806.0 Subscriber \n", "4 37.331932 -121.904888 465.0 Subscriber \n", "\n", " member_birth_year member_gender \n", "0 1972.0 Male \n", "1 1965.0 Male \n", "2 1993.0 Male \n", "3 1986.0 Male \n", "4 1991.0 Male " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sgb_derived.filter(dprep.col('wday') != 'Sunday').head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also filter on other column types; let's take a peek at rides that lasted over 5 hours." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " duration_sec start_time wday hour date \\\n", "0 80110.0 2017-12-31 16:57:39.654000 Sunday 16 2017-12-31 \n", "1 22587.0 2017-12-31 13:51:04.538000 Sunday 13 2017-12-31 \n", "2 18067.0 2017-12-30 04:20:13.938000 Saturday 04 2017-12-30 \n", "3 54550.0 2017-12-29 10:02:38.086000 Friday 10 2017-12-29 \n", "4 63627.0 2017-12-27 19:12:42.794000 Wednesday 19 2017-12-27 \n", "\n", " end_time start_station_id \\\n", "0 2018-01-01 15:12:50.245000 74.0 \n", "1 2017-12-31 20:07:32.139000 307.0 \n", "2 2017-12-30 09:21:21.628000 70.0 \n", "3 2017-12-30 01:11:48.539000 21.0 \n", "4 2017-12-28 12:53:10.649000 249.0 \n", "\n", " start_station_name start_station_latitude \\\n", "0 Laguna St at Hayes St 37.776435 \n", "1 SAP Center 37.332692 \n", "2 Central Ave at Fell St 37.773311 \n", "3 Montgomery St BART Station (Market St at 2nd St) 37.789625 \n", "4 Russell St at College Ave 37.858473 \n", "\n", " start_station_longitude end_station_id \\\n", "0 -122.426244 43.0 \n", "1 -121.900084 307.0 \n", "2 -122.444293 43.0 \n", "3 -122.400811 84.0 \n", "4 -122.253253 244.0 \n", "\n", " end_station_name end_station_latitude \\\n", "0 San Francisco Public Library (Grove St at Hyde... 37.778768 \n", "1 SAP Center 37.332692 \n", "2 San Francisco Public Library (Grove St at Hyde... 
37.778768 \n", "3 Duboce Park 37.769200 \n", "4 Shattuck Ave at Hearst Ave 37.873792 \n", "\n", " end_station_longitude bike_id user_type member_birth_year member_gender \n", "0 -122.415929 96.0 Customer 1987 Male \n", "1 -121.900084 1443.0 Customer None \n", "2 -122.415929 1928.0 Customer None \n", "3 -122.433812 209.0 Customer None \n", "4 -122.268618 1804.0 Customer 1988 Male " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sgb_derived.filter(dprep.col('duration_sec') > (60 * 60 * 5)).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transform our data\n", "\n", "In addition to \"smart\" transformations, Azure ML Data Prep also supports many common data science transforms familiar from other industry-standard data science libraries. Here, we'll explore the ability to `summarize` and `replace`. We'll also get to use `join` when we handle assertions.\n", "\n", "#### Summarize\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateduration_sec_mean
02017-12-311982.801418
12017-12-301203.766423
22017-12-291287.324841
32017-12-28835.146465
42017-12-271658.735955
\n", "
" ], "text/plain": [ " date duration_sec_mean\n", "0 2017-12-31 1982.801418\n", "1 2017-12-30 1203.766423\n", "2 2017-12-29 1287.324841\n", "3 2017-12-28 835.146465\n", "4 2017-12-27 1658.735955" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sgb_summary = sgb_derived\\\n", " .summarize(\n", " summary_columns=[\n", " dprep\\\n", " .SummaryColumnsValue(\n", " column_id='duration_sec', \n", " summary_column_name='duration_sec_mean', \n", " summary_function=dprep.SummaryFunction.MEAN\n", " )\n", " ],\n", " group_by_columns=['date']\n", " )\n", "sgb_summary.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Azure ML Data Prep also makes it easy to append the output of `summarize` to the original table, matched on the grouping variable, by passing `join_back=True`. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
duration_secstart_timewdayhourdateend_timestart_station_idstart_station_namestart_station_latitudestart_station_longitudeend_station_idend_station_nameend_station_latitudeend_station_longitudebike_iduser_typemember_birth_yearmember_genderduration_sec_mean
080110.02017-12-31 16:57:39.654000Sunday162017-12-312018-01-01 15:12:50.24500074.0Laguna St at Hayes St37.776435-122.42624443.0San Francisco Public Library (Grove St at Hyde...37.778768-122.41592996.0Customer1987Male1982.801418
13292.02017-12-31 23:46:32.403000Sunday232017-12-312018-01-01 00:41:24.605000284.0Yerba Buena Center for the Arts (Howard St at ...37.784872-122.40087622.0Howard St at Beale St37.789756-122.3946433058.0CustomerNone1982.801418
21397.02017-12-31 23:55:09.686000Sunday232017-12-312018-01-01 00:18:26.72100078.0Folsom St at 9th St37.773717-122.41164715.0San Francisco Ferry Building (Harry Bridges Pl...37.795392-122.3942031667.0CustomerNone1982.801418
3422.02017-12-31 23:54:25.337000Sunday232017-12-312018-01-01 00:01:27.354000139.0Garfield Square (25th St at Harrison St)37.751017-122.41190199.0Folsom St at 15th St37.767037-122.4154422415.0Subscriber1985Male1982.801418
41130.02017-12-31 23:36:16.069000Sunday232017-12-312017-12-31 23:55:06.09600066.03rd St at Townsend St37.778742-122.39274123.0The Embarcadero at Steuart St37.791464-122.3910342721.0CustomerNone1982.801418
\n", "
" ], "text/plain": [ " duration_sec start_time wday hour date \\\n", "0 80110.0 2017-12-31 16:57:39.654000 Sunday 16 2017-12-31 \n", "1 3292.0 2017-12-31 23:46:32.403000 Sunday 23 2017-12-31 \n", "2 1397.0 2017-12-31 23:55:09.686000 Sunday 23 2017-12-31 \n", "3 422.0 2017-12-31 23:54:25.337000 Sunday 23 2017-12-31 \n", "4 1130.0 2017-12-31 23:36:16.069000 Sunday 23 2017-12-31 \n", "\n", " end_time start_station_id \\\n", "0 2018-01-01 15:12:50.245000 74.0 \n", "1 2018-01-01 00:41:24.605000 284.0 \n", "2 2018-01-01 00:18:26.721000 78.0 \n", "3 2018-01-01 00:01:27.354000 139.0 \n", "4 2017-12-31 23:55:06.096000 66.0 \n", "\n", " start_station_name start_station_latitude \\\n", "0 Laguna St at Hayes St 37.776435 \n", "1 Yerba Buena Center for the Arts (Howard St at ... 37.784872 \n", "2 Folsom St at 9th St 37.773717 \n", "3 Garfield Square (25th St at Harrison St) 37.751017 \n", "4 3rd St at Townsend St 37.778742 \n", "\n", " start_station_longitude end_station_id \\\n", "0 -122.426244 43.0 \n", "1 -122.400876 22.0 \n", "2 -122.411647 15.0 \n", "3 -122.411901 99.0 \n", "4 -122.392741 23.0 \n", "\n", " end_station_name end_station_latitude \\\n", "0 San Francisco Public Library (Grove St at Hyde... 37.778768 \n", "1 Howard St at Beale St 37.789756 \n", "2 San Francisco Ferry Building (Harry Bridges Pl... 
37.795392 \n", "3 Folsom St at 15th St 37.767037 \n", "4 The Embarcadero at Steuart St 37.791464 \n", "\n", " end_station_longitude bike_id user_type member_birth_year member_gender \\\n", "0 -122.415929 96.0 Customer 1987 Male \n", "1 -122.394643 3058.0 Customer None \n", "2 -122.394203 1667.0 Customer None \n", "3 -122.415442 2415.0 Subscriber 1985 Male \n", "4 -122.391034 2721.0 Customer None \n", "\n", " duration_sec_mean \n", "0 1982.801418 \n", "1 1982.801418 \n", "2 1982.801418 \n", "3 1982.801418 \n", "4 1982.801418 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sgb_appended = sgb_derived\\\n", " .summarize(\n", " summary_columns=[\n", " dprep\\\n", " .SummaryColumnsValue(\n", " column_id='duration_sec', \n", " summary_column_name='duration_sec_mean', \n", " summary_function=dprep.SummaryFunction.MEAN\n", " )\n", " ],\n", " group_by_columns=['date'],\n", " join_back=True\n", " )\n", "sgb_appended.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Replace\n", "\n", "Recall that our `member_gender` column had empty strings that stood in place of `None`. Let's use the `replace_na` function to recode them as proper `None` values." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
duration_secstart_timeend_timestart_station_idstart_station_namestart_station_latitudestart_station_longitudeend_station_idend_station_nameend_station_latitudeend_station_longitudebike_iduser_typemember_birth_yearmember_gender
080110.02017-12-31 16:57:39.654000+00:002018-01-01 15:12:50.245000+00:0074.0Laguna St at Hayes St37.776435-122.42624443.0San Francisco Public Library (Grove St at Hyde...37.778768-122.41592996.0Customer1987Male
13292.02017-12-31 23:46:32.403000+00:002018-01-01 00:41:24.605000+00:00284.0Yerba Buena Center for the Arts (Howard St at ...37.784872-122.40087622.0Howard St at Beale St37.789756-122.3946433058.0CustomerNoneNone
21397.02017-12-31 23:55:09.686000+00:002018-01-01 00:18:26.721000+00:0078.0Folsom St at 9th St37.773717-122.41164715.0San Francisco Ferry Building (Harry Bridges Pl...37.795392-122.3942031667.0CustomerNoneNone
3422.02017-12-31 23:54:25.337000+00:002018-01-01 00:01:27.354000+00:00139.0Garfield Square (25th St at Harrison St)37.751017-122.41190199.0Folsom St at 15th St37.767037-122.4154422415.0Subscriber1985Male
41130.02017-12-31 23:36:16.069000+00:002017-12-31 23:55:06.096000+00:0066.03rd St at Townsend St37.778742-122.39274123.0The Embarcadero at Steuart St37.791464-122.3910342721.0CustomerNoneNone
\n", "
" ], "text/plain": [ " duration_sec start_time \\\n", "0 80110.0 2017-12-31 16:57:39.654000+00:00 \n", "1 3292.0 2017-12-31 23:46:32.403000+00:00 \n", "2 1397.0 2017-12-31 23:55:09.686000+00:00 \n", "3 422.0 2017-12-31 23:54:25.337000+00:00 \n", "4 1130.0 2017-12-31 23:36:16.069000+00:00 \n", "\n", " end_time start_station_id \\\n", "0 2018-01-01 15:12:50.245000+00:00 74.0 \n", "1 2018-01-01 00:41:24.605000+00:00 284.0 \n", "2 2018-01-01 00:18:26.721000+00:00 78.0 \n", "3 2018-01-01 00:01:27.354000+00:00 139.0 \n", "4 2017-12-31 23:55:06.096000+00:00 66.0 \n", "\n", " start_station_name start_station_latitude \\\n", "0 Laguna St at Hayes St 37.776435 \n", "1 Yerba Buena Center for the Arts (Howard St at ... 37.784872 \n", "2 Folsom St at 9th St 37.773717 \n", "3 Garfield Square (25th St at Harrison St) 37.751017 \n", "4 3rd St at Townsend St 37.778742 \n", "\n", " start_station_longitude end_station_id \\\n", "0 -122.426244 43.0 \n", "1 -122.400876 22.0 \n", "2 -122.411647 15.0 \n", "3 -122.411901 99.0 \n", "4 -122.392741 23.0 \n", "\n", " end_station_name end_station_latitude \\\n", "0 San Francisco Public Library (Grove St at Hyde... 37.778768 \n", "1 Howard St at Beale St 37.789756 \n", "2 San Francisco Ferry Building (Harry Bridges Pl... 
37.795392 \n", "3 Folsom St at 15th St 37.767037 \n", "4 The Embarcadero at Steuart St 37.791464 \n", "\n", " end_station_longitude bike_id user_type member_birth_year member_gender \n", "0 -122.415929 96.0 Customer 1987 Male \n", "1 -122.394643 3058.0 Customer None None \n", "2 -122.394203 1667.0 Customer None None \n", "3 -122.415442 2415.0 Subscriber 1985 Male \n", "4 -122.391034 2721.0 Customer None None " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sgb_replaced = sampled_gobike.replace_na(columns=['member_gender'])\n", "sgb_replaced.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assert on invalid data \n", "\n", "Azure ML Data Prep helps prevent broken pipelines and safeguard against bad data by supporting assertions. In our case, we'll create assertions to handle potentially erroneous `member_birth_year` values. The oldest person on record is no more than 130 years old, so any birth year listed as earlier than 1900 is wrong. Our `sampled_gobike` dataset doesn't have any such values, but if we assumed the same of the full `gobike` dataset, we would fail on it. Azure ML Data Prep allows us to handle these cases gracefully with assertions." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TypeMinMaxCountMissing CountError CountLower QuartileUpper QuartileStandard DeviationMean
duration_secFieldType.DECIMAL6186369519700.00.00.0381.842938.5743444.151099.01
start_timeFieldType.DATE2017-06-28 09:47:36.347000+00:002017-12-31 23:59:01.261000+00:00519700.00.00.0
end_timeFieldType.DATE2017-06-28 09:52:55.338000+00:002018-01-01 15:12:50.245000+00:00519700.00.00.0
start_station_idFieldType.DECIMAL3340519700.00.00.023.8481139.42486.083195.0342
start_station_nameFieldType.STRING10th Ave at E 15th StYerba Buena Center for the Arts (Howard St at ...519700.00.00.0
start_station_latitudeFieldType.DECIMAL37.317337.8802519700.00.00.037.773637.79530.08630537.7717
start_station_longitudeFieldType.DECIMAL-122.444-121.874519700.00.00.0-122.412-122.3910.105573-122.364
end_station_idFieldType.DECIMAL3340519700.00.00.022.7024134.2284.969592.184
end_station_nameFieldType.STRING10th Ave at E 15th StYerba Buena Center for the Arts (Howard St at ...519700.00.00.0
end_station_latitudeFieldType.DECIMAL37.317337.8802519700.00.00.037.774237.79560.086223837.7718
end_station_longitudeFieldType.DECIMAL-122.444-121.874519700.00.00.0-122.41-122.3910.105122-122.363
bike_idFieldType.DECIMAL103733519700.00.00.0788.6792519.96971.3571672.53
user_typeFieldType.STRINGCustomerSubscriber519700.00.00.0
member_birth_yearFieldType.DECIMAL19001999519700.066541.02.01974.331987.9910.51161980.41
member_genderFieldType.STRINGOther519700.00.00.0
" ], "text/plain": [ "ColumnProfile\n", " name: duration_sec\n", " type: FieldType.DECIMAL\n", "\n", " min: 61.0\n", " max: 86369.0\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 381.8421435321134\n", " median: 595.9837506906349\n", " upper_quartile: 938.5741138032683\n", " std: 3444.146451247386\n", " mean: 1099.009520877422\n", "\n", "ColumnProfile\n", " name: start_time\n", " type: FieldType.DATE\n", "\n", " min: 2017-06-28 09:47:36.347000+00:00\n", " max: 2017-12-31 23:59:01.261000+00:00\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: end_time\n", " type: FieldType.DATE\n", "\n", " min: 2017-06-28 09:52:55.338000+00:00\n", " max: 2018-01-01 15:12:50.245000+00:00\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: start_station_id\n", " type: FieldType.DECIMAL\n", "\n", " min: 3.0\n", " max: 340.0\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 23.848148131600635\n", " median: 67.18427817406452\n", " upper_quartile: 139.42430180307275\n", " std: 86.08307797095921\n", " mean: 95.03424475658852\n", "\n", "ColumnProfile\n", " name: start_station_name\n", " type: FieldType.STRING\n", "\n", " min: 10th Ave at E 15th St\n", " max: Yerba Buena Center for the Arts (Howard St at 3rd St)\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: start_station_latitude\n", " type: FieldType.DECIMAL\n", "\n", " min: 37.3172979\n", " max: 37.88022244590679\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 37.7735913559721\n", " median: 37.783211475877295\n", " upper_quartile: 37.79531236950411\n", " std: 0.08630496061661774\n", " mean: 37.771652603110894\n", "\n", "ColumnProfile\n", " name: start_station_longitude\n", " type: FieldType.DECIMAL\n", "\n", " min: 
-122.44429260492325\n", " max: -121.8741186\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: -122.41170653070694\n", " median: -122.39875282257843\n", " upper_quartile: -122.39103429266093\n", " std: 0.10557344899193394\n", " mean: -122.36392726512949\n", "\n", "ColumnProfile\n", " name: end_station_id\n", " type: FieldType.DECIMAL\n", "\n", " min: 3.0\n", " max: 340.0\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 22.702361193995444\n", " median: 65.22613324081779\n", " upper_quartile: 134.21987129021295\n", " std: 84.9694914863546\n", " mean: 92.18404079276426\n", "\n", "ColumnProfile\n", " name: end_station_name\n", " type: FieldType.STRING\n", "\n", " min: 10th Ave at E 15th St\n", " max: Yerba Buena Center for the Arts (Howard St at 3rd St)\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: end_station_latitude\n", " type: FieldType.DECIMAL\n", "\n", " min: 37.3172979\n", " max: 37.88022244590679\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 37.774232065528906\n", " median: 37.78329810124021\n", " upper_quartile: 37.79557128475191\n", " std: 0.08622383487119635\n", " mean: 37.771843749644646\n", "\n", "ColumnProfile\n", " name: end_station_longitude\n", " type: FieldType.DECIMAL\n", "\n", " min: -122.44429260492325\n", " max: -121.8741186\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: -122.41012752595213\n", " median: -122.39855511689811\n", " upper_quartile: -122.39096192032446\n", " std: 0.10512220222934929\n", " mean: -122.36323553679931\n", "\n", "ColumnProfile\n", " name: bike_id\n", " type: FieldType.DECIMAL\n", "\n", " min: 10.0\n", " max: 3733.0\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", " lower_quartile: 788.6785454424829\n", " median: 1726.652793720984\n", " 
upper_quartile: 2519.963581272433\n", " std: 971.3569593530214\n", " mean: 1672.533078699254\n", "\n", "ColumnProfile\n", " name: user_type\n", " type: FieldType.STRING\n", "\n", " min: Customer\n", " max: Subscriber\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0\n", "\n", "ColumnProfile\n", " name: member_birth_year\n", " type: FieldType.DECIMAL\n", "\n", " min: 1900.0\n", " max: 1999.0\n", " count: 519700.0\n", " missing_count: 66541.0\n", " error_count: 2.0\n", "\n", " lower_quartile: 1974.3343079021402\n", " median: 1982.8008012973817\n", " upper_quartile: 1987.991638371539\n", " std: 10.511639915765766\n", " mean: 1980.4052039359563\n", "\n", "ColumnProfile\n", " name: member_gender\n", " type: FieldType.STRING\n", "\n", " min: \n", " max: Other\n", " count: 519700.0\n", " missing_count: 0.0\n", " error_count: 0.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gb_asserted = gobike\\\n", " .assert_value(\n", " columns='member_birth_year', \n", " expression=dprep.f_or(dprep.value.is_null(), dprep.value >= 1900),\n", " error_code='InvalidDate'\n", " )\n", "gb_asserted.get_profile()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can filter to see what caused the 2 errors above:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
duration_secstart_timeend_timestart_station_idstart_station_namestart_station_latitudestart_station_longitudeend_station_idend_station_nameend_station_latitudeend_station_longitudebike_iduser_typemember_birth_yearmember_gender
02546.02017-08-19 17:47:32.110000+00:002017-08-19 18:29:58.825000+00:00197.0El Embarcadero at Grand Ave37.808848-122.24968172.0College Ave at Taft Ave37.841800-122.2515351448.0Customerazureml.dataprep.native.DataPrepError(\"'Invali...Male
11767.02017-08-19 13:20:02.170000+00:002017-08-19 13:49:29.735000+00:00235.0Union St at 10th St37.807239-122.28937197.0El Embarcadero at Grand Ave37.808848-122.24968091.0Customerazureml.dataprep.native.DataPrepError(\"'Invali...Male
\n", "
" ], "text/plain": [ " duration_sec start_time \\\n", "0 2546.0 2017-08-19 17:47:32.110000+00:00 \n", "1 1767.0 2017-08-19 13:20:02.170000+00:00 \n", "\n", " end_time start_station_id \\\n", "0 2017-08-19 18:29:58.825000+00:00 197.0 \n", "1 2017-08-19 13:49:29.735000+00:00 235.0 \n", "\n", " start_station_name start_station_latitude \\\n", "0 El Embarcadero at Grand Ave 37.808848 \n", "1 Union St at 10th St 37.807239 \n", "\n", " start_station_longitude end_station_id end_station_name \\\n", "0 -122.24968 172.0 College Ave at Taft Ave \n", "1 -122.28937 197.0 El Embarcadero at Grand Ave \n", "\n", " end_station_latitude end_station_longitude bike_id user_type \\\n", "0 37.841800 -122.251535 1448.0 Customer \n", "1 37.808848 -122.249680 91.0 Customer \n", "\n", " member_birth_year member_gender \n", "0 azureml.dataprep.native.DataPrepError(\"'Invali... Male \n", "1 azureml.dataprep.native.DataPrepError(\"'Invali... Male " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gb_errors = gb_asserted.filter(dprep.col('member_birth_year').is_error())\n", "gb_errors.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Join\n", "But what were the original values? Let's use `join` to figure out what the values were that caused our assert to throw an error. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
l_duration_secl_start_timel_end_timel_start_station_idl_start_station_namel_start_station_latitudel_start_station_longitudel_end_station_idl_end_station_namel_end_station_latitude...r_start_station_latituder_start_station_longituder_end_station_idr_end_station_namer_end_station_latituder_end_station_longituder_bike_idr_user_typer_member_birth_yearr_member_gender
02546.02017-08-19 17:47:32.110000+00:002017-08-19 18:29:58.825000+00:00197.0El Embarcadero at Grand Ave37.808848-122.24968172.0College Ave at Taft Ave37.841800...37.808848-122.24968172.0College Ave at Taft Ave37.841800-122.2515351448.0Customer1886.0Male
11767.02017-08-19 13:20:02.170000+00:002017-08-19 13:49:29.735000+00:00235.0Union St at 10th St37.807239-122.28937197.0El Embarcadero at Grand Ave37.808848...37.807239-122.28937197.0El Embarcadero at Grand Ave37.808848-122.24968091.0Customer1886.0Male
\n", "

2 rows × 30 columns

\n", "
" ], "text/plain": [ " l_duration_sec l_start_time \\\n", "0 2546.0 2017-08-19 17:47:32.110000+00:00 \n", "1 1767.0 2017-08-19 13:20:02.170000+00:00 \n", "\n", " l_end_time l_start_station_id \\\n", "0 2017-08-19 18:29:58.825000+00:00 197.0 \n", "1 2017-08-19 13:49:29.735000+00:00 235.0 \n", "\n", " l_start_station_name l_start_station_latitude \\\n", "0 El Embarcadero at Grand Ave 37.808848 \n", "1 Union St at 10th St 37.807239 \n", "\n", " l_start_station_longitude l_end_station_id l_end_station_name \\\n", "0 -122.24968 172.0 College Ave at Taft Ave \n", "1 -122.28937 197.0 El Embarcadero at Grand Ave \n", "\n", " l_end_station_latitude ... r_start_station_latitude \\\n", "0 37.841800 ... 37.808848 \n", "1 37.808848 ... 37.807239 \n", "\n", " r_start_station_longitude r_end_station_id r_end_station_name \\\n", "0 -122.24968 172.0 College Ave at Taft Ave \n", "1 -122.28937 197.0 El Embarcadero at Grand Ave \n", "\n", " r_end_station_latitude r_end_station_longitude r_bike_id r_user_type \\\n", "0 37.841800 -122.251535 1448.0 Customer \n", "1 37.808848 -122.249680 91.0 Customer \n", "\n", " r_member_birth_year r_member_gender \n", "0 1886.0 Male \n", "1 1886.0 Male \n", "\n", "[2 rows x 30 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gb_errors.join(\n", " left_dataflow=gb_errors,\n", " right_dataflow=gobike,\n", " join_key_pairs=[\n", " ('duration_sec', 'duration_sec'),\n", " ('start_station_id', 'start_station_id'),\n", " ('bike_id', 'bike_id')\n", " ]\n", ").head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we look at `r_member_birth_year`, we see that these people were listed as being born in 1886. That's impossible! Now that we've identified outliers and anomalies, we can appropriately clean our data however we like." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Export for machine learning\n", "\n", "One of the beautiful features of Azure ML Data Prep is that you only need to write your code once and choose whether to scale up or out; it takes care of figuring out how. To do so, you can export the `.dprep` file you've written and tested on a smaller dataset, then run it with your larger dataset. Here, we show how you can export your new package. For a more detailed example of how to execute it on Spark, check out our [New York Taxicab scenario](https://github.com/Microsoft/PendletonDocs/blob/master/Scenarios/NYTaxiCab/01.new_york_taxi.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gobike = gobike.set_name(name=\"gobike\")\n", "package_path = path.join(mkdtemp(), \"gobike.dprep\")\n", "\n", "print(\"Saving package to: {}\".format(package_path))\n", "package = dprep.Package(arg=gobike)\n", "package.save(file_path=package_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Want more information?\n", "\n", "Congratulations on finishing your introduction to the Azure ML Data Prep SDK! If you'd like more detailed tutorials on how to construct machine learning datasets, or want to dive deeper into all of its functionality, you can find them in our notebooks [here](https://github.com/Microsoft/PendletonDocs). 
There, we cover topics including how to:\n", "\n", "* Cache your Dataflow to speed up your iterations\n", "* Add your custom Python transforms\n", "* Impute missing values\n", "* Sample your data\n", "* Reference and link between Dataflows\n", "* Apply your Dataflow to a new, larger data source" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }