Update 12-1-19
@@ -1,427 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
|
||||
"\n",
|
||||
"Licensed under the MIT License."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Tutorial: Train a classification model with automated machine learning\n",
|
||||
"\n",
|
||||
"In this tutorial, you'll learn how to generate a machine learning model using automated machine learning (automated ML). Azure Machine Learning can perform algorithm selection and hyperparameter selection in an automated way for you. The final model can then be deployed following the workflow in the [Deploy a model](02.deploy-models.ipynb) tutorial.\n",
|
||||
"\n",
|
||||
"[flow diagram](./imgs/flow2.png)\n",
|
||||
"\n",
|
||||
"Similar to the [train models tutorial](01.train-models.ipynb), this tutorial classifies handwritten images of digits (0-9) from the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. But this time you don't to specify an algorithm or tune hyperparameters. The automated ML technique iterates over many combinations of algorithms and hyperparameters until it finds the best model based on your criterion.\n",
|
||||
"\n",
|
||||
"You'll learn how to:\n",
|
||||
"\n",
|
||||
"> * Set up your development environment\n",
|
||||
"> * Access and examine the data\n",
|
||||
"> * Train using an automated classifier locally with custom parameters\n",
|
||||
"> * Explore the results\n",
|
||||
"> * Review training results\n",
|
||||
"> * Register the best model\n",
|
||||
"\n",
|
||||
"## Prerequisites\n",
|
||||
"\n",
|
||||
"Use [these instructions](https://aka.ms/aml-how-to-configure-environment) to: \n",
|
||||
"* Create a workspace and its configuration file (**config.json**) \n",
|
||||
"* Upload your **config.json** to the same folder as this notebook"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Start a notebook\n",
|
||||
"\n",
|
||||
"To follow along, start a new notebook from the same directory as **config.json** and copy the code from the sections below.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"## Set up your development environment\n",
|
||||
"\n",
|
||||
"All the setup for your development work can be accomplished in the Python notebook. Setup includes:\n",
|
||||
"\n",
|
||||
"* Import Python packages\n",
|
||||
"* Configure a workspace to enable communication between your local computer and remote resources\n",
|
||||
"* Create a directory to store training scripts\n",
|
||||
"\n",
|
||||
"### Import packages\n",
|
||||
"Import Python packages you need in this tutorial."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml.core\n",
|
||||
"import pandas as pd\n",
|
||||
"from azureml.core.workspace import Workspace\n",
|
||||
"from azureml.train.automl.run import AutoMLRun\n",
|
||||
"import time\n",
|
||||
"import logging\n",
|
||||
"from sklearn import datasets\n",
|
||||
"from matplotlib import pyplot as plt\n",
|
||||
"from matplotlib.pyplot import imshow\n",
|
||||
"import random\n",
|
||||
"import numpy as np"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Configure workspace\n",
|
||||
"\n",
|
||||
"Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **aml_config/config.json** and loads the details into an object named `ws`. `ws` is used throughout the rest of the code in this tutorial.\n",
|
||||
"\n",
|
||||
"Once you have a workspace object, specify a name for the experiment and create and register a local directory with the workspace. The history of all runs is recorded under the specified experiment."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ws = Workspace.from_config()\n",
|
||||
"# choose a name for the run history container in the workspace\n",
|
||||
"experiment_name = 'automl-classifier'\n",
|
||||
"# project folder\n",
|
||||
"project_folder = './automl-classifier'\n",
|
||||
"\n",
|
||||
"import os\n",
|
||||
"\n",
|
||||
"output = {}\n",
|
||||
"output['SDK version'] = azureml.core.VERSION\n",
|
||||
"output['Subscription ID'] = ws.subscription_id\n",
|
||||
"output['Workspace'] = ws.name\n",
|
||||
"output['Resource Group'] = ws.resource_group\n",
|
||||
"output['Location'] = ws.location\n",
|
||||
"output['Project Directory'] = project_folder\n",
|
||||
"pd.set_option('display.max_colwidth', -1)\n",
|
||||
"pd.DataFrame(data=output, index=['']).T"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Explore data\n",
|
||||
"\n",
|
||||
"The initial training tutorial used a high-resolution version of the MNIST dataset (28x28 pixels). Since auto training requires many iterations, this tutorial uses a smaller resolution version of the images (8x8 pixels) to demonstrate the concepts while speeding up the time needed for each iteration."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sklearn import datasets\n",
|
||||
"\n",
|
||||
"digits = datasets.load_digits()\n",
|
||||
"\n",
|
||||
"# Exclude the first 100 rows from training so that they can be used for test.\n",
|
||||
"X_train = digits.data[100:,:]\n",
|
||||
"y_train = digits.target[100:]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Display some sample images\n",
|
||||
"\n",
|
||||
"Load the data into `numpy` arrays. Then use `matplotlib` to plot 30 random images from the dataset with their labels above them."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"count = 0\n",
|
||||
"sample_size = 30\n",
|
||||
"plt.figure(figsize = (16, 6))\n",
|
||||
"for i in np.random.permutation(X_train.shape[0])[:sample_size]:\n",
|
||||
" count = count + 1\n",
|
||||
" plt.subplot(1, sample_size, count)\n",
|
||||
" plt.axhline('')\n",
|
||||
" plt.axvline('')\n",
|
||||
" plt.text(x = 2, y = -2, s = y_train[i], fontsize = 18)\n",
|
||||
" plt.imshow(X_train[i].reshape(8, 8), cmap = plt.cm.Greys)\n",
|
||||
"plt.show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You now have the necessary packages and data ready for auto training for your model. \n",
|
||||
"\n",
|
||||
"## Auto train a model \n",
|
||||
"\n",
|
||||
"To auto train a model, first define settings for autogeneration and tuning and then run the automatic classifier.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"### Define settings for autogeneration and tuning\n",
|
||||
"\n",
|
||||
"Define the experiment parameters and models settings for autogeneration and tuning. \n",
|
||||
"\n",
|
||||
"\n",
|
||||
"|Property| Value in this tutorial |Description|\n",
|
||||
"|----|----|---|\n",
|
||||
"|**primary_metric**|AUC Weighted | Metric that you want to optimize.|\n",
|
||||
"|**max_time_sec**|12,000|Time limit in seconds for each iteration|\n",
|
||||
"|**iterations**|20|Number of iterations. In each iteration, the model trains with the data with a specific pipeline|\n",
|
||||
"|**n_cross_validations**|3|Number of cross validation splits|\n",
|
||||
"|**exit_score**|0.9985|*double* value indicating the target for *primary_metric*. Once the target is surpassed the run terminates|\n",
|
||||
"|**blacklist_algos**|['kNN','LinearSVM']|*Array* of *strings* indicating algorithms to ignore.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"configure automl"
|
||||
]
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.train.automl import AutoMLConfig\n",
|
||||
"\n",
|
||||
"##Local compute \n",
|
||||
"Automl_config = AutoMLConfig(task = 'classification',\n",
|
||||
" primary_metric = 'AUC_weighted',\n",
|
||||
" max_time_sec = 12000,\n",
|
||||
" iterations = 20,\n",
|
||||
" n_cross_validations = 3,\n",
|
||||
" exit_score = 0.9985,\n",
|
||||
" blacklist_algos = ['kNN','LinearSVM'],\n",
|
||||
" X = X_train,\n",
|
||||
" y = y_train,\n",
|
||||
" path=project_folder)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Run the automatic classifier\n",
|
||||
"\n",
|
||||
"Start the experiment to run locally. Define the compute target as local and set the output to true to view progress on the experiment."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"local submitted run",
|
||||
"automl"
|
||||
]
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.experiment import Experiment\n",
|
||||
"experiment=Experiment(ws, experiment_name)\n",
|
||||
"local_run = experiment.submit(Automl_config, show_output=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Explore the results\n",
|
||||
"\n",
|
||||
"Explore the results of automatic training with a Jupyter widget or by examining the experiment history.\n",
|
||||
"\n",
|
||||
"### Jupyter widget\n",
|
||||
"\n",
|
||||
"Use the Jupyter notebook widget to see a graph and a table of all results."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"use notebook widget"
|
||||
]
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.widgets import RunDetails\n",
|
||||
"RunDetails(local_run).show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Retrieve all iterations\n",
|
||||
"\n",
|
||||
"View the experiment history and see individual metrics for each iteration run."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"get metrics",
|
||||
"query history"
|
||||
]
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"children = list(local_run.get_children())\n",
|
||||
"metricslist = {}\n",
|
||||
"for run in children:\n",
|
||||
" properties = run.get_properties()\n",
|
||||
" metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
|
||||
" metricslist[int(properties['iteration'])] = metrics\n",
|
||||
"\n",
|
||||
"import pandas as pd\n",
|
||||
"rundata = pd.DataFrame(metricslist).sort_index(1)\n",
|
||||
"rundata"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Register the best model \n",
|
||||
"\n",
|
||||
"Use the `local_run` object to get the best model and register it into the workspace. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"query history",
|
||||
"register model from history"
|
||||
]
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# find the run with the highest accuracy value.\n",
|
||||
"best_run, fitted_model = local_run.get_output()\n",
|
||||
"\n",
|
||||
"# register model in workspace\n",
|
||||
"description = 'Automated Machine Learning Model'\n",
|
||||
"tags = None\n",
|
||||
"local_run.register_model(description=description, tags=tags)\n",
|
||||
"local_run.model_id # Use this id to deploy the model as a web service in Azure"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Test the best model\n",
|
||||
"\n",
|
||||
"Use the model to predict a few random digits. Display the predicted and the image. Red font and inverse image (white on black) is used to highlight the misclassified samples.\n",
|
||||
"\n",
|
||||
"Since the model accuracy is high, you might have to run the following code a few times before you can see a misclassified sample."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# find 30 random samples from test set\n",
|
||||
"n = 30\n",
|
||||
"X_test = digits.data[:100, :]\n",
|
||||
"y_test = digits.target[:100]\n",
|
||||
"sample_indices = np.random.permutation(X_test.shape[0])[0:n]\n",
|
||||
"test_samples = X_test[sample_indices]\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# predict using the model\n",
|
||||
"result = fitted_model.predict(test_samples)\n",
|
||||
"\n",
|
||||
"# compare actual value vs. the predicted values:\n",
|
||||
"i = 0\n",
|
||||
"plt.figure(figsize = (20, 1))\n",
|
||||
"\n",
|
||||
"for s in sample_indices:\n",
|
||||
" plt.subplot(1, n, i + 1)\n",
|
||||
" plt.axhline('')\n",
|
||||
" plt.axvline('')\n",
|
||||
" \n",
|
||||
" # use different color for misclassified sample\n",
|
||||
" font_color = 'red' if y_test[s] != result[i] else 'black'\n",
|
||||
" clr_map = plt.cm.gray if y_test[s] != result[i] else plt.cm.Greys\n",
|
||||
" \n",
|
||||
" plt.text(x = 2, y = -2, s = result[i], fontsize = 18, color = font_color)\n",
|
||||
" plt.imshow(X_test[s].reshape(8, 8), cmap = clr_map)\n",
|
||||
" \n",
|
||||
" i = i + 1\n",
|
||||
"plt.show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Next steps\n",
|
||||
"\n",
|
||||
"In this Azure Machine Learning tutorial, you used Python to:\n",
|
||||
"\n",
|
||||
"> * Set up your development environment\n",
|
||||
"> * Access and examine the data\n",
|
||||
"> * Train using an automated classifier locally with custom parameters\n",
|
||||
"> * Explore the results\n",
|
||||
"> * Review training results\n",
|
||||
"> * Register the best model\n",
|
||||
"\n",
|
||||
"Learn more about [how to configure settings for automatic training](https://aka.ms/aml-how-to-configure-auto) or [how to use automatic training on a remote resource](https://aka.ms/aml-how-to-auto-remote)."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "jeffshep"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.6"
|
||||
},
|
||||
"msauthor": "sgilley"
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
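As a reading aid for the settings table in the notebook above, the interaction between `iterations` and `exit_score` can be sketched in plain Python. This is only an illustration under stated assumptions: `run_iterations` and the list of scores are hypothetical stand-ins, and the automated ML service selects pipelines and computes the primary metric itself.

```python
def run_iterations(scores, iterations=20, exit_score=0.9985):
    """Hypothetical stand-in for an automated ML loop.

    `scores` plays the role of the primary metric produced by each
    candidate pipeline; the real service computes these itself.
    Returns (best iteration index, best score, iterations actually run).
    """
    best_iteration, best_score = None, float("-inf")
    n_run = 0
    for i, score in enumerate(scores[:iterations]):
        n_run += 1
        if score > best_score:
            best_iteration, best_score = i, score
        # Stop early once the target for the primary metric is surpassed.
        if best_score > exit_score:
            break
    return best_iteration, best_score, n_run


best_iteration, best_score, n_run = run_iterations([0.91, 0.97, 0.9990, 0.95])
print(best_iteration, best_score, n_run)
```

In this sketch the fourth candidate is never evaluated, because the third one already exceeds `exit_score`.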
|
||||
@@ -1,561 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
|
||||
"\n",
|
||||
"Licensed under the MIT License."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Tutorial: Use Azure DataPrep SDK to prepare data for machine learning"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Prepare data for use as a training data set in a machine learning model with the Azure DataPrep SDK. Perform various transformations to filter and combine two different NYC Taxi data sets. Learn some of the unique features of the DataPrep SDK: \n",
|
||||
"\n",
|
||||
"* Transform data from derived examples \n",
|
||||
"* Infer field type from data \n",
|
||||
"\n",
|
||||
"This tutorial is part one of a two-part tutorial series."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this tutorial you:\n",
|
||||
"* Load two datasets with different field names \n",
|
||||
"* Cleanse the data \n",
|
||||
"* Use smart transforms to predict your logic based on an example\n",
|
||||
"* Use automated feature engineering to build dynamic fields \n",
|
||||
"* Merge the two datasets to use for your machine learning training \n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Import packages\n",
|
||||
"Begin by importing the Azure DataPrep SDK."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml.dataprep as dprep"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Load data\n",
|
||||
"Download two different NYC Taxi data sets into dataflow objects. These datasets contain slightly different fields. The method `auto_read_file()` automatically recognizes the input file type."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dataset_root = \"https://dprepdata.blob.core.windows.net/demo\"\n",
|
||||
"\n",
|
||||
"green_path = \"/\".join([dataset_root, \"green-small/*\"])\n",
|
||||
"yellow_path = \"/\".join([dataset_root, \"yellow-small/*\"])\n",
|
||||
"\n",
|
||||
"green_df = dprep.read_csv(path=green_path, header=dprep.PromoteHeadersMode.GROUPED)\n",
|
||||
"yellow_df = dprep.auto_read_file(path=yellow_path)\n",
|
||||
"\n",
|
||||
"display(green_df.head(5))\n",
|
||||
"display(yellow_df.head(5))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Data transformation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now you populate some variables with shortcut transforms that will apply to all dataflows. The variable `drop_if_all_null` will be used to delete records where all fields are null. The variable `useful_columns` holds an array of column descriptions that are retained in each dataflow."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"all_columns = dprep.ColumnSelector(term=\".*\", use_regex=True)\n",
|
||||
"drop_if_all_null = [all_columns, dprep.ColumnRelationship(dprep.ColumnRelationship.ALL)]\n",
|
||||
"useful_columns = [\n",
|
||||
" \"cost\", \"distance\"\"distance\", \"dropoff_datetime\", \"dropoff_latitude\", \"dropoff_longitude\",\n",
|
||||
" \"passengers\", \"pickup_datetime\", \"pickup_latitude\", \"pickup_longitude\", \"store_forward\", \"vendor\"\n",
|
||||
"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You first work with the green taxi data and get it into a valid shape that can be combined with the yellow taxi data. Create a temporary dataflow `tmp_df`, and call the `replace_na()`, `drop_nulls()`, and `keep_columns()` functions using the shortcut transform variables you created. Additionally, rename all the columns in the dataframe to match the names in `useful_columns`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tmp_df = (green_df\n",
|
||||
" .replace_na(columns=all_columns)\n",
|
||||
" .drop_nulls(*drop_if_all_null)\n",
|
||||
" .rename_columns(column_pairs={\n",
|
||||
" \"VendorID\": \"vendor\",\n",
|
||||
" \"lpep_pickup_datetime\": \"pickup_datetime\",\n",
|
||||
" \"Lpep_dropoff_datetime\": \"dropoff_datetime\",\n",
|
||||
" \"lpep_dropoff_datetime\": \"dropoff_datetime\",\n",
|
||||
" \"Store_and_fwd_flag\": \"store_forward\",\n",
|
||||
" \"store_and_fwd_flag\": \"store_forward\",\n",
|
||||
" \"Pickup_longitude\": \"pickup_longitude\",\n",
|
||||
" \"Pickup_latitude\": \"pickup_latitude\",\n",
|
||||
" \"Dropoff_longitude\": \"dropoff_longitude\",\n",
|
||||
" \"Dropoff_latitude\": \"dropoff_latitude\",\n",
|
||||
" \"Passenger_count\": \"passengers\",\n",
|
||||
" \"Fare_amount\": \"cost\",\n",
|
||||
" \"Trip_distance\": \"distance\"\n",
|
||||
" })\n",
|
||||
" .keep_columns(columns=useful_columns))\n",
|
||||
"tmp_df.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Overwrite the `green_df` variable with the transforms performed on `tmp_df` in the previous step."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"green_df = tmp_df"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Perform the same transformation steps to the yellow taxi data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tmp_df = (yellow_df\n",
|
||||
" .replace_na(columns=all_columns)\n",
|
||||
" .drop_nulls(*drop_if_all_null)\n",
|
||||
" .rename_columns(column_pairs={\n",
|
||||
" \"vendor_name\": \"vendor\",\n",
|
||||
" \"VendorID\": \"vendor\",\n",
|
||||
" \"vendor_id\": \"vendor\",\n",
|
||||
" \"Trip_Pickup_DateTime\": \"pickup_datetime\",\n",
|
||||
" \"tpep_pickup_datetime\": \"pickup_datetime\",\n",
|
||||
" \"Trip_Dropoff_DateTime\": \"dropoff_datetime\",\n",
|
||||
" \"tpep_dropoff_datetime\": \"dropoff_datetime\",\n",
|
||||
" \"store_and_forward\": \"store_forward\",\n",
|
||||
" \"store_and_fwd_flag\": \"store_forward\",\n",
|
||||
" \"Start_Lon\": \"pickup_longitude\",\n",
|
||||
" \"Start_Lat\": \"pickup_latitude\",\n",
|
||||
" \"End_Lon\": \"dropoff_longitude\",\n",
|
||||
" \"End_Lat\": \"dropoff_latitude\",\n",
|
||||
" \"Passenger_Count\": \"passengers\",\n",
|
||||
" \"passenger_count\": \"passengers\",\n",
|
||||
" \"Fare_Amt\": \"cost\",\n",
|
||||
" \"fare_amount\": \"cost\",\n",
|
||||
" \"Trip_Distance\": \"distance\",\n",
|
||||
" \"trip_distance\": \"distance\"\n",
|
||||
" })\n",
|
||||
" .keep_columns(columns=useful_columns))\n",
|
||||
"tmp_df.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Again, overwrite `yellow_df` with `tmp_df`, and then call the `append_rows()` function on the green taxi data to append the yellow taxi data, creating a new combined dataframe."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"yellow_df = tmp_df\n",
|
||||
"combined_df = green_df.append_rows(other_activities=[yellow_df])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Convert types and filter "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Examine the pickup and drop-off coordinates summary statistics to see how the data is distributed. First define a `TypeConverter` object to change the lat/long fields to decimal type. Next, call the `keep_columns()` function to restrict output to only the lat/long fields, and then call `get_profile()`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"decimal_type = dprep.TypeConverter(data_type=dprep.FieldType.DECIMAL)\n",
|
||||
"combined_df = combined_df.set_column_types(type_conversions={\n",
|
||||
" \"pickup_longitude\": decimal_type,\n",
|
||||
" \"pickup_latitude\": decimal_type,\n",
|
||||
" \"dropoff_longitude\": decimal_type,\n",
|
||||
" \"dropoff_latitude\": decimal_type\n",
|
||||
"})\n",
|
||||
"combined_df.keep_columns(columns=[\n",
|
||||
" \"pickup_longitude\", \"pickup_latitude\", \n",
|
||||
" \"dropoff_longitude\", \"dropoff_latitude\"\n",
|
||||
"]).get_profile()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"From the summary statistics output, you see that there are coordinates that are missing, and coordinates that are not in New York City. Filter out coordinates not in the city border by chaining column filter commands within the `filter()` function, and defining minimum and maximum bounds for each field. Then call `get_profile()` again to verify the transformation."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tmp_df = (combined_df\n",
|
||||
" .drop_nulls(\n",
|
||||
" columns=[\"pickup_longitude\", \"pickup_latitude\", \"dropoff_longitude\", \"dropoff_latitude\"],\n",
|
||||
" column_relationship=dprep.ColumnRelationship(dprep.ColumnRelationship.ANY)\n",
|
||||
" ) \n",
|
||||
" .filter(dprep.f_and(\n",
|
||||
" dprep.col(\"pickup_longitude\") <= -73.72,\n",
|
||||
" dprep.col(\"pickup_longitude\") >= -74.09,\n",
|
||||
" dprep.col(\"pickup_latitude\") <= 40.88,\n",
|
||||
" dprep.col(\"pickup_latitude\") >= 40.53,\n",
|
||||
" dprep.col(\"dropoff_longitude\") <= -73.72,\n",
|
||||
" dprep.col(\"dropoff_longitude\") >= -74.09,\n",
|
||||
" dprep.col(\"dropoff_latitude\") <= 40.88,\n",
|
||||
" dprep.col(\"dropoff_latitude\") >= 40.53\n",
|
||||
" )))\n",
|
||||
"tmp_df.keep_columns(columns=[\n",
|
||||
" \"pickup_longitude\", \"pickup_latitude\", \n",
|
||||
" \"dropoff_longitude\", \"dropoff_latitude\"\n",
|
||||
"]).get_profile()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Overwrite `combined_df` with the transformations you made to `tmp_df`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"combined_df = tmp_df"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Split and rename columns"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Look at the data profile for the `store_forward` column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"combined_df.keep_columns(columns='store_forward').get_profile()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"From the data profile output of `store_forward`, you see that the data is inconsistent and there are missing/null values. Replace these values using the `replace()` and `fill_nulls()` functions, and in both cases change to the string \"N\"."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"combined_df = combined_df.replace(columns=\"store_forward\", find=\"0\", replace_with=\"N\").fill_nulls(\"store_forward\", \"N\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Split the pick up and drop off datetimes into respective date and time columns. Use `split_column_by_example()` to perform the split. In this case, the optional `example` parameter of `split_column_by_example()` is omitted. Therefore the function will automatically determine where to split based on the data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tmp_df = (combined_df\n",
|
||||
" .split_column_by_example(source_column=\"pickup_datetime\")\n",
|
||||
" .split_column_by_example(source_column=\"dropoff_datetime\"))\n",
|
||||
"tmp_df.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Rename the columns generated by `split_column_by_example()` into meaningful names."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tmp_df_renamed = (tmp_df\n",
|
||||
" .rename_columns(column_pairs={\n",
|
||||
" \"pickup_datetime_1\": \"pickup_date\",\n",
|
||||
" \"pickup_datetime_2\": \"pickup_time\",\n",
|
||||
" \"dropoff_datetime_1\": \"dropoff_date\",\n",
|
||||
" \"dropoff_datetime_2\": \"dropoff_time\"\n",
|
||||
" }))\n",
|
||||
"tmp_df_renamed.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Overwrite `combined_df` with the executed transformations, and then call `get_profile()` to see full summary statistics after all transformations."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"combined_df = tmp_df_renamed\n",
|
||||
"combined_df.get_profile()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Feature engineering"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Split the pickup and drop-off date further into day of week, day of month, and month. To get day of week, use the `derive_column_by_example()` function. This function takes as a parameter an array of example objects that define the input data, and the desired output. The function then automatically determines your desired transformation. For pickup and drop-off time columns, split into hour, minute, and second using the `split_column_by_example()` function with no example parameter.\n",
|
||||
"\n",
|
||||
"Once you have generated these new features, delete the original fields in favor of the newly generated features using `drop_columns()`. Rename all remaining fields to accurate descriptions."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tmp_df = (combined_df\n",
|
||||
" .derive_column_by_example(\n",
|
||||
" source_columns=\"pickup_date\", \n",
|
||||
" new_column_name=\"pickup_weekday\", \n",
|
||||
" example_data=[(\"2009-01-04\", \"Sunday\"), (\"2013-08-22\", \"Thursday\")]\n",
|
||||
" )\n",
|
||||
" .derive_column_by_example(\n",
|
||||
" source_columns=\"dropoff_date\",\n",
|
||||
" new_column_name=\"dropoff_weekday\",\n",
|
||||
" example_data=[(\"2013-08-22\", \"Thursday\"), (\"2013-11-03\", \"Sunday\")]\n",
|
||||
" )\n",
|
||||
" \n",
|
||||
" .split_column_by_example(source_column=\"pickup_time\")\n",
|
||||
" .split_column_by_example(source_column=\"dropoff_time\")\n",
|
||||
" # the following two split_column_by_example calls reference the generated column names from the above two calls\n",
|
||||
" .split_column_by_example(source_column=\"pickup_time_1\")\n",
|
||||
" .split_column_by_example(source_column=\"dropoff_time_1\")\n",
|
||||
" .drop_columns(columns=[\n",
|
||||
" \"pickup_date\", \"pickup_time\", \"dropoff_date\", \"dropoff_time\", \n",
|
||||
" \"pickup_date_1\", \"dropoff_date_1\", \"pickup_time_1\", \"dropoff_time_1\"\n",
|
||||
" ])\n",
|
||||
" \n",
|
||||
" .rename_columns(column_pairs={\n",
|
||||
" \"pickup_date_2\": \"pickup_month\",\n",
|
||||
" \"pickup_date_3\": \"pickup_monthday\",\n",
|
||||
" \"pickup_time_1_1\": \"pickup_hour\",\n",
|
||||
" \"pickup_time_1_2\": \"pickup_minute\",\n",
|
||||
" \"pickup_time_2\": \"pickup_second\",\n",
|
||||
" \"dropoff_date_2\": \"dropoff_month\",\n",
|
||||
" \"dropoff_date_3\": \"dropoff_monthday\",\n",
|
||||
" \"dropoff_time_1_1\": \"dropoff_hour\",\n",
|
||||
" \"dropoff_time_1_2\": \"dropoff_minute\",\n",
|
||||
" \"dropoff_time_2\": \"dropoff_second\"\n",
|
||||
" }))\n",
|
||||
"\n",
|
||||
"tmp_df.head(5)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"From the data above, you see that the pickup and drop-off date and time components produced from the derived transformations are correct. Drop the `pickup_datetime` and `dropoff_datetime` columns as they are no longer needed."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tmp_df = tmp_df.drop_columns(columns=[\"pickup_datetime\", \"dropoff_datetime\"])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Use the type inference functionality to automatically check the data type of each field, and display the inference results using `inference_info()`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"type_infer = tmp_df.builders.set_column_types()\n",
|
||||
"type_infer.learn()\n",
|
||||
"type_infer.inference_info"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The inference results look correct based on the data, now apply the type conversions to the dataflow."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tmp_df = type_infer.to_dataflow()\n",
|
||||
"tmp_df.get_profile()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"At this point, you have a fully transformed and prepared dataflow object to use in a machine learning model. The DataPrep SDK includes object serialization functionality, which is used as follows."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dflow_prepared = tmp_df\n",
|
||||
"package = dprep.Package([dflow_prepared])\n",
|
||||
"package.save(\".\\dflow\")"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "cforbe"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.7"
|
||||
},
|
||||
"msauthor": "trbye"
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
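The coordinate filter in the notebook above keeps only trips whose pickup and drop-off points fall inside a bounding box around New York City, after dropping rows with missing coordinates. A minimal plain-Python sketch of the same predicate follows, assuming each record is a dict keyed by the renamed column names. The helper `in_nyc_bounds` and the sample records are hypothetical and are not part of the DataPrep SDK.

```python
# Bounds copied from the filter in the notebook.
LON_MIN, LON_MAX = -74.09, -73.72
LAT_MIN, LAT_MAX = 40.53, 40.88


def in_nyc_bounds(record):
    """Return True if both pickup and drop-off lie inside the bounding box.

    Hypothetical helper: a record with any missing coordinate is rejected,
    mirroring the drop_nulls step that precedes the filter.
    """
    for prefix in ("pickup", "dropoff"):
        lon = record.get(prefix + "_longitude")
        lat = record.get(prefix + "_latitude")
        if lon is None or lat is None:
            return False
        if not (LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX):
            return False
    return True


# Made-up sample records for illustration only.
records = [
    {"pickup_longitude": -73.98, "pickup_latitude": 40.75,
     "dropoff_longitude": -73.95, "dropoff_latitude": 40.78},
    {"pickup_longitude": 0.0, "pickup_latitude": 0.0,
     "dropoff_longitude": -73.95, "dropoff_latitude": 40.78},
    {"pickup_longitude": -73.98, "pickup_latitude": 40.75,
     "dropoff_longitude": -73.95, "dropoff_latitude": None},
]

kept = [r for r in records if in_nyc_bounds(r)]
print(len(kept))
```

Only the first sample record survives: the second has a pickup point outside the box, and the third is missing a drop-off latitude.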
|
||||