Compare commits

3 Commits

Author SHA1 Message Date
Sheri Gilley
184680f1d2 Update img-classification-part1-training.ipynb
updated explanation of datastore
2019-08-20 17:52:45 -05:00
Shané Winner
474f58bd0b Merge pull request #540 from trevorbye/master
removing tutorials for single combined tutorial
2019-08-20 15:22:47 -07:00
Trevor Bye
22c8433897 removing tutorials for single combined tutorial
2019-08-20 12:09:21 -07:00
7 changed files with 666 additions and 1206 deletions

View File

@@ -125,10 +125,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create or Attach existing compute resource\n",
"### Create or attach existing compute target\n",
"By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as your training environment. The code below creates the compute clusters for you if they don't already exist in your workspace.\n",
"\n",
"**Creation of compute takes approximately 5 minutes.** If the AmlCompute with that name is already in your workspace the code will skip the creation process."
"**Creation of compute target takes approximately 5 minutes.** If the AmlCompute with that name is already in your workspace the code will skip the creation process."
]
},
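{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Editorial sketch, not part of the diffed notebook.)* The compute-creation code cell falls outside this diff hunk. Assuming the workspace object `ws` created earlier in the notebook, and using an illustrative cluster name and VM size, creating or attaching an AmlCompute target looks roughly like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import AmlCompute, ComputeTarget\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"cluster_name = \"cpu-cluster\"  # illustrative name, not taken from the notebook\n",
"\n",
"try:\n",
"    # Reuse the compute target if one with this name already exists in the workspace.\n",
"    compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
"    print(\"Found existing compute target.\")\n",
"except ComputeTargetException:\n",
"    # Otherwise provision a new AmlCompute cluster; creation takes about 5 minutes.\n",
"    compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_D2_V2\", max_nodes=4)\n",
"    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
"    compute_target.wait_for_completion(show_output=True)"
]
},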
{
@@ -258,9 +258,9 @@
"\n",
"### Upload data to the cloud\n",
"\n",
"Now make the data accessible remotely by uploading that data from your local machine into Azure so it can be accessed for remote training. The datastore is a convenient construct associated with your workspace for you to upload/download data, and interact with it from your remote compute targets. It is backed by Azure blob storage account.\n",
"You downloaded and used the training data on the computer your notebook is running on. In the next section, you will train a model on the remote Azure Machine Learning Compute. The remote compute resource will also need access to your data. To provide access, upload your data to a centralized datastore associated with your workspace. This datastore provides fast access to your data when using remote compute targets in the cloud, as it is in the Azure data center.\n",
"\n",
"The MNIST files are uploaded into a directory named `mnist` at the root of the datastore. See [access data from your datastores](https://docs.microsoft.com/bs-latn-ba/azure/machine-learning/service/how-to-access-data) for more information."
"Upload the MNIST files into a directory named `mnist` at the root of the datastore: See [access data from your datastores](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data) for more information."
]
},
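{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Editorial sketch, not part of the diffed notebook.)* The upload code cell falls outside this diff hunk. Assuming the workspace object `ws` and an illustrative local folder holding the downloaded MNIST files, uploading to the workspace's default datastore looks roughly like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_folder = \"./data\"  # illustrative local path containing the downloaded MNIST files\n",
"\n",
"# Upload to the `mnist` directory at the root of the workspace's default datastore.\n",
"ds = ws.get_default_datastore()\n",
"ds.upload(src_dir=data_folder, target_path=\"mnist\", overwrite=True, show_progress=True)"
]
},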
{
@@ -690,4 +690,4 @@
},
"nbformat": 4,
"nbformat_minor": 2
}
}

View File

@@ -0,0 +1,654 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/tutorials/regression-part2-automated-ml.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial: Use automated machine learning to predict taxi fares"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, you use automated machine learning in Azure Machine Learning service to create a regression model to predict NYC taxi fare prices. This process accepts training data and configuration settings, and automatically iterates through combinations of different feature normalization/standardization methods, models, and hyperparameter settings to arrive at the best model.\n",
"\n",
"In this tutorial you learn the following tasks:\n",
"\n",
"* Download, transform, and clean data using Azure Open Datasets\n",
"* Train an automated machine learning regression model\n",
"* Calculate model accuracy\n",
"\n",
"If you dont have an Azure subscription, create a free account before you begin. Try the [free or paid version](https://aka.ms/AMLFree) of Azure Machine Learning service today."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Complete the [setup tutorial](https://docs.microsoft.com/azure/machine-learning/service/tutorial-1st-experiment-sdk-setup) if you don't already have an Azure Machine Learning service workspace or notebook virtual machine.\n",
"* After you complete the setup tutorial, open the **tutorials/regression-automated-ml.ipynb** notebook using the same notebook server.\n",
"\n",
"This tutorial is also available on [GitHub](https://github.com/Azure/MachineLearningNotebooks/tree/master/tutorials) if you wish to run it in your own [local environment](https://docs.microsoft.com/azure/machine-learning/service/how-to-configure-environment#local). Run `pip install azureml-sdk[automl] azureml-opendatasets azureml-widgets` to get the required packages."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download and prepare data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the necessary packages. The Open Datasets package contains a class representing each data source (`NycTlcGreen` for example) to easily filter date parameters before downloading."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.opendatasets import NycTlcGreen\n",
"import pandas as pd\n",
"from datetime import datetime\n",
"from dateutil.relativedelta import relativedelta"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Begin by creating a dataframe to hold the taxi data. When working in a non-Spark environment, Open Datasets only allows downloading one month of data at a time with certain classes to avoid `MemoryError` with large datasets. To download taxi data, iteratively fetch one month at a time, and before appending it to `green_taxi_df` randomly sample 2,000 records from each month to avoid bloating the dataframe. Then preview the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"green_taxi_df = pd.DataFrame([])\n",
"start = datetime.strptime(\"1/1/2015\",\"%m/%d/%Y\")\n",
"end = datetime.strptime(\"1/31/2015\",\"%m/%d/%Y\")\n",
"\n",
"for sample_month in range(12):\n",
" temp_df_green = NycTlcGreen(start + relativedelta(months=sample_month), end + relativedelta(months=sample_month)) \\\n",
" .to_pandas_dataframe()\n",
" green_taxi_df = green_taxi_df.append(temp_df_green.sample(2000))\n",
" \n",
"green_taxi_df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the initial data is loaded, define a function to create various time-based features from the pickup datetime field. This will create new fields for the month number, day of month, day of week, and hour of day, and will allow the model to factor in time-based seasonality. \n",
"\n",
"Use the `apply()` function on the dataframe to iteratively apply the `build_time_features()` function to each row in the taxi data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def build_time_features(vector):\n",
" pickup_datetime = vector[0]\n",
" month_num = pickup_datetime.month\n",
" day_of_month = pickup_datetime.day\n",
" day_of_week = pickup_datetime.weekday()\n",
" hour_of_day = pickup_datetime.hour\n",
" \n",
" return pd.Series((month_num, day_of_month, day_of_week, hour_of_day))\n",
"\n",
"green_taxi_df[[\"month_num\", \"day_of_month\",\"day_of_week\", \"hour_of_day\"]] = green_taxi_df[[\"lpepPickupDatetime\"]].apply(build_time_features, axis=1)\n",
"green_taxi_df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remove some of the columns that you won't need for training or additional feature building."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"columns_to_remove = [\"lpepPickupDatetime\", \"lpepDropoffDatetime\", \"puLocationId\", \"doLocationId\", \"extra\", \"mtaTax\",\n",
" \"improvementSurcharge\", \"tollsAmount\", \"ehailFee\", \"tripType\", \"rateCodeID\", \n",
" \"storeAndFwdFlag\", \"paymentType\", \"fareAmount\", \"tipAmount\"\n",
" ]\n",
"for col in columns_to_remove:\n",
" green_taxi_df.pop(col)\n",
" \n",
"green_taxi_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cleanse data "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the `describe()` function on the new dataframe to see summary statistics for each field."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"green_taxi_df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the summary statistics, you see that there are several fields that have outliers or values that will reduce model accuracy. First filter the lat/long fields to be within the bounds of the Manhattan area. This will filter out longer taxi trips or trips that are outliers in respect to their relationship with other features. \n",
"\n",
"Additionally filter the `tripDistance` field to be greater than zero but less than 31 miles (the haversine distance between the two lat/long pairs). This eliminates long outlier trips that have inconsistent trip cost.\n",
"\n",
"Lastly, the `totalAmount` field has negative values for the taxi fares, which don't make sense in the context of our model, and the `passengerCount` field has bad data with the minimum values being zero.\n",
"\n",
"Filter out these anomalies using query functions, and then remove the last few columns unnecessary for training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"final_df = green_taxi_df.query(\"pickupLatitude>=40.53 and pickupLatitude<=40.88\")\n",
"final_df = final_df.query(\"pickupLongitude>=-74.09 and pickupLongitude<=-73.72\")\n",
"final_df = final_df.query(\"tripDistance>=0.25 and tripDistance<31\")\n",
"final_df = final_df.query(\"passengerCount>0 and totalAmount>0\")\n",
"\n",
"columns_to_remove_for_training = [\"pickupLongitude\", \"pickupLatitude\", \"dropoffLongitude\", \"dropoffLatitude\"]\n",
"for col in columns_to_remove_for_training:\n",
" final_df.pop(col)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Call `describe()` again on the data to ensure cleansing worked as expected. You now have a prepared and cleansed set of taxi, holiday, and weather data to use for machine learning model training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"final_df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure workspace\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a workspace object from the existing workspace. A [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. `Workspace.from_config()` reads the file **config.json** and loads the authentication details into an object named `ws`. `ws` is used throughout the rest of the code in this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.workspace import Workspace\n",
"ws = Workspace.from_config()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Split the data into train and test sets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the data into training and test sets by using the `train_test_split` function in the `scikit-learn` library. This function segregates the data into the x (**features**) data set for model training and the y (**values to predict**) data set for testing. The `test_size` parameter determines the percentage of data to allocate to testing. The `random_state` parameter sets a seed to the random generator, so that your train-test splits are deterministic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"y_df = final_df.pop(\"totalAmount\")\n",
"x_df = final_df\n",
"\n",
"x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=223)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The purpose of this step is to have data points to test the finished model that haven't been used to train the model, in order to measure true accuracy. \n",
"\n",
"In other words, a well-trained model should be able to accurately make predictions from data it hasn't already seen. You now have data prepared for auto-training a machine learning model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Automatically train a model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To automatically train a model, take the following steps:\n",
"1. Define settings for the experiment run. Attach your training data to the configuration, and modify settings that control the training process.\n",
"1. Submit the experiment for model tuning. After submitting the experiment, the process iterates through different machine learning algorithms and hyperparameter settings, adhering to your defined constraints. It chooses the best-fit model by optimizing an accuracy metric."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define training settings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define the experiment parameter and model settings for training. View the full list of [settings](https://docs.microsoft.com/azure/machine-learning/service/how-to-configure-auto-train). Submitting the experiment with these default settings will take approximately 5-10 min, but if you want a shorter run time, reduce the `iterations` parameter.\n",
"\n",
"\n",
"|Property| Value in this tutorial |Description|\n",
"|----|----|---|\n",
"|**iteration_timeout_minutes**|2|Time limit in minutes for each iteration. Reduce this value to decrease total runtime.|\n",
"|**iterations**|20|Number of iterations. In each iteration, a new machine learning model is trained with your data. This is the primary value that affects total run time.|\n",
"|**primary_metric**| spearman_correlation | Metric that you want to optimize. The best-fit model will be chosen based on this metric.|\n",
"|**preprocess**| True | By using **True**, the experiment can preprocess the input data (handling missing data, converting text to numeric, etc.)|\n",
"|**verbosity**| logging.INFO | Controls the level of logging.|\n",
"|**n_cross_validations**|5|Number of cross-validation splits to perform when validation data is not specified.|"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"\n",
"automl_settings = {\n",
" \"iteration_timeout_minutes\": 2,\n",
" \"iterations\": 20,\n",
" \"primary_metric\": 'spearman_correlation',\n",
" \"preprocess\": True,\n",
" \"verbosity\": logging.INFO,\n",
" \"n_cross_validations\": 5\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use your defined training settings as a `**kwargs` parameter to an `AutoMLConfig` object. Additionally, specify your training data and the type of model, which is `regression` in this case."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.train.automl import AutoMLConfig\n",
"\n",
"automl_config = AutoMLConfig(task='regression',\n",
" debug_log='automated_ml_errors.log',\n",
" X=x_train.values,\n",
" y=y_train.values.flatten(),\n",
" **automl_settings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Automated machine learning pre-processing steps (feature normalization, handling missing data, converting text to numeric, etc.) become part of the underlying model. When using the model for predictions, the same pre-processing steps applied during training are applied to your input data automatically."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train the automatic regression model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create an experiment object in your workspace. An experiment acts as a container for your individual runs. Pass the defined `automl_config` object to the experiment, and set the output to `True` to view progress during the run. \n",
"\n",
"After starting the experiment, the output shown updates live as the experiment runs. For each iteration, you see the model type, the run duration, and the training accuracy. The field `BEST` tracks the best running training score based on your metric type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.experiment import Experiment\n",
"experiment = Experiment(ws, \"taxi-experiment\")\n",
"local_run = experiment.submit(automl_config, show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore the results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Explore the results of automatic training with a [Jupyter widget](https://docs.microsoft.com/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py). The widget allows you to see a graph and table of all individual run iterations, along with training accuracy metrics and metadata. Additionally, you can filter on different accuracy metrics than your primary metric with the dropdown selector."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(local_run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve the best model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select the best model from your iterations. The `get_output` function returns the best run and the fitted model for the last fit invocation. By using the overloads on `get_output`, you can retrieve the best run and fitted model for any logged metric or a particular iteration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run, fitted_model = local_run.get_output()\n",
"print(best_run)\n",
"print(fitted_model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test the best model accuracy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the best model to run predictions on the test data set to predict taxi fares. The function `predict` uses the best model and predicts the values of y, **trip cost**, from the `x_test` data set. Print the first 10 predicted cost values from `y_predict`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_predict = fitted_model.predict(x_test.values)\n",
"print(y_predict[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate the `root mean squared error` of the results. Convert the `y_test` dataframe to a list to compare to the predicted values. The function `mean_squared_error` takes two arrays of values and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable, **cost**. It indicates roughly how far the taxi fare predictions are from the actual fares."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"from math import sqrt\n",
"\n",
"y_actual = y_test.values.flatten().tolist()\n",
"rmse = sqrt(mean_squared_error(y_actual, y_predict))\n",
"rmse"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following code to calculate mean absolute percent error (MAPE) by using the full `y_actual` and `y_predict` data sets. This metric calculates an absolute difference between each predicted and actual value and sums all the differences. Then it expresses that sum as a percent of the total of the actual values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sum_actuals = sum_errors = 0\n",
"\n",
"for actual_val, predict_val in zip(y_actual, y_predict):\n",
" abs_error = actual_val - predict_val\n",
" if abs_error < 0:\n",
" abs_error = abs_error * -1\n",
"\n",
" sum_errors = sum_errors + abs_error\n",
" sum_actuals = sum_actuals + actual_val\n",
"\n",
"mean_abs_percent_error = sum_errors / sum_actuals\n",
"print(\"Model MAPE:\")\n",
"print(mean_abs_percent_error)\n",
"print()\n",
"print(\"Model Accuracy:\")\n",
"print(1 - mean_abs_percent_error)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the two prediction accuracy metrics, you see that the model is fairly good at predicting taxi fares from the data set's features, typically within +- $4.00, and approximately 15% error. \n",
"\n",
"The traditional machine learning model development process is highly resource-intensive, and requires significant domain knowledge and time investment to run and compare the results of dozens of models. Using automated machine learning is a great way to rapidly test many different models for your scenario."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean up resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do not complete this section if you plan on running other Azure Machine Learning service tutorials."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stop the notebook VM"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you used a cloud notebook server, stop the VM when you are not using it to reduce cost."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. In your workspace, select **Notebook VMs**.\n",
"1. From the list, select the VM.\n",
"1. Select **Stop**.\n",
"1. When you're ready to use the server again, select **Start**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Delete everything"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you don't plan to use the resources you created, delete them, so you don't incur any charges."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. In the Azure portal, select **Resource groups** on the far left.\n",
"1. From the list, select the resource group you created.\n",
"1. Select **Delete resource group**.\n",
"1. Enter the resource group name. Then select **Delete**.\n",
"\n",
"You can also keep the resource group but delete a single workspace. Display the workspace properties and select **Delete**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this automated machine learning tutorial, you did the following tasks:\n",
"\n",
"> * Configured a workspace and prepared data for an experiment.\n",
 * Trained by using">
"> * Trained a regression model locally by using automated machine learning with custom parameters.\n",
"> * Explored and reviewed training results.\n",
"\n",
"[Deploy your model](https://docs.microsoft.com/azure/machine-learning/service/tutorial-deploy-models-with-aml) with Azure Machine Learning service."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"authors": [
{
"name": "jeffshep"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"msauthor": "trbye"
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -0,0 +1,7 @@
name: regression-automated-ml
dependencies:
- pip:
- azureml-sdk
- azureml-train-automl
- azureml-widgets
- azureml-opendatasets

View File

@@ -1,637 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial: Prepare data for regression modeling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, you learn how to prepare data for regression modeling by using the Azure Machine Learning Data Prep SDK. You run various transformations to filter and combine two different NYC taxi data sets.\n",
"\n",
"This tutorial is **part one of a two-part tutorial series**. After you complete the tutorial series, you can predict the cost of a taxi trip by training a model on data features. These features include the pickup day and time, the number of passengers, and the pickup location.\n",
"\n",
"In this tutorial, you:\n",
"\n",
"\n",
"> * Setup a Python environment and import packages\n",
"> * Load two datasets with different field names\n",
"> * Cleanse data to remove anomalies\n",
"> * Transform data using intelligent transforms to create new features\n",
"> * Save your dataflow object to use in a regression model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"To run the notebook you will need:\n",
"\n",
"* A Python 3.6 notebook server with the following installed:\n",
" * The Azure Machine Learning Data Prep SDK for Python\n",
"* The tutorial notebook\n",
"\n",
"Navigate back to the [tutorial page](https://docs.microsoft.com/azure/machine-learning/service/tutorial-data-prep) for specific environment setup instructions.\n",
"\n",
"## <a name=\"start\"></a>Set up your development environment\n",
"\n",
"All the setup for your development work can be accomplished in a Python notebook. Setup includes the following actions:\n",
"\n",
"* Install the SDK\n",
"* Import Python packages\n",
"\n",
"### Install and import packages\n",
"\n",
"Use the following to install necessary packages if you don't already have them.\n",
"\n",
"```shell\n",
"pip install \"azureml-dataprep[pandas]>=1.1.2,<1.2.0\"\n",
"```\n",
"\n",
"Import the SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data\n",
"Download two different NYC Taxi data sets into dataflow objects. These datasets contain slightly different fields. The method `auto_read_file()` automatically recognizes the input file type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import display\n",
"green_path = \"https://dprepdata.blob.core.windows.net/demo/green-small/*\"\n",
"yellow_path = \"https://dprepdata.blob.core.windows.net/demo/yellow-small/*\"\n",
"# (optional) Download and view a subset of the data: https://dprepdata.blob.core.windows.net/demo/green-small/green_tripdata_2013-08.csv\n",
"\n",
"green_df_raw = dprep.read_csv(path=green_path, header=dprep.PromoteHeadersMode.GROUPED)\n",
"# auto_read_file automatically identifies and parses the file type, which is useful when you don't know the file type.\n",
"yellow_df_raw = dprep.auto_read_file(path=yellow_path)\n",
"\n",
"display(green_df_raw.head(5))\n",
"display(yellow_df_raw.head(5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `Dataflow` object is similar to a dataframe, and represents a series of lazily-evaluated, immutable operations on data. Operations can be added by invoking the different transformation and filtering methods available. The result of adding an operation to a `Dataflow` is always a new `Dataflow` object."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cleanse data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you populate some variables with shortcut transforms to apply to all dataflows. The `drop_if_all_null` variable is used to delete records where all fields are null. The `useful_columns` variable holds an array of column descriptions that are kept in each dataflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_columns = dprep.ColumnSelector(term=\".*\", use_regex=True)\n",
"drop_if_all_null = [all_columns, dprep.ColumnRelationship(dprep.ColumnRelationship.ALL)]\n",
"useful_columns = [\n",
" \"cost\", \"distance\", \"dropoff_datetime\", \"dropoff_latitude\", \"dropoff_longitude\",\n",
" \"passengers\", \"pickup_datetime\", \"pickup_latitude\", \"pickup_longitude\", \"store_forward\", \"vendor\"\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You first work with the green taxi data to get it into a valid shape that can be combined with the yellow taxi data. Call the `replace_na()`, `drop_nulls()`, and `keep_columns()` functions by using the shortcut transform variables you created. Additionally, rename all the columns in the dataframe to match the names in the `useful_columns` variable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"green_df = (green_df_raw\n",
" .replace_na(columns=all_columns)\n",
" .drop_nulls(*drop_if_all_null)\n",
" .rename_columns(column_pairs={\n",
" \"VendorID\": \"vendor\",\n",
" \"lpep_pickup_datetime\": \"pickup_datetime\",\n",
" \"Lpep_dropoff_datetime\": \"dropoff_datetime\",\n",
" \"lpep_dropoff_datetime\": \"dropoff_datetime\",\n",
" \"Store_and_fwd_flag\": \"store_forward\",\n",
" \"store_and_fwd_flag\": \"store_forward\",\n",
" \"Pickup_longitude\": \"pickup_longitude\",\n",
" \"Pickup_latitude\": \"pickup_latitude\",\n",
" \"Dropoff_longitude\": \"dropoff_longitude\",\n",
" \"Dropoff_latitude\": \"dropoff_latitude\",\n",
" \"Passenger_count\": \"passengers\",\n",
" \"Fare_amount\": \"cost\",\n",
" \"Trip_distance\": \"distance\"\n",
" })\n",
" .keep_columns(columns=useful_columns))\n",
"green_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the same transformation steps on the yellow taxi data. These functions ensure that null data is removed from the data set, which will help increase machine learning model accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"yellow_df = (yellow_df_raw\n",
" .replace_na(columns=all_columns)\n",
" .drop_nulls(*drop_if_all_null)\n",
" .rename_columns(column_pairs={\n",
" \"vendor_name\": \"vendor\",\n",
" \"VendorID\": \"vendor\",\n",
" \"vendor_id\": \"vendor\",\n",
" \"Trip_Pickup_DateTime\": \"pickup_datetime\",\n",
" \"tpep_pickup_datetime\": \"pickup_datetime\",\n",
" \"Trip_Dropoff_DateTime\": \"dropoff_datetime\",\n",
" \"tpep_dropoff_datetime\": \"dropoff_datetime\",\n",
" \"store_and_forward\": \"store_forward\",\n",
" \"store_and_fwd_flag\": \"store_forward\",\n",
" \"Start_Lon\": \"pickup_longitude\",\n",
" \"Start_Lat\": \"pickup_latitude\",\n",
" \"End_Lon\": \"dropoff_longitude\",\n",
" \"End_Lat\": \"dropoff_latitude\",\n",
" \"Passenger_Count\": \"passengers\",\n",
" \"passenger_count\": \"passengers\",\n",
" \"Fare_Amt\": \"cost\",\n",
" \"fare_amount\": \"cost\",\n",
" \"Trip_Distance\": \"distance\",\n",
" \"trip_distance\": \"distance\"\n",
" })\n",
" .keep_columns(columns=useful_columns))\n",
"yellow_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Call the `append_rows()` function on the green taxi data to append the yellow taxi data. A new combined dataframe is created."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"combined_df = green_df.append_rows([yellow_df])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Convert types and filter "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examine the pickup and drop-off coordinates summary statistics to see how the data is distributed. First, define a `TypeConverter` object to change the latitude and longitude fields to decimal type. Next, call the `keep_columns()` function to restrict output to only the latitude and longitude fields, and then call the `get_profile()` function. These function calls create a condensed view of the dataflow to just show the lat/long fields, which makes it easier to evaluate missing or out-of-scope coordinates."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"decimal_type = dprep.TypeConverter(data_type=dprep.FieldType.DECIMAL)\n",
"combined_df = combined_df.set_column_types(type_conversions={\n",
" \"pickup_longitude\": decimal_type,\n",
" \"pickup_latitude\": decimal_type,\n",
" \"dropoff_longitude\": decimal_type,\n",
" \"dropoff_latitude\": decimal_type\n",
"})\n",
"combined_df.keep_columns(columns=[\n",
" \"pickup_longitude\", \"pickup_latitude\",\n",
" \"dropoff_longitude\", \"dropoff_latitude\"\n",
"]).get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the summary statistics output, you see there are missing coordinates and coordinates that aren't in New York City (this is determined from subjective analysis). Filter out coordinates for locations that are outside the city border. Chain the column filter commands within the `filter()` function and define the minimum and maximum bounds for each field. Then call the `get_profile()` function again to verify the transformation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"latlong_filtered_df = (combined_df\n",
" .drop_nulls(\n",
" columns=[\"pickup_longitude\", \"pickup_latitude\", \"dropoff_longitude\", \"dropoff_latitude\"],\n",
" column_relationship=dprep.ColumnRelationship(dprep.ColumnRelationship.ANY)\n",
" )\n",
" .filter(dprep.f_and(\n",
" dprep.col(\"pickup_longitude\") <= -73.72,\n",
" dprep.col(\"pickup_longitude\") >= -74.09,\n",
" dprep.col(\"pickup_latitude\") <= 40.88,\n",
" dprep.col(\"pickup_latitude\") >= 40.53,\n",
" dprep.col(\"dropoff_longitude\") <= -73.72,\n",
" dprep.col(\"dropoff_longitude\") >= -74.09,\n",
" dprep.col(\"dropoff_latitude\") <= 40.88,\n",
" dprep.col(\"dropoff_latitude\") >= 40.53\n",
" )))\n",
"latlong_filtered_df.keep_columns(columns=[\n",
" \"pickup_longitude\", \"pickup_latitude\",\n",
" \"dropoff_longitude\", \"dropoff_latitude\"\n",
"]).get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Split and rename columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at the data profile for the `store_forward` column. This field is a boolean flag that is `Y` when the taxi did not have a connection to the server after the trip, and thus had to store the trip data in memory, and later forward it to the server when connected."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"latlong_filtered_df.keep_columns(columns='store_forward').get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the data profile output in the `store_forward` column shows that the data is inconsistent and there are missing or null values. Use the `replace()` and `fill_nulls()` functions to replace these values with the string \"N\":"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"replaced_stfor_vals_df = latlong_filtered_df.replace(columns=\"store_forward\", find=\"0\", replace_with=\"N\").fill_nulls(\"store_forward\", \"N\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute the `replace` function on the `distance` field. The function reformats distance values that are incorrectly labeled as `.00`, and fills any nulls with zeros. Convert the `distance` field to numerical format. These incorrect data points are likely anomolies in the data collection system on the taxi cabs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"replaced_distance_vals_df = replaced_stfor_vals_df.replace(columns=\"distance\", find=\".00\", replace_with=0).fill_nulls(\"distance\", 0)\n",
"replaced_distance_vals_df = replaced_distance_vals_df.to_number([\"distance\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the pickup and dropoff datetime values into the respective date and time columns. Use the `split_column_by_example()` function to make the split. In this case, the optional `example` parameter of the `split_column_by_example()` function is omitted. Therefore, the function automatically determines where to split based on the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time_split_df = (replaced_distance_vals_df\n",
" .split_column_by_example(source_column=\"pickup_datetime\")\n",
" .split_column_by_example(source_column=\"dropoff_datetime\"))\n",
"time_split_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rename the columns generated by `split_column_by_example()` into meaningful names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"renamed_col_df = (time_split_df\n",
" .rename_columns(column_pairs={\n",
" \"pickup_datetime_1\": \"pickup_date\",\n",
" \"pickup_datetime_2\": \"pickup_time\",\n",
" \"dropoff_datetime_1\": \"dropoff_date\",\n",
" \"dropoff_datetime_2\": \"dropoff_time\"\n",
" }))\n",
"renamed_col_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Call the `get_profile()` function to see the full summary statistics after all cleansing steps."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"renamed_col_df.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transform data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the pickup and dropoff date further into the day of the week, day of the month, and month values. To get the day of the week value, use the `derive_column_by_example()` function. The function takes an array parameter of example objects that define the input data, and the preferred output. The function automatically determines your preferred transformation. For the pickup and dropoff time columns, split the time into the hour, minute, and second by using the `split_column_by_example()` function with no example parameter.\n",
"\n",
"After you generate the new features, use the `drop_columns()` function to delete the original fields as the newly generated features are preferred. Rename the rest of the fields to use meaningful descriptions.\n",
"\n",
"Transforming the data in this way to create new time-based features will improve machine learning model accuracy. For example, generating a new feature for the weekday will help establish a relationship between the day of the week and the taxi fare price, which is often more expensive on certain days of the week due to high demand."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"transformed_features_df = (renamed_col_df\n",
" .derive_column_by_example(\n",
" source_columns=\"pickup_date\",\n",
" new_column_name=\"pickup_weekday\",\n",
" example_data=[(\"2009-01-04\", \"Sunday\"), (\"2013-08-22\", \"Thursday\")]\n",
" )\n",
" .derive_column_by_example(\n",
" source_columns=\"dropoff_date\",\n",
" new_column_name=\"dropoff_weekday\",\n",
" example_data=[(\"2013-08-22\", \"Thursday\"), (\"2013-11-03\", \"Sunday\")]\n",
" )\n",
"\n",
" .split_column_by_example(source_column=\"pickup_time\")\n",
" .split_column_by_example(source_column=\"dropoff_time\")\n",
" # The following two calls to split_column_by_example reference the column names generated from the previous two calls.\n",
" .split_column_by_example(source_column=\"pickup_time_1\")\n",
" .split_column_by_example(source_column=\"dropoff_time_1\")\n",
" .drop_columns(columns=[\n",
" \"pickup_date\", \"pickup_time\", \"dropoff_date\", \"dropoff_time\",\n",
" \"pickup_date_1\", \"dropoff_date_1\", \"pickup_time_1\", \"dropoff_time_1\"\n",
" ])\n",
"\n",
" .rename_columns(column_pairs={\n",
" \"pickup_date_2\": \"pickup_month\",\n",
" \"pickup_date_3\": \"pickup_monthday\",\n",
" \"pickup_time_1_1\": \"pickup_hour\",\n",
" \"pickup_time_1_2\": \"pickup_minute\",\n",
" \"pickup_time_2\": \"pickup_second\",\n",
" \"dropoff_date_2\": \"dropoff_month\",\n",
" \"dropoff_date_3\": \"dropoff_monthday\",\n",
" \"dropoff_time_1_1\": \"dropoff_hour\",\n",
" \"dropoff_time_1_2\": \"dropoff_minute\",\n",
" \"dropoff_time_2\": \"dropoff_second\"\n",
" }))\n",
"\n",
"transformed_features_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the data shows that the pickup and dropoff date and time components produced from the derived transformations are correct. Drop the `pickup_datetime` and `dropoff_datetime` columns because they're no longer needed (granular time features like hour, minute and second are more useful for model training)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"processed_df = transformed_features_df.drop_columns(columns=[\"pickup_datetime\", \"dropoff_datetime\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the type inference functionality to automatically check the data type of each field, and display the inference results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type_infer = processed_df.builders.set_column_types()\n",
"type_infer.learn()\n",
"type_infer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The inference results look correct based on the data. Now apply the type conversions to the dataflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type_converted_df = type_infer.to_dataflow()\n",
"type_converted_df.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before you package the dataflow, run two final filters on the data set. To eliminate incorrectly captured data points, filter the dataflow on records where both the `cost` and `distance` variable values are greater than zero. This step will significantly improve machine learning model accuracy, because data points with a zero cost or distance represent major outliers that throw off prediction accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"final_df = type_converted_df.filter(dprep.col(\"distance\") > 0)\n",
"final_df = final_df.filter(dprep.col(\"cost\") > 0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You now have a fully transformed and prepared dataflow object to use in a machine learning model. The SDK includes object serialization functionality, which is used as shown in the following code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"file_path = os.path.join(os.getcwd(), \"dflows.dprep\")\n",
"\n",
"final_df.save(file_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean up resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To continue with part two of the tutorial, you need the **dflows.dprep** file in the current directory.\n",
"\n",
"If you don't plan to continue to part two, delete the **dflows.dprep** file in your current directory. Delete this file whether you're running the execution locally or in [Azure Notebooks](https://notebooks.azure.com/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this Azure Machine Learning Data Prep SDK tutorial, you:\n",
"\n",
"> * Set up your development environment\n",
"> * Loaded and cleansed data sets\n",
"> * Used smart transforms to predict your logic based on an example\n",
"> * Merged and packaged datasets for machine learning training\n",
"\n",
"You are ready to use this training data in the next part of the tutorial series:\n",
"\n",
"\n",
"> [Tutorial #2: Train regression model](regression-part2-automated-ml.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/tutorials/regression-part1-data-prep.png)"
]
}
],
"metadata": {
"authors": [
{
"name": "cforbe"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"msauthor": "trbye"
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,5 +0,0 @@
name: regression-part1-data-prep
dependencies:
- pip:
- azureml-sdk
- azureml-dataprep[pandas]>=1.1.2,<1.2.0

View File

@@ -1,549 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial: Use automated machine learning to build your regression model\n",
"\n",
"This tutorial is **part two of a two-part tutorial series**. In the previous tutorial, you [prepared the NYC taxi data for regression modeling](regression-part1-data-prep.ipynb).\n",
"\n",
"Now you're ready to start building your model with Azure Machine Learning service. In this part of the tutorial, you use the prepared data and automatically generate a regression model to predict taxi fare prices. By using the automated machine learning capabilities of the service, you define your machine learning goals and constraints. You launch the automated machine learning process. Then allow the algorithm selection and hyperparameter tuning to happen for you. The automated machine learning technique iterates over many combinations of algorithms and hyperparameters until it finds the best model based on your criterion.\n",
"\n",
"In this tutorial, you learn the following tasks:\n",
"\n",
"> * Set up a Python environment and import the SDK packages\n",
"> * Configure an Azure Machine Learning service workspace\n",
"> * Auto-train a regression model \n",
"> * Run the model locally with custom parameters\n",
"> * Explore the results\n",
"\n",
"If you do not have an Azure subscription, create a [free account](https://aka.ms/AMLfree) before you begin. \n",
"\n",
"> Code in this article was tested with Azure Machine Learning SDK version 1.0.0\n",
"\n",
"\n",
"## Prerequisites\n",
"\n",
"To run the notebook you will need:\n",
"\n",
"* [Run the data preparation tutorial](regression-part1-data-prep.ipynb).\n",
"* A Python 3.6 notebook server with the following installed:\n",
" * The Azure Machine Learning SDK for Python with `automl` and `notebooks` extras\n",
" * `matplotlib`\n",
"* The tutorial notebook\n",
"* A machine learning workspace\n",
"* The configuration file for the workspace in the same directory as the notebook\n",
"\n",
"Navigate back to the [tutorial page](https://docs.microsoft.com/azure/machine-learning/service/tutorial-auto-train-models) for specific environment setup instructions.\n",
"\n",
"## <a name=\"start\"></a>Set up your development environment\n",
"\n",
"All the setup for your development work can be accomplished in a Python notebook. Setup includes the following actions:\n",
"\n",
"* Install the SDK\n",
"* Import Python packages\n",
"* Configure your workspace"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install and import packages\n",
"\n",
"If you are following the tutorial in your own Python environment, use the following to install necessary packages.\n",
"\n",
"```shell\n",
"pip install azureml-sdk[automl,notebooks] matplotlib\n",
"```\n",
"\n",
"Import the Python packages you need in this tutorial:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.core\n",
"import pandas as pd\n",
"from azureml.core.workspace import Workspace\n",
"import logging\n",
"import os"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure workspace\n",
"\n",
"Create a workspace object from the existing workspace. A `Workspace` is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs.\n",
"\n",
"`Workspace.from_config()` reads the file **aml_config/config.json** and loads the details into an object named `ws`. `ws` is used throughout the rest of the code in this tutorial.\n",
"\n",
"After you have a workspace object, specify a name for the experiment. Create and register a local directory with the workspace. The history of all runs is recorded under the specified experiment and in the [Azure portal](https://portal.azure.com)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"# choose a name for the run history container in the workspace\n",
"experiment_name = 'automated-ml-regression'\n",
"# project folder\n",
"project_folder = './automated-ml-regression'\n",
"\n",
"output = {}\n",
"output['SDK version'] = azureml.core.VERSION\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Project Directory'] = project_folder\n",
"pd.set_option('display.max_colwidth', -1)\n",
"outputDf = pd.DataFrame(data = output, index = [''])\n",
"outputDf.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore data\n",
"\n",
"Use the data flow object created in the previous tutorial. To summarize, part 1 of this tutorial cleaned the NYC Taxi data so it could be used in a machine learning model. Now, you use various features from the data set and allow an automated model to build relationships between the features and the price of a taxi trip. Open and run the data flow and review the results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep\n",
"\n",
"file_path = os.path.join(os.getcwd(), \"dflows.dprep\")\n",
"\n",
"dflow_prepared = dprep.Dataflow.open(file_path)\n",
"dflow_prepared.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You prepare the data for the experiment by adding columns to `dflow_x` to be features for our model creation. You define `dflow_y` to be our prediction value, **cost**:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_X = dflow_prepared.keep_columns(['pickup_weekday','pickup_hour', 'distance','passengers', 'vendor'])\n",
"dflow_y = dflow_prepared.keep_columns('cost')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Split data into train and test sets\n",
"\n",
"Now you split the data into training and test sets by using the `train_test_split` function in the `sklearn` library. This function segregates the data into the x, **features**, dataset for model training and the y, **values to predict**, dataset for testing. The `test_size` parameter determines the percentage of data to allocate to testing. The `random_state` parameter sets a seed to the random generator, so that your train-test splits are always deterministic:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"\n",
"x_df = dflow_X.to_pandas_dataframe()\n",
"y_df = dflow_y.to_pandas_dataframe()\n",
"\n",
"x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=223)\n",
"# flatten y_train to 1d array\n",
"y_train.values.flatten()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The purpose of this step is to have data points to test the finished model that haven't been used to train the model, in order to measure true accuracy. In other words, a well-trained model should be able to accurately make predictions from data it hasn't already seen. You now have the necessary packages and data ready for autotraining your model.\n",
"\n",
"## Automatically train a model\n",
"\n",
"To automatically train a model, take the following steps:\n",
"1. Define settings for the experiment run. Attach your training data to the configuration, and modify settings that control the training process.\n",
"1. Submit the experiment for model tuning. After submitting the experiment, the process iterates through different machine learning algorithms and hyperparameter settings, adhering to your defined constraints. It chooses the best-fit model by optimizing an accuracy metric.\n",
"\n",
"\n",
"### Define settings for autogeneration and tuning\n",
"\n",
"Define the experiment parameters and models settings for autogeneration and tuning. View the full list of [settings](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train). Submitting the experiment with these default settings will take approximately 10-15 min, but if you want a shorter run time, reduce either `iterations` or `iteration_timeout_minutes`.\n",
"\n",
"\n",
"|Property| Value in this tutorial |Description|\n",
"|----|----|---|\n",
"|**iteration_timeout_minutes**|10|Time limit in minutes for each iteration. Reduce this value to decrease total runtime.|\n",
"|**iterations**|30|Number of iterations. In each iteration, a new machine learning model is trained with your data. This is the primary value that affects total run time.|\n",
"|**primary_metric**|spearman_correlation | Metric that you want to optimize. The best-fit model will be chosen based on this metric.|\n",
"|**preprocess**| True | By using **True**, the experiment can preprocess the input data (handling missing data, converting text to numeric, etc.)|\n",
"|**verbosity**| logging.INFO | Controls the level of logging.|\n",
"|**n_cross_validationss**|5| Number of cross-validation splits to perform when validation data is not specified.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_settings = {\n",
" \"iteration_timeout_minutes\" : 10,\n",
" \"iterations\" : 30,\n",
" \"primary_metric\" : 'spearman_correlation',\n",
" \"preprocess\" : True,\n",
" \"verbosity\" : logging.INFO,\n",
" \"n_cross_validations\": 5\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use your defined training settings as a parameter to an `AutoMLConfig` object. Additionally, specify your training data and the type of model, which is `regression` in this case."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"configure automl"
]
},
"outputs": [],
"source": [
"from azureml.train.automl import AutoMLConfig\n",
"\n",
"# local compute \n",
"automated_ml_config = AutoMLConfig(task = 'regression',\n",
" debug_log = 'automated_ml_errors.log',\n",
" path = project_folder,\n",
" X = x_train.values,\n",
" y = y_train.values.flatten(),\n",
" **automl_settings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train the automatic regression model\n",
"\n",
"Start the experiment to run locally. Pass the defined `automated_ml_config` object to the experiment. Set the output to `True` to view progress during the experiment:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"local submitted run",
"automl"
]
},
"outputs": [],
"source": [
"from azureml.core.experiment import Experiment\n",
"experiment=Experiment(ws, experiment_name)\n",
"local_run = experiment.submit(automated_ml_config, show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output shown updates live as the experiment runs. For each iteration, you see the model type, the run duration, and the training accuracy. The field `BEST` tracks the best running training score based on your metric type."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore the results\n",
"\n",
"Explore the results of automatic training with a Jupyter widget or by examining the experiment history.\n",
"\n",
"### Option 1: Add a Jupyter widget to see results\n",
"\n",
"If you use a Jupyter notebook, use this Jupyter notebook widget to see a graph and a table of all results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"use notebook widget"
]
},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(local_run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option 2: Get and examine all run iterations in Python\n",
"\n",
"You can also retrieve the history of each experiment and explore the individual metrics for each iteration run. By examining RMSE (root_mean_squared_error) for each individual model run, you see that most iterations are predicting the taxi fair cost within a reasonable margin ($3-4).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"get metrics",
"query history"
]
},
"outputs": [],
"source": [
"children = list(local_run.get_children())\n",
"metricslist = {}\n",
"for run in children:\n",
" properties = run.get_properties()\n",
" metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
" metricslist[int(properties['iteration'])] = metrics\n",
"\n",
"rundata = pd.DataFrame(metricslist).sort_index(1)\n",
"rundata"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieve the best model\n",
"\n",
"Select the best pipeline from our iterations. The `get_output` method on `automl_classifier` returns the best run and the fitted model for the last fit invocation. By using the overloads on `get_output`, you can retrieve the best run and fitted model for any logged metric or a particular iteration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run, fitted_model = local_run.get_output()\n",
"print(best_run)\n",
"print(fitted_model)"
]
},
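{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of those overloads, the following cell retrieves the run and fitted model that scored best on a different logged metric, and the model from a specific iteration. It assumes the same `local_run`; the metric name and iteration number are only examples:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: the metric name and iteration number below are examples\n",
"lowest_rmse_run, lowest_rmse_model = local_run.get_output(metric='root_mean_squared_error')\n",
"third_iteration_run, third_iteration_model = local_run.get_output(iteration=3)\n",
"print(lowest_rmse_run)\n",
"print(third_iteration_run)"
]
},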
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test the best model accuracy\n",
"\n",
"Use the best model to run predictions on the test dataset to predict taxi fares. The function `predict` uses the best model and predicts the values of y, **trip cost**, from the `x_test` dataset. Print the first 10 predicted cost values from `y_predict`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_predict = fitted_model.predict(x_test.values) \n",
"print(y_predict[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a scatter plot to visualize the predicted cost values compared to the actual cost values. The following code uses the `distance` feature as the x-axis and trip `cost` as the y-axis. To compare the variance of predicted cost at each trip distance value, the first 100 predicted and actual cost values are created as separate series. Examining the plot shows that the distance/cost relationship is nearly linear, and the predicted cost values are in most cases very close to the actual cost values for the same trip distance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"fig = plt.figure(figsize=(14, 10))\n",
"ax1 = fig.add_subplot(111)\n",
"\n",
"distance_vals = [x[4] for x in x_test.values]\n",
"y_actual = y_test.values.flatten().tolist()\n",
"\n",
"ax1.scatter(distance_vals[:100], y_predict[:100], s=18, c='b', marker=\"s\", label='Predicted')\n",
"ax1.scatter(distance_vals[:100], y_actual[:100], s=18, c='r', marker=\"o\", label='Actual')\n",
"\n",
"ax1.set_xlabel('distance (mi)')\n",
"ax1.set_title('Predicted and Actual Cost/Distance')\n",
"ax1.set_ylabel('Cost ($)')\n",
"\n",
"plt.legend(loc='upper left', prop={'size': 12})\n",
"plt.rcParams.update({'font.size': 14})\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Calculate the `root mean squared error` of the results. Use the `y_test` dataframe. Convert it to a list to compare to the predicted values. The function `mean_squared_error` takes two arrays of values and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable, **cost**. It indicates roughly how far the taxi fare predictions are from the actual fares:"
]
},
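{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the quantity computed in the next cell is the standard RMSE:\n",
"\n",
"$$\\mathrm{RMSE} = \\sqrt{\\frac{1}{n}\\sum_{i=1}^{n}\\left(y_i - \\hat{y}_i\\right)^2}$$"
]
},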
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"from math import sqrt\n",
"\n",
"rmse = sqrt(mean_squared_error(y_actual, y_predict))\n",
"rmse"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following code to calculate mean absolute percent error (MAPE) by using the full `y_actual` and `y_predict` datasets. This metric calculates an absolute difference between each predicted and actual value and sums all the differences. Then it expresses that sum as a percent of the total of the actual values:"
]
},
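{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the quantity computed below is a sum-weighted variant of MAPE: the sum of absolute errors divided by the sum of actual values:\n",
"\n",
"$$\\mathrm{MAPE} = \\frac{\\sum_{i}\\left|y_i - \\hat{y}_i\\right|}{\\sum_{i} y_i}$$"
]
},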
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sum_actuals = sum_errors = 0\n",
"\n",
"for actual_val, predict_val in zip(y_actual, y_predict):\n",
" abs_error = actual_val - predict_val\n",
" if abs_error < 0:\n",
" abs_error = abs_error * -1\n",
" \n",
" sum_errors = sum_errors + abs_error\n",
" sum_actuals = sum_actuals + actual_val\n",
" \n",
"mean_abs_percent_error = sum_errors / sum_actuals\n",
"print(\"Model MAPE:\")\n",
"print(mean_abs_percent_error)\n",
"print()\n",
"print(\"Model Accuracy:\")\n",
"print(1 - mean_abs_percent_error)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the final prediction accuracy metrics, you see that the model is fairly good at predicting taxi fares from the data set's features, typically within +- $3.00. The traditional machine learning model development process is highly resource-intensive, and requires significant domain knowledge and time investment to run and compare the results of dozens of models. Using automated machine learning is a great way to rapidly test many different models for your scenario."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean up resources\n",
"\n",
">The resources you created can be used as prerequisites to other Azure Machine Learning service tutorials and how-to articles. \n",
"\n",
"\n",
"If you do not plan to use the resources you created, delete them, so you do not incur any charges:\n",
"\n",
"1. In the Azure portal, select **Resource groups** on the far left.\n",
"\n",
"1. From the list, select the resource group you created.\n",
"\n",
"1. Select **Delete resource group**.\n",
"\n",
"1. Enter the resource group name. Then select **Delete**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"In this automated machine learning tutorial, you did the following tasks:\n",
"\n",
"* Configured a workspace and prepared data for an experiment.\n",
"* Trained by using an automated regression model locally with custom parameters.\n",
"* Explored and reviewed training results.\n",
"\n",
"[Deploy your model](https://docs.microsoft.com/azure/machine-learning/service/tutorial-deploy-models-with-aml) with Azure Machine Learning."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/tutorials/regression-part2-automated-ml.png)"
]
}
],
"metadata": {
"authors": [
{
"name": "jeffshep"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"msauthor": "sgilley"
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,10 +0,0 @@
name: regression-part2-automated-ml
dependencies:
- pip:
- azureml-sdk
- azureml-train-automl
- azureml-widgets
- azureml-explain-model
- matplotlib
- pandas_ml
- seaborn