Files
MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-bike-share/auto-ml-forecasting-bike-share.ipynb

607 lines
24 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-bike-share/auto-ml-forecasting-bike-share.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automated Machine Learning\n",
"**BikeShare Demand Forecasting**\n",
"\n",
"## Contents\n",
"1. [Introduction](#Introduction)\n",
"1. [Setup](#Setup)\n",
"1. [Data](#Data)\n",
"1. [Train](#Train)\n",
"1. [Evaluate](#Evaluate)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"This notebook demonstrates demand forecasting for a bike-sharing service using AutoML.\n",
"\n",
"AutoML highlights here include built-in holiday featurization, accessing engineered feature names, and working with the `forecast` function. Please also look at the additional forecasting notebooks, which document lagging, rolling windows, forecast quantiles, other ways to use the forecast function, and forecaster deployment.\n",
"\n",
"Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n",
"\n",
"Notebook synopsis:\n",
"1. Creating an Experiment in an existing Workspace\n",
"2. Configuration and local run of AutoML for a time-series model with lag and holiday features \n",
"3. Viewing the engineered names for featurized data and featurization summary for all raw features\n",
"4. Evaluating the fitted model using a rolling test "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.core\n",
"import pandas as pd\n",
"import numpy as np\n",
"import logging\n",
"import warnings\n",
"\n",
"from pandas.tseries.frequencies import to_offset\n",
"\n",
"# Squash warning messages for cleaner output in the notebook\n",
"warnings.showwarning = lambda *args, **kwargs: None\n",
"\n",
"from azureml.core.workspace import Workspace\n",
"from azureml.core.experiment import Experiment\n",
"from azureml.train.automl import AutoMLConfig\n",
"from matplotlib import pyplot as plt\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As part of the setup you have already created a <b>Workspace</b>. To run AutoML, you also need to create an <b>Experiment</b>. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# choose a name for the run history container in the workspace\n",
"experiment_name = 'automl-bikeshareforecasting'\n",
"# project folder\n",
"project_folder = './sample_projects/automl-local-bikeshareforecasting'\n",
"\n",
"experiment = Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output['SDK version'] = azureml.core.VERSION\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Project Directory'] = project_folder\n",
"output['Run History Name'] = experiment_name\n",
"pd.set_option('display.max_colwidth', -1)\n",
"outputDf = pd.DataFrame(data = output, index = [''])\n",
"outputDf.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"Read bike share demand data from file, and preview data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv('bike-no.csv', parse_dates=['date'])\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's set up what we know about the dataset. \n",
"\n",
"**Target column** is what we want to forecast.\n",
"\n",
"**Time column** is the time axis along which to predict.\n",
"\n",
"**Grain** is another word for an individual time series in your dataset. Grains are identified by values of the columns listed `grain_column_names`, for example \"store\" and \"item\" if your data has multiple time series of sales, one series for each combination of store and item sold.\n",
"\n",
"This dataset has only one time series. Please see the [orange juice notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-orange-juice-sales) for an example of a multi-time series dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"target_column_name = 'cnt'\n",
"time_column_name = 'date'\n",
"grain_column_names = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Split the data\n",
"\n",
"The first split we make is into train and test sets. Note we are splitting on time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train = data[data[time_column_name] < '2012-09-01']\n",
"test = data[data[time_column_name] >= '2012-09-01']\n",
"\n",
"X_train = train.copy()\n",
"y_train = X_train.pop(target_column_name).values\n",
"\n",
"X_test = test.copy()\n",
"y_test = X_test.pop(target_column_name).values\n",
"\n",
"print(X_train.shape)\n",
"print(y_train.shape)\n",
"print(X_test.shape)\n",
"print(y_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setting forecaster maximum horizon \n",
"\n",
"The forecast horizon is the number of periods into the future that the model should predict. Here, we set the horizon to 14 periods (i.e. 14 days). Notice that this is much shorter than the number of days in the test set; we will need to use a rolling test to evaluate the performance on the whole test set. For more discussion of forecast horizons and guiding principles for setting them, please see the [energy demand notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"max_horizon = 14"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train\n",
"\n",
"Instantiate a AutoMLConfig object. This defines the settings and data used to run the experiment.\n",
"\n",
"|Property|Description|\n",
"|-|-|\n",
"|**task**|forecasting|\n",
"|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>\n",
"|**iterations**|Number of iterations. In each iteration, Auto ML trains a specific pipeline on the given data|\n",
"|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|\n",
"|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n",
"|**y**|(sparse) array-like, shape = [n_samples, ], targets values.|\n",
"|**n_cross_validations**|Number of cross validation splits.|\n",
"|**country_or_region**|The country/region used to generate holiday features. These should be ISO 3166 two-letter country/region codes (i.e. 'US', 'GB').|\n",
"|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_settings = {\n",
" 'time_column_name': time_column_name,\n",
" 'max_horizon': max_horizon,\n",
" # knowing the country/region allows Automated ML to bring in holidays\n",
" 'country_or_region': 'US',\n",
" 'target_lags': 1,\n",
" # these columns are a breakdown of the total and therefore a leak\n",
" 'drop_column_names': ['casual', 'registered']\n",
"}\n",
"\n",
"automl_config = AutoMLConfig(task='forecasting', \n",
" primary_metric='normalized_root_mean_squared_error',\n",
" iterations=10,\n",
" iteration_timeout_minutes=5,\n",
" X=X_train,\n",
" y=y_train,\n",
" n_cross_validations=3, \n",
" path=project_folder,\n",
" verbosity=logging.INFO,\n",
" **automl_settings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will now run the experiment, starting with 10 iterations of model search. The experiment can be continued for more iterations if more accurate results are required. You will see the currently running iterations printing to the console."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_run = experiment.submit(automl_config, show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Displaying the run objects gives you links to the visual tools in the Azure Portal. Go try them!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve the Best Model\n",
"Below we select the best pipeline from our iterations. The get_output method on automl_classifier returns the best run and the fitted model for the last fit invocation. There are overloads on get_output that allow you to retrieve the best run and fitted model for any logged metric or a particular iteration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run, fitted_model = local_run.get_output()\n",
"fitted_model.steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View the engineered names for featurized data\n",
"\n",
"You can accees the engineered feature names generated in time-series featurization. Note that a number of named holiday periods are represented. We recommend that you have at least one year of data when using this feature to ensure that all yearly holidays are captured in the training featurization."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fitted_model.named_steps['timeseriestransformer'].get_engineered_feature_names()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View the featurization summary\n",
"\n",
"You can also see what featurization steps were performed on different raw features in the user data. For each raw feature in the user data, the following information is displayed:\n",
"\n",
"- Raw feature name\n",
"- Number of engineered features formed out of this raw feature\n",
"- Type detected\n",
"- If feature was dropped\n",
"- List of feature transformations for the raw feature"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get the featurization summary as a list of JSON\n",
"featurization_summary = fitted_model.named_steps['timeseriestransformer'].get_featurization_summary()\n",
"# View the featurization summary as a pandas dataframe\n",
"pd.DataFrame.from_records(featurization_summary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now use the best fitted model from the AutoML Run to make forecasts for the test set. \n",
"\n",
"We always score on the original dataset whose schema matches the training set schema."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_test.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now define some functions for aligning output to input and for producing rolling forecasts over the full test set. As previously stated, the forecast horizon of 14 days is shorter than the length of the test set - which is about 120 days. To get predictions over the full test set, we iterate over the test set, making forecasts 14 days at a time and combining the results. We also make sure that each 14-day forecast uses up-to-date actuals - the current context - to construct lag features. \n",
"\n",
"It is a good practice to always align the output explicitly to the input, as the count and order of the rows may have changed during transformations that span multiple rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def align_outputs(y_predicted, X_trans, X_test, y_test, predicted_column_name='predicted',\n",
" horizon_colname='horizon_origin'):\n",
" \"\"\"\n",
" Demonstrates how to get the output aligned to the inputs\n",
" using pandas indexes. Helps understand what happened if\n",
" the output's shape differs from the input shape, or if\n",
" the data got re-sorted by time and grain during forecasting.\n",
" \n",
" Typical causes of misalignment are:\n",
" * we predicted some periods that were missing in actuals -> drop from eval\n",
" * model was asked to predict past max_horizon -> increase max horizon\n",
" * data at start of X_test was needed for lags -> provide previous periods\n",
" \"\"\"\n",
" df_fcst = pd.DataFrame({predicted_column_name : y_predicted,\n",
" horizon_colname: X_trans[horizon_colname]})\n",
" # y and X outputs are aligned by forecast() function contract\n",
" df_fcst.index = X_trans.index\n",
" \n",
" # align original X_test to y_test \n",
" X_test_full = X_test.copy()\n",
" X_test_full[target_column_name] = y_test\n",
"\n",
" # X_test_full's index does not include origin, so reset for merge\n",
" df_fcst.reset_index(inplace=True)\n",
" X_test_full = X_test_full.reset_index().drop(columns='index')\n",
" together = df_fcst.merge(X_test_full, how='right')\n",
" \n",
" # drop rows where prediction or actuals are nan \n",
" # happens because of missing actuals \n",
" # or at edges of time due to lags/rolling windows\n",
" clean = together[together[[target_column_name, predicted_column_name]].notnull().all(axis=1)]\n",
" return(clean)\n",
"\n",
"def do_rolling_forecast(fitted_model, X_test, y_test, max_horizon, freq='D'):\n",
" \"\"\"\n",
" Produce forecasts on a rolling origin over the given test set.\n",
" \n",
" Each iteration makes a forecast for the next 'max_horizon' periods \n",
" with respect to the current origin, then advances the origin by the horizon time duration. \n",
" The prediction context for each forecast is set so that the forecaster uses \n",
" the actual target values prior to the current origin time for constructing lag features.\n",
" \n",
" This function returns a concatenated DataFrame of rolling forecasts.\n",
" \"\"\"\n",
" df_list = []\n",
" origin_time = X_test[time_column_name].min()\n",
" while origin_time <= X_test[time_column_name].max():\n",
" # Set the horizon time - end date of the forecast\n",
" horizon_time = origin_time + max_horizon * to_offset(freq)\n",
" \n",
" # Extract test data from an expanding window up-to the horizon \n",
" expand_wind = (X_test[time_column_name] < horizon_time)\n",
" X_test_expand = X_test[expand_wind]\n",
" y_query_expand = np.zeros(len(X_test_expand)).astype(np.float)\n",
" y_query_expand.fill(np.NaN)\n",
" \n",
" if origin_time != X_test[time_column_name].min():\n",
" # Set the context by including actuals up-to the origin time\n",
" test_context_expand_wind = (X_test[time_column_name] < origin_time)\n",
" context_expand_wind = (X_test_expand[time_column_name] < origin_time)\n",
" y_query_expand[context_expand_wind] = y_test[test_context_expand_wind]\n",
" \n",
" # Make a forecast out to the maximum horizon\n",
" y_fcst, X_trans = fitted_model.forecast(X_test_expand, y_query_expand)\n",
" \n",
" # Align forecast with test set for dates within the current rolling window \n",
" trans_tindex = X_trans.index.get_level_values(time_column_name)\n",
" trans_roll_wind = (trans_tindex >= origin_time) & (trans_tindex < horizon_time)\n",
" test_roll_wind = expand_wind & (X_test[time_column_name] >= origin_time)\n",
" df_list.append(align_outputs(y_fcst[trans_roll_wind], X_trans[trans_roll_wind],\n",
" X_test[test_roll_wind], y_test[test_roll_wind]))\n",
" \n",
" # Advance the origin time\n",
" origin_time = horizon_time\n",
" \n",
" return pd.concat(df_list, ignore_index=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_all = do_rolling_forecast(fitted_model, X_test, y_test, max_horizon)\n",
"df_all"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now calculate some error metrics for the forecasts and vizualize the predictions vs. the actuals."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def APE(actual, pred):\n",
" \"\"\"\n",
" Calculate absolute percentage error.\n",
" Returns a vector of APE values with same length as actual/pred.\n",
" \"\"\"\n",
" return 100*np.abs((actual - pred)/actual)\n",
"\n",
"def MAPE(actual, pred):\n",
" \"\"\"\n",
" Calculate mean absolute percentage error.\n",
" Remove NA and values where actual is close to zero\n",
" \"\"\"\n",
" not_na = ~(np.isnan(actual) | np.isnan(pred))\n",
" not_zero = ~np.isclose(actual, 0.0)\n",
" actual_safe = actual[not_na & not_zero]\n",
" pred_safe = pred[not_na & not_zero]\n",
" return np.mean(APE(actual_safe, pred_safe))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Simple forecasting model\")\n",
"rmse = np.sqrt(mean_squared_error(df_all[target_column_name], df_all['predicted']))\n",
"print(\"[Test Data] \\nRoot Mean squared error: %.2f\" % rmse)\n",
"mae = mean_absolute_error(df_all[target_column_name], df_all['predicted'])\n",
"print('mean_absolute_error score: %.2f' % mae)\n",
"print('MAPE: %.2f' % MAPE(df_all[target_column_name], df_all['predicted']))\n",
"\n",
"# Plot outputs\n",
"%matplotlib inline\n",
"test_pred = plt.scatter(df_all[target_column_name], df_all['predicted'], color='b')\n",
"test_test = plt.scatter(y_test, y_test, color='g')\n",
"plt.legend((test_pred, test_test), ('prediction', 'truth'), loc='upper left', fontsize=8)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The MAPE seems high; it is being skewed by an actual with a small absolute value. For a more informative evaluation, we can calculate the metrics by forecast horizon:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_all.groupby('horizon_origin').apply(\n",
" lambda df: pd.Series({'MAPE': MAPE(df[target_column_name], df['predicted']),\n",
" 'RMSE': np.sqrt(mean_squared_error(df[target_column_name], df['predicted'])),\n",
" 'MAE': mean_absolute_error(df[target_column_name], df['predicted'])}))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's also interesting to see the distributions of APE (absolute percentage error) by horizon. On a log scale, the outlying APE in the horizon-3 group is clear."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_all_APE = df_all.assign(APE=APE(df_all[target_column_name], df_all['predicted']))\n",
"APEs = [df_all_APE[df_all['horizon_origin'] == h].APE.values for h in range(1, max_horizon + 1)]\n",
"\n",
"%matplotlib inline\n",
"plt.boxplot(APEs)\n",
"plt.yscale('log')\n",
"plt.xlabel('horizon')\n",
"plt.ylabel('APE (%)')\n",
"plt.title('Absolute Percentage Errors by Forecast Horizon')\n",
"\n",
"plt.show()"
]
}
],
"metadata": {
"authors": [
{
"name": "erwright"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}