mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-20 01:27:06 -05:00
856 lines
36 KiB
Plaintext
856 lines
36 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Automated Machine Learning\n",
|
|
"\n",
|
|
"#### Forecasting away from training data\n",
|
|
"\n",
|
|
"\n",
|
|
"## Contents\n",
|
|
"1. [Introduction](#Introduction)\n",
|
|
"2. [Setup](#Setup)\n",
|
|
"3. [Data](#Data)\n",
|
|
"4. [Prepare remote compute and data.](#prepare_remote)\n",
|
|
"4. [Create the configuration and train a forecaster](#train)\n",
|
|
"5. [Forecasting from the trained model](#forecasting)\n",
|
|
"6. [Forecasting away from training data](#forecasting_away)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Introduction\n",
|
|
"This notebook demonstrates the full interface to the `forecast()` function. \n",
|
|
"\n",
|
|
"The best known and most frequent usage of `forecast` enables forecasting on test sets that immediately follows training data. \n",
|
|
"\n",
|
|
"However, in many use cases it is necessary to continue using the model for some time before retraining it. This happens especially in **high frequency forecasting** when forecasts need to be made more frequently than the model can be retrained. Examples are in Internet of Things and predictive cloud resource scaling.\n",
|
|
"\n",
|
|
"Here we show how to use the `forecast()` function when a time gap exists between training data and prediction period.\n",
|
|
"\n",
|
|
"Terminology:\n",
|
|
"* forecast origin: the last period when the target value is known\n",
|
|
"* forecast periods(s): the period(s) for which the value of the target is desired.\n",
|
|
"* lookback: how many past periods (before forecast origin) the model function depends on. The larger of number of lags and length of rolling window.\n",
|
|
"* prediction context: `lookback` periods immediately preceding the forecast origin\n",
|
|
"\n",
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Setup"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Please make sure you have followed the `configuration.ipynb` notebook so that your ML workspace information is saved in the config file."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import os\n",
|
|
"import pandas as pd\n",
|
|
"import numpy as np\n",
|
|
"import logging\n",
|
|
"import warnings\n",
|
|
"\n",
|
|
"import azureml.core\n",
|
|
"from azureml.core.dataset import Dataset\n",
|
|
"from pandas.tseries.frequencies import to_offset\n",
|
|
"from azureml.core.compute import AmlCompute\n",
|
|
"from azureml.core.compute import ComputeTarget\n",
|
|
"from azureml.core.runconfig import RunConfiguration\n",
|
|
"from azureml.core.conda_dependencies import CondaDependencies\n",
|
|
"\n",
|
|
"# Squash warning messages for cleaner output in the notebook\n",
|
|
"warnings.showwarning = lambda *args, **kwargs: None\n",
|
|
"\n",
|
|
"np.set_printoptions(precision=4, suppress=True, linewidth=120)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"This sample notebook may use features that are not available in previous versions of the Azure ML SDK."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(\"This notebook was created using version 1.13.0 of the Azure ML SDK\")\n",
|
|
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.workspace import Workspace\n",
|
|
"from azureml.core.experiment import Experiment\n",
|
|
"from azureml.train.automl import AutoMLConfig\n",
|
|
"\n",
|
|
"ws = Workspace.from_config()\n",
|
|
"\n",
|
|
"# choose a name for the run history container in the workspace\n",
|
|
"experiment_name = 'automl-forecast-function-demo'\n",
|
|
"\n",
|
|
"experiment = Experiment(ws, experiment_name)\n",
|
|
"\n",
|
|
"output = {}\n",
|
|
"output['Subscription ID'] = ws.subscription_id\n",
|
|
"output['Workspace'] = ws.name\n",
|
|
"output['SKU'] = ws.sku\n",
|
|
"output['Resource Group'] = ws.resource_group\n",
|
|
"output['Location'] = ws.location\n",
|
|
"output['Run History Name'] = experiment_name\n",
|
|
"pd.set_option('display.max_colwidth', -1)\n",
|
|
"outputDf = pd.DataFrame(data = output, index = [''])\n",
|
|
"outputDf.T"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Data\n",
|
|
"For the demonstration purposes we will generate the data artificially and use them for the forecasting."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"TIME_COLUMN_NAME = 'date'\n",
|
|
"TIME_SERIES_ID_COLUMN_NAME = 'time_series_id'\n",
|
|
"TARGET_COLUMN_NAME = 'y'\n",
|
|
"\n",
|
|
"def get_timeseries(train_len: int,\n",
|
|
" test_len: int,\n",
|
|
" time_column_name: str,\n",
|
|
" target_column_name: str,\n",
|
|
" time_series_id_column_name: str,\n",
|
|
" time_series_number: int = 1,\n",
|
|
" freq: str = 'H'):\n",
|
|
" \"\"\"\n",
|
|
" Return the time series of designed length.\n",
|
|
"\n",
|
|
" :param train_len: The length of training data (one series).\n",
|
|
" :type train_len: int\n",
|
|
" :param test_len: The length of testing data (one series).\n",
|
|
" :type test_len: int\n",
|
|
" :param time_column_name: The desired name of a time column.\n",
|
|
" :type time_column_name: str\n",
|
|
" :param time_series_number: The number of time series in the data set.\n",
|
|
" :type time_series_number: int\n",
|
|
" :param freq: The frequency string representing pandas offset.\n",
|
|
" see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html\n",
|
|
" :type freq: str\n",
|
|
" :returns: the tuple of train and test data sets.\n",
|
|
" :rtype: tuple\n",
|
|
"\n",
|
|
" \"\"\"\n",
|
|
" data_train = [] # type: List[pd.DataFrame]\n",
|
|
" data_test = [] # type: List[pd.DataFrame]\n",
|
|
" data_length = train_len + test_len\n",
|
|
" for i in range(time_series_number):\n",
|
|
" X = pd.DataFrame({\n",
|
|
" time_column_name: pd.date_range(start='2000-01-01',\n",
|
|
" periods=data_length,\n",
|
|
" freq=freq),\n",
|
|
" target_column_name: np.arange(data_length).astype(float) + np.random.rand(data_length) + i*5,\n",
|
|
" 'ext_predictor': np.asarray(range(42, 42 + data_length)),\n",
|
|
" time_series_id_column_name: np.repeat('ts{}'.format(i), data_length)\n",
|
|
" })\n",
|
|
" data_train.append(X[:train_len])\n",
|
|
" data_test.append(X[train_len:])\n",
|
|
" X_train = pd.concat(data_train)\n",
|
|
" y_train = X_train.pop(target_column_name).values\n",
|
|
" X_test = pd.concat(data_test)\n",
|
|
" y_test = X_test.pop(target_column_name).values\n",
|
|
" return X_train, y_train, X_test, y_test\n",
|
|
"\n",
|
|
"n_test_periods = 6\n",
|
|
"n_train_periods = 30\n",
|
|
"X_train, y_train, X_test, y_test = get_timeseries(train_len=n_train_periods,\n",
|
|
" test_len=n_test_periods,\n",
|
|
" time_column_name=TIME_COLUMN_NAME,\n",
|
|
" target_column_name=TARGET_COLUMN_NAME,\n",
|
|
" time_series_id_column_name=TIME_SERIES_ID_COLUMN_NAME,\n",
|
|
" time_series_number=2)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Let's see what the training data looks like."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"X_train.tail()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# plot the example time series\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"whole_data = X_train.copy()\n",
|
|
"target_label = 'y'\n",
|
|
"whole_data[target_label] = y_train\n",
|
|
"for g in whole_data.groupby('time_series_id'): \n",
|
|
" plt.plot(g[1]['date'].values, g[1]['y'].values, label=g[0])\n",
|
|
"plt.legend()\n",
|
|
"plt.show()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Prepare remote compute and data. <a id=\"prepare_remote\"></a>\n",
|
|
"The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace), is paired with the storage account, which contains the default data store. We will use it to upload the artificial data and create [tabular dataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into tabular representation."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# We need to save thw artificial data and then upload them to default workspace datastore.\n",
|
|
"DATA_PATH = \"fc_fn_data\"\n",
|
|
"DATA_PATH_X = \"{}/data_train.csv\".format(DATA_PATH)\n",
|
|
"if not os.path.isdir('data'):\n",
|
|
" os.mkdir('data')\n",
|
|
"pd.DataFrame(whole_data).to_csv(\"data/data_train.csv\", index=False)\n",
|
|
"# Upload saved data to the default data store.\n",
|
|
"ds = ws.get_default_datastore()\n",
|
|
"ds.upload(src_dir='./data', target_path=DATA_PATH, overwrite=True, show_progress=True)\n",
|
|
"train_data = Dataset.Tabular.from_delimited_files(path=ds.path(DATA_PATH_X))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"You will need to create a [compute target](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
|
|
"from azureml.core.compute_target import ComputeTargetException\n",
|
|
"\n",
|
|
"# Choose a name for your CPU cluster\n",
|
|
"amlcompute_cluster_name = \"fcfn-cluster\"\n",
|
|
"\n",
|
|
"# Verify that cluster does not exist already\n",
|
|
"try:\n",
|
|
" compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)\n",
|
|
" print('Found existing cluster, use it.')\n",
|
|
"except ComputeTargetException:\n",
|
|
" compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',\n",
|
|
" max_nodes=6)\n",
|
|
" compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)\n",
|
|
"\n",
|
|
"compute_target.wait_for_completion(show_output=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Create the configuration and train a forecaster <a id=\"train\"></a>\n",
|
|
"First generate the configuration, in which we:\n",
|
|
"* Set metadata columns: target, time column and time-series id column names.\n",
|
|
"* Validate our data using cross validation with rolling window method.\n",
|
|
"* Set normalized root mean squared error as a metric to select the best model.\n",
|
|
"* Set early termination to True, so the iterations through the models will stop when no improvements in accuracy score will be made.\n",
|
|
"* Set limitations on the length of experiment run to 15 minutes.\n",
|
|
"* Finally, we set the task to be forecasting.\n",
|
|
"* We apply the lag lead operator to the target value i.e. we use the previous values as a predictor for the future ones."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.automl.core.forecasting_parameters import ForecastingParameters\n",
|
|
"lags = [1,2,3]\n",
|
|
"forecast_horizon = n_test_periods\n",
|
|
"forecasting_parameters = ForecastingParameters(\n",
|
|
" time_column_name=TIME_COLUMN_NAME,\n",
|
|
" forecast_horizon=forecast_horizon,\n",
|
|
" time_series_id_column_names=[ TIME_SERIES_ID_COLUMN_NAME ],\n",
|
|
" target_lags=lags\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Run the model selection and training process. Validation errors and current status will be shown when setting `show_output=True` and the execution will be synchronous."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.workspace import Workspace\n",
|
|
"from azureml.core.experiment import Experiment\n",
|
|
"from azureml.train.automl import AutoMLConfig\n",
|
|
"\n",
|
|
"\n",
|
|
"automl_config = AutoMLConfig(task='forecasting',\n",
|
|
" debug_log='automl_forecasting_function.log',\n",
|
|
" primary_metric='normalized_root_mean_squared_error',\n",
|
|
" experiment_timeout_hours=0.25,\n",
|
|
" enable_early_stopping=True,\n",
|
|
" training_data=train_data,\n",
|
|
" compute_target=compute_target,\n",
|
|
" n_cross_validations=3,\n",
|
|
" verbosity = logging.INFO,\n",
|
|
" max_concurrent_iterations=4,\n",
|
|
" max_cores_per_iteration=-1,\n",
|
|
" label_column_name=target_label,\n",
|
|
" forecasting_parameters=forecasting_parameters)\n",
|
|
"\n",
|
|
"remote_run = experiment.submit(automl_config, show_output=False)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"remote_run.wait_for_completion()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Retrieve the best model to use it further.\n",
|
|
"_, fitted_model = remote_run.get_output()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Forecasting from the trained model <a id=\"forecasting\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In this section we will review the `forecast` interface for two main scenarios: forecasting right after the training data, and the more complex interface for forecasting when there is a gap (in the time sense) between training and testing data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### X_train is directly followed by the X_test\n",
|
|
"\n",
|
|
"Let's first consider the case when the prediction period immediately follows the training data. This is typical in scenarios where we have the time to retrain the model every time we wish to forecast. Forecasts that are made on daily and slower cadence typically fall into this category. Retraining the model every time benefits the accuracy because the most recent data is often the most informative.\n",
|
|
"\n",
|
|
"\n",
|
|
"\n",
|
|
"We use `X_test` as a **forecast request** to generate the predictions."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Typical path: X_test is known, forecast all upcoming periods"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# The data set contains hourly data, the training set ends at 01/02/2000 at 05:00\n",
|
|
"\n",
|
|
"# These are predictions we are asking the model to make (does not contain thet target column y),\n",
|
|
"# for 6 periods beginning with 2000-01-02 06:00, which immediately follows the training data\n",
|
|
"X_test"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"y_pred_no_gap, xy_nogap = fitted_model.forecast(X_test)\n",
|
|
"\n",
|
|
"# xy_nogap contains the predictions in the _automl_target_col column.\n",
|
|
"# Those same numbers are output in y_pred_no_gap\n",
|
|
"xy_nogap"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Confidence intervals"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Forecasting model may be used for the prediction of forecasting intervals by running ```forecast_quantiles()```. \n",
|
|
"This method accepts the same parameters as forecast()."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"quantiles = fitted_model.forecast_quantiles(X_test)\n",
|
|
"quantiles"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Distribution forecasts\n",
|
|
"\n",
|
|
"Often the figure of interest is not just the point prediction, but the prediction at some quantile of the distribution. \n",
|
|
"This arises when the forecast is used to control some kind of inventory, for example of grocery items or virtual machines for a cloud service. In such case, the control point is usually something like \"we want the item to be in stock and not run out 99% of the time\". This is called a \"service level\". Here is how you get quantile forecasts."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# specify which quantiles you would like \n",
|
|
"fitted_model.quantiles = [0.01, 0.5, 0.95]\n",
|
|
"# use forecast_quantiles function, not the forecast() one\n",
|
|
"y_pred_quantiles = fitted_model.forecast_quantiles(X_test)\n",
|
|
"\n",
|
|
"# quantile forecasts returned in a Dataframe along with the time and time series id columns \n",
|
|
"y_pred_quantiles"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Destination-date forecast: \"just do something\"\n",
|
|
"\n",
|
|
"In some scenarios, the X_test is not known. The forecast is likely to be weak, because it is missing contemporaneous predictors, which we will need to impute. If you still wish to predict forward under the assumption that the last known values will be carried forward, you can forecast out to \"destination date\". The destination date still needs to fit within the forecast horizon from training."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# We will take the destination date as a last date in the test set.\n",
|
|
"dest = max(X_test[TIME_COLUMN_NAME])\n",
|
|
"y_pred_dest, xy_dest = fitted_model.forecast(forecast_destination=dest)\n",
|
|
"\n",
|
|
"# This form also shows how we imputed the predictors which were not given. (Not so well! Use with caution!)\n",
|
|
"xy_dest"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Forecasting away from training data <a id=\"forecasting_away\"></a>\n",
|
|
"\n",
|
|
"Suppose we trained a model, some time passed, and now we want to apply the model without re-training. If the model \"looks back\" -- uses previous values of the target -- then we somehow need to provide those values to the model.\n",
|
|
"\n",
|
|
"\n",
|
|
"\n",
|
|
"The notion of forecast origin comes into play: the forecast origin is **the last period for which we have seen the target value**. This applies per time-series, so each time-series can have a different forecast origin. \n",
|
|
"\n",
|
|
"The part of data before the forecast origin is the **prediction context**. To provide the context values the model needs when it looks back, we pass definite values in `y_test` (aligned with corresponding times in `X_test`)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# generate the same kind of test data we trained on, \n",
|
|
"# but now make the train set much longer, so that the test set will be in the future\n",
|
|
"X_context, y_context, X_away, y_away = get_timeseries(train_len=42, # train data was 30 steps long\n",
|
|
" test_len=4,\n",
|
|
" time_column_name=TIME_COLUMN_NAME,\n",
|
|
" target_column_name=TARGET_COLUMN_NAME,\n",
|
|
" time_series_id_column_name=TIME_SERIES_ID_COLUMN_NAME,\n",
|
|
" time_series_number=2)\n",
|
|
"\n",
|
|
"# end of the data we trained on\n",
|
|
"print(X_train.groupby(TIME_SERIES_ID_COLUMN_NAME)[TIME_COLUMN_NAME].max())\n",
|
|
"# start of the data we want to predict on\n",
|
|
"print(X_away.groupby(TIME_SERIES_ID_COLUMN_NAME)[TIME_COLUMN_NAME].min())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"There is a gap of 12 hours between end of training and beginning of `X_away`. (It looks like 13 because all timestamps point to the start of the one hour periods.) Using only `X_away` will fail without adding context data for the model to consume."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"try: \n",
|
|
" y_pred_away, xy_away = fitted_model.forecast(X_away)\n",
|
|
" xy_away\n",
|
|
"except Exception as e:\n",
|
|
" print(e)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"How should we read that eror message? The forecast origin is at the last time the model saw an actual value of `y` (the target). That was at the end of the training data! The model is attempting to forecast from the end of training data. But the requested forecast periods are past the forecast horizon. We need to provide a define `y` value to establish the forecast origin.\n",
|
|
"\n",
|
|
"We will use this helper function to take the required amount of context from the data preceding the testing data. It's definition is intentionally simplified to keep the idea in the clear."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def make_forecasting_query(fulldata, time_column_name, target_column_name, forecast_origin, horizon, lookback):\n",
|
|
"\n",
|
|
" \"\"\"\n",
|
|
" This function will take the full dataset, and create the query\n",
|
|
" to predict all values of the time series from the `forecast_origin`\n",
|
|
" forward for the next `horizon` horizons. Context from previous\n",
|
|
" `lookback` periods will be included.\n",
|
|
"\n",
|
|
" \n",
|
|
"\n",
|
|
" fulldata: pandas.DataFrame a time series dataset. Needs to contain X and y.\n",
|
|
" time_column_name: string which column (must be in fulldata) is the time axis\n",
|
|
" target_column_name: string which column (must be in fulldata) is to be forecast\n",
|
|
" forecast_origin: datetime type the last time we (pretend to) have target values \n",
|
|
" horizon: timedelta how far forward, in time units (not periods)\n",
|
|
" lookback: timedelta how far back does the model look?\n",
|
|
"\n",
|
|
" Example:\n",
|
|
"\n",
|
|
"\n",
|
|
" ```\n",
|
|
"\n",
|
|
" forecast_origin = pd.to_datetime('2012-09-01') + pd.DateOffset(days=5) # forecast 5 days after end of training\n",
|
|
" print(forecast_origin)\n",
|
|
"\n",
|
|
" X_query, y_query = make_forecasting_query(data, \n",
|
|
" forecast_origin = forecast_origin,\n",
|
|
" horizon = pd.DateOffset(days=7), # 7 days into the future\n",
|
|
" lookback = pd.DateOffset(days=1), # model has lag 1 period (day)\n",
|
|
" )\n",
|
|
"\n",
|
|
" ```\n",
|
|
" \"\"\"\n",
|
|
"\n",
|
|
" X_past = fulldata[ (fulldata[ time_column_name ] > forecast_origin - lookback) &\n",
|
|
" (fulldata[ time_column_name ] <= forecast_origin)\n",
|
|
" ]\n",
|
|
"\n",
|
|
" X_future = fulldata[ (fulldata[ time_column_name ] > forecast_origin) &\n",
|
|
" (fulldata[ time_column_name ] <= forecast_origin + horizon)\n",
|
|
" ]\n",
|
|
"\n",
|
|
" y_past = X_past.pop(target_column_name).values.astype(np.float)\n",
|
|
" y_future = X_future.pop(target_column_name).values.astype(np.float)\n",
|
|
"\n",
|
|
" # Now take y_future and turn it into question marks\n",
|
|
" y_query = y_future.copy().astype(np.float) # because sometimes life hands you an int\n",
|
|
" y_query.fill(np.NaN)\n",
|
|
"\n",
|
|
"\n",
|
|
" print(\"X_past is \" + str(X_past.shape) + \" - shaped\")\n",
|
|
" print(\"X_future is \" + str(X_future.shape) + \" - shaped\")\n",
|
|
" print(\"y_past is \" + str(y_past.shape) + \" - shaped\")\n",
|
|
" print(\"y_query is \" + str(y_query.shape) + \" - shaped\")\n",
|
|
"\n",
|
|
"\n",
|
|
" X_pred = pd.concat([X_past, X_future])\n",
|
|
" y_pred = np.concatenate([y_past, y_query])\n",
|
|
" return X_pred, y_pred"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Let's see where the context data ends - it ends, by construction, just before the testing data starts."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(X_context.groupby(TIME_SERIES_ID_COLUMN_NAME)[TIME_COLUMN_NAME].agg(['min','max','count']))\n",
|
|
"print(X_away.groupby(TIME_SERIES_ID_COLUMN_NAME)[TIME_COLUMN_NAME].agg(['min','max','count']))\n",
|
|
"X_context.tail(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Since the length of the lookback is 3, \n",
|
|
"# we need to add 3 periods from the context to the request\n",
|
|
"# so that the model has the data it needs\n",
|
|
"\n",
|
|
"# Put the X and y back together for a while. \n",
|
|
"# They like each other and it makes them happy.\n",
|
|
"X_context[TARGET_COLUMN_NAME] = y_context\n",
|
|
"X_away[TARGET_COLUMN_NAME] = y_away\n",
|
|
"fulldata = pd.concat([X_context, X_away])\n",
|
|
"\n",
|
|
"# forecast origin is the last point of data, which is one 1-hr period before test\n",
|
|
"forecast_origin = X_away[TIME_COLUMN_NAME].min() - pd.DateOffset(hours=1)\n",
|
|
"# it is indeed the last point of the context\n",
|
|
"assert forecast_origin == X_context[TIME_COLUMN_NAME].max()\n",
|
|
"print(\"Forecast origin: \" + str(forecast_origin))\n",
|
|
" \n",
|
|
"# the model uses lags and rolling windows to look back in time\n",
|
|
"n_lookback_periods = max(lags)\n",
|
|
"lookback = pd.DateOffset(hours=n_lookback_periods)\n",
|
|
"\n",
|
|
"horizon = pd.DateOffset(hours=forecast_horizon)\n",
|
|
"\n",
|
|
"# now make the forecast query from context (refer to figure)\n",
|
|
"X_pred, y_pred = make_forecasting_query(fulldata, TIME_COLUMN_NAME, TARGET_COLUMN_NAME,\n",
|
|
" forecast_origin, horizon, lookback)\n",
|
|
"\n",
|
|
"# show the forecast request aligned\n",
|
|
"X_show = X_pred.copy()\n",
|
|
"X_show[TARGET_COLUMN_NAME] = y_pred\n",
|
|
"X_show"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Note that the forecast origin is at 17:00 for both time-series, and periods from 18:00 are to be forecast."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Now everything works\n",
|
|
"y_pred_away, xy_away = fitted_model.forecast(X_pred, y_pred)\n",
|
|
"\n",
|
|
"# show the forecast aligned\n",
|
|
"X_show = xy_away.reset_index()\n",
|
|
"# without the generated features\n",
|
|
"X_show[['date', 'time_series_id', 'ext_predictor', '_automl_target_col']]\n",
|
|
"# prediction is in _automl_target_col"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Forecasting farther than the forecast horizon <a id=\"recursive forecasting\"></a>\n",
|
|
"When the forecast destination, or the latest date in the prediction data frame, is farther into the future than the specified forecast horizon, the `forecast()` function will still make point predictions out to the later date using a recursive operation mode. Internally, the method recursively applies the regular forecaster to generate context so that we can forecast further into the future. \n",
|
|
"\n",
|
|
"To illustrate the use-case and operation of recursive forecasting, we'll consider an example with a single time-series where the forecasting period directly follows the training period and is twice as long as the forecasting horizon given at training time.\n",
|
|
"\n",
|
|
"\n",
|
|
"\n",
|
|
"Internally, we apply the forecaster in an iterative manner and finish the forecast task in two interations. In the first iteration, we apply the forecaster and get the prediction for the first forecast-horizon periods (y_pred1). In the second iteraction, y_pred1 is used as the context to produce the prediction for the next forecast-horizon periods (y_pred2). The combination of (y_pred1 and y_pred2) gives the results for the total forecast periods. \n",
|
|
"\n",
|
|
"A caveat: forecast accuracy will likely be worse the farther we predict into the future since errors are compounded with recursive application of the forecaster.\n",
|
|
"\n",
|
|
"\n",
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# generate the same kind of test data we trained on, but with a single time-series and test period twice as long\n",
|
|
"# as the forecast_horizon.\n",
|
|
"_, _, X_test_long, y_test_long = get_timeseries(train_len=n_train_periods,\n",
|
|
" test_len=forecast_horizon*2,\n",
|
|
" time_column_name=TIME_COLUMN_NAME,\n",
|
|
" target_column_name=TARGET_COLUMN_NAME,\n",
|
|
" time_series_id_column_name=TIME_SERIES_ID_COLUMN_NAME,\n",
|
|
" time_series_number=1)\n",
|
|
"\n",
|
|
"print(X_test_long.groupby(TIME_SERIES_ID_COLUMN_NAME)[TIME_COLUMN_NAME].min())\n",
|
|
"print(X_test_long.groupby(TIME_SERIES_ID_COLUMN_NAME)[TIME_COLUMN_NAME].max())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# forecast() function will invoke the recursive forecast method internally.\n",
|
|
"y_pred_long, X_trans_long = fitted_model.forecast(X_test_long)\n",
|
|
"y_pred_long"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# What forecast() function does in this case is equivalent to iterating it twice over the test set as the following. \n",
|
|
"y_pred1, _ = fitted_model.forecast(X_test_long[:forecast_horizon])\n",
|
|
"y_pred_all, _ = fitted_model.forecast(X_test_long, np.concatenate((y_pred1, np.full(forecast_horizon, np.nan))))\n",
|
|
"np.array_equal(y_pred_all, y_pred_long)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Confidence interval and distributional forecasts\n",
|
|
"AutoML cannot currently estimate forecast errors beyond the forecast horizon set during training, so the `forecast_quantiles()` function will return missing values for quantiles not equal to 0.5 beyond the forecast horizon. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"fitted_model.forecast_quantiles(X_test_long)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Similarly with the simple senarios illustrated above, forecasting farther than the forecast horizon in other senarios like 'multiple time-series', 'Destination-date forecast', and 'forecast away from the training data' are also automatically handled by the `forecast()` function. "
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "erwright"
|
|
}
|
|
],
|
|
"category": "tutorial",
|
|
"compute": [
|
|
"Remote"
|
|
],
|
|
"datasets": [
|
|
"None"
|
|
],
|
|
"deployment": [
|
|
"None"
|
|
],
|
|
"exclude_from_index": false,
|
|
"framework": [
|
|
"Azure ML AutoML"
|
|
],
|
|
"friendly_name": "Forecasting away from training data",
|
|
"index_order": 3,
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.8"
|
|
},
|
|
"tags": [
|
|
"Forecasting",
|
|
"Confidence Intervals"
|
|
],
|
|
"task": "Forecasting"
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
} |