Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Automated Machine Learning: Energy Demand Forecasting

In this example, we show how AutoML can be used for energy demand forecasting.

Make sure you have executed the [configuration](../configuration.ipynb) before running this notebook.

In this notebook you will see how to:
1. Create an Experiment in an existing Workspace
2. Instantiate AutoMLConfig with the task type "forecasting" for time series training, along with other time series settings (for this dataset we only need the basic one, "time_column_name")
3. Train the model using local compute
4. Explore the results
5. Test the fitted model

## Create Experiment

As part of the setup you have already created a <b>Workspace</b>. For AutoML you also need to create an <b>Experiment</b>. An <b>Experiment</b> is a named object in a <b>Workspace</b> that groups related runs.

In [None]:
import azureml.core
import pandas as pd
import numpy as np
import os
import logging
import warnings
# Squash warning messages for cleaner output in the notebook
warnings.showwarning = lambda *args, **kwargs: None


from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
ws = Workspace.from_config()

# choose a name for the run history container in the workspace
experiment_name = 'automl-energydemandforecasting'
# project folder
project_folder = './sample_projects/automl-local-energydemandforecasting'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Run History Name'] = experiment_name
# None means "no truncation"; pandas releases before 1.0 used -1 for this instead
pd.set_option('display.max_colwidth', None)
pd.DataFrame(data=output, index=['']).T

## Read Data
Read the energy demand data from file and preview it.

In [None]:
data = pd.read_csv("nyc_energy.csv", parse_dates=['timeStamp'])
data.head()

### Split the data into train and test sets



In [None]:
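# Take explicit copies so that the in-place pop('demand') calls below work on independent frames
train = data[data['timeStamp'] < '2017-02-01'].copy()
test = data[data['timeStamp'] >= '2017-02-01'].copy()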


### Prepare the test data

We will feed X_test to the fitted model to get predictions.

In [None]:
y_test = test.pop('demand').values
X_test = test

### Split the training data into training and validation sets

Use the last month of the training data as the validation set.


In [None]:
# Explicit copies again, so pop('demand') below does not act on a slice of `train`
X_train = train[train['timeStamp'] < '2017-01-01'].copy()
X_valid = train[train['timeStamp'] >= '2017-01-01'].copy()
y_train = X_train.pop('demand').values
y_valid = X_valid.pop('demand').values
print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)
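
The two chronological splits above follow the same pattern: rows before a cutoff go to one set, the rest to the other, with no shuffling. A minimal sketch of that pattern on synthetic hourly data is shown below. The helper `split_by_time` and the random demand values are assumptions made for illustration; they are not part of the notebook's dataset or of the Azure ML SDK.

```python
import numpy as np
import pandas as pd


def split_by_time(df, time_col, cutoff):
    """Hypothetical helper: rows strictly before `cutoff` go left, the rest go right."""
    cutoff = pd.Timestamp(cutoff)
    before = df[df[time_col] < cutoff].copy()
    after = df[df[time_col] >= cutoff].copy()
    return before, after


# Synthetic hourly data standing in for nyc_energy.csv (values are made up).
rng = np.random.default_rng(0)
stamps = pd.date_range("2016-12-01", periods=90 * 24, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({
    "timeStamp": stamps,
    "demand": rng.normal(6000.0, 500.0, len(stamps)),
})

# First cut off the test month, then cut the validation month off what remains.
train_part, test_part = split_by_time(toy, "timeStamp", "2017-02-01")
fit_part, valid_part = split_by_time(train_part, "timeStamp", "2017-01-01")

print(len(fit_part), len(valid_part), len(test_part))
```

Splitting on time rather than at random keeps every validation and test timestamp later than the data the model was fitted on, which mirrors how a forecast is used in practice.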

## Instantiate Auto ML Config

Instantiate an AutoMLConfig object. This defines the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|forecasting|
|**primary_metric**|The metric that you want to optimize.<br>Forecasting supports the following primary metrics:<br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**iterations**|Number of iterations. In each iteration, AutoML trains a specific pipeline on the given data.|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**X**|Training features. (sparse) array-like, shape = [n_samples, n_features]|
|**y**|Training target values (here, the numeric demand). (sparse) array-like, shape = [n_samples, ]|
|**X_valid**|Features used to evaluate a model in an iteration. (sparse) array-like, shape = [n_samples, n_features]|
|**y_valid**|Target values used to evaluate a model in an iteration. (sparse) array-like, shape = [n_samples, ]|
|**time_column_name**|Name of the column in X that holds the timestamps (here, `timeStamp`).|
|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|
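
To give a feel for what a normalized primary metric measures, here is a minimal NumPy sketch of a normalized root mean squared error. It assumes the normalization divides the RMSE by the range (max minus min) of the true values; the SDK's exact computation may differ, and `normalized_rmse` is a hypothetical helper, not an Azure ML function.

```python
import numpy as np


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range of the true values (assumed normalization)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    value_range = y_true.max() - y_true.min()
    if value_range == 0:
        raise ValueError("y_true is constant, so its range is zero.")
    return rmse / value_range


# Made-up values: every forecast is off by 10 and the actuals span 300.
actual = [100.0, 200.0, 300.0, 400.0]
forecast = [110.0, 190.0, 310.0, 390.0]
nrmse = normalized_rmse(actual, forecast)
print(round(nrmse, 4))
```

Dividing by the range makes the score unitless, so it can be compared across series measured on different scales.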

In [None]:
time_column_name = 'timeStamp'
automl_settings = {
    "time_column_name": time_column_name,
}


automl_config = AutoMLConfig(task = 'forecasting',
                             debug_log = 'automl_nyc_energy_errors.log',
                             primary_metric='normalized_root_mean_squared_error',
                             iterations = 10,
                             iteration_timeout_minutes = 5,
                             X = X_train,
                             y = y_train,
                             X_valid = X_valid,
                             y_valid = y_valid,
                             path=project_folder,
                             verbosity = logging.INFO,
                            **automl_settings)
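
Since `time_column_name` tells AutoML which column carries the timestamps, it can be useful to look at that column before submitting the experiment. The sketch below is an optional sanity check, not something the SDK requires. `check_time_column` is a hypothetical helper, and the toy frame (with one deliberately duplicated timestamp) is made up for illustration.

```python
import pandas as pd


def check_time_column(df, time_col):
    """Hypothetical helper: return simple diagnostics for a timestamp column."""
    stamps = df[time_col]
    steps = stamps.diff().dropna()
    return {
        "is_datetime": pd.api.types.is_datetime64_any_dtype(stamps),
        "is_sorted": bool(stamps.is_monotonic_increasing),
        "n_duplicates": int(stamps.duplicated().sum()),
        "n_missing": int(stamps.isna().sum()),
        # Most frequent gap between consecutive rows, e.g. one hour.
        "most_common_step": steps.value_counts().idxmax(),
    }


toy = pd.DataFrame({
    "timeStamp": pd.to_datetime([
        "2017-01-01 00:00",
        "2017-01-01 01:00",
        "2017-01-01 02:00",
        "2017-01-01 02:00",  # duplicated on purpose
    ]),
    "demand": [5000.0, 5100.0, 5050.0, 5050.0],
})

report = check_time_column(toy, "timeStamp")
for name, value in report.items():
    print(name, value)
```

In this notebook the equivalent call would be `check_time_column(X_train, time_column_name)`.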

## Training the Model

Call the submit method on the experiment object and pass it the run configuration. For local runs the execution is synchronous. Depending on the data and the number of iterations, this can run for a while.
You will see the currently running iterations printed to the console.

In [None]:
local_run = experiment.submit(automl_config, show_output=True)

### Retrieve the Best Model
Below we select the best pipeline from our iterations. The get_output method on local_run returns the best run and the fitted model for the last fit invocation. There are overloads on get_output that allow you to retrieve the best run and fitted model for any logged metric or a particular iteration.

In [None]:
best_run, fitted_model = local_run.get_output()
fitted_model.steps

### Test the Best Fitted Model

Predict on the test set. The predictions are compared with the actual values further below.

In [None]:
y_pred = fitted_model.predict(X_test)
y_pred

### Define a Check Data Function

This helper validates the inputs and removes NaN values, so that the metric calculations do not fail.

In [None]:
def _check_calc_input(y_true, y_pred, rm_na=True):
    """
    Check that 'y_true' and 'y_pred' are non-empty and
    have equal length.

    :param y_true: Vector of actual values
    :type y_true: array-like

    :param y_pred: Vector of predicted values
    :type y_pred: array-like

    :param rm_na:
        If rm_na=True, remove every position where y_true or y_pred is NaN.
    :type rm_na: boolean

    :return:
        Tuple (y_true, y_pred). If rm_na=True,
        the returned vectors may be shorter than their inputs.
    :rtype: Tuple with 2 entries
    """
    if len(y_true) != len(y_pred):
        raise ValueError(
            'the true values and prediction values do not have equal length.')
    elif len(y_true) == 0:
        raise ValueError(
            'y_true and y_pred are empty.')
    # if there is any non-numeric element in the y_true or y_pred,
    # the ValueError exception will be thrown.
    y_true = np.array(y_true).astype(float)
    y_pred = np.array(y_pred).astype(float)
    if rm_na:
        # remove entries both in y_true and y_pred where at least
        # one element in y_true or y_pred is missing
        y_true_rm_na = y_true[~(np.isnan(y_true) | np.isnan(y_pred))]
        y_pred_rm_na = y_pred[~(np.isnan(y_true) | np.isnan(y_pred))]
        return (y_true_rm_na, y_pred_rm_na)
    else:
        return y_true, y_pred

### Apply the Check Data Function

Drop the positions where y_test or y_pred is NaN before calculating metrics.

In [None]:
y_test, y_pred = _check_calc_input(y_test, y_pred)
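
The masking step inside `_check_calc_input` can be seen in isolation on a toy pair of arrays. This is only an illustration with made-up values; the names ending in `_demo` and `_clean` are not used elsewhere in the notebook.

```python
import numpy as np

# Made-up values: position 1 is missing in the truth, position 2 in the prediction.
y_true_demo = np.array([1.0, np.nan, 3.0, 4.0])
y_pred_demo = np.array([1.1, 2.0, np.nan, 3.9])

# Keep only the positions where both vectors have a value.
keep = ~(np.isnan(y_true_demo) | np.isnan(y_pred_demo))
y_true_clean = y_true_demo[keep]
y_pred_clean = y_pred_demo[keep]

print(y_true_clean, y_pred_clean)
```

Applying one shared mask to both vectors keeps them aligned, so each remaining prediction is still paired with its own actual value.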

### Calculate metrics for the prediction


In [None]:
print("[Test Data] \nRoot mean squared error: %.2f" % np.sqrt(mean_squared_error(y_test, y_pred)))
print('Mean absolute error: %.2f' % mean_absolute_error(y_test, y_pred))
# R2 score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y_test, y_pred))



# Plot outputs
%matplotlib notebook
test_pred = plt.scatter(y_test, y_pred, color='b')
test_test = plt.scatter(y_test, y_test, color='g')
plt.legend((test_pred, test_test), ('prediction', 'truth'), loc='upper left', fontsize=8)
plt.show()
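
For readers who want to see what the three reported scores measure, here is a minimal NumPy sketch that computes them from the residuals. `regression_scores` is a hypothetical helper, not part of scikit-learn or the Azure ML SDK, and the input values are made up. The R2 formula used is the usual 1 - SS_res / SS_tot, which is undefined when the actual values are all equal.

```python
import numpy as np


def regression_scores(y_true, y_pred):
    """Hypothetical helper: RMSE, MAE and R2 computed directly from residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
        "r2": float(1.0 - ss_res / ss_tot),
    }


# Made-up values for illustration only.
scores = regression_scores([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
for name, value in scores.items():
    print("%s: %.4f" % (name, value))
```

RMSE penalizes large errors more heavily than MAE does, so a gap between the two hints at a few large misses rather than a uniform error.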