{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automated Machine Learning\n",
"_**Orange Juice Sales Forecasting**_\n",
"\n",
"## Contents\n",
"1. [Introduction](#Introduction)\n",
"1. [Setup](#Setup)\n",
"1. [Data](#Data)\n",
"1. [Modeling](#Modeling)\n",
"1. [Train](#Train)\n",
"1. [Forecasting](#Forecasting)\n",
"1. [Evaluate](#Evaluate)\n",
"1. [Operationalize](#Operationalize)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"In this example, we use AutoML to train, select, and operationalize a time-series forecasting model for multiple time-series.\n",
"\n",
"Make sure you have executed the [configuration notebook](../../../configuration.ipynb) before running this notebook.\n",
"\n",
"The examples in the follow code samples use the University of Chicago's Dominick's Finer Foods dataset to forecast orange juice sales. Dominick's was a grocery chain in the Chicago metropolitan area."
|
|
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.core\n",
"import pandas as pd\n",
"import numpy as np\n",
"import logging\n",
"import warnings\n",
"\n",
"# Squash warning messages for cleaner output in the notebook\n",
"warnings.showwarning = lambda *args, **kwargs: None\n",
"\n",
"from azureml.core.workspace import Workspace\n",
"from azureml.core.experiment import Experiment\n",
"from azureml.train.automl import AutoMLConfig\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As part of the setup you have already created a <b>Workspace</b>. To run AutoML, you also need to create an <b>Experiment</b>. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem. "
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# choose a name for the run history container in the workspace\n",
"experiment_name = 'automl-ojforecasting'\n",
"# project folder\n",
"project_folder = './sample_projects/automl-local-ojforecasting'\n",
"\n",
"experiment = Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output['SDK version'] = azureml.core.VERSION\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Project Directory'] = project_folder\n",
"output['Run History Name'] = experiment_name\n",
"pd.set_option('display.max_colwidth', -1)\n",
"outputDf = pd.DataFrame(data = output, index = [''])\n",
"outputDf.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"You are now ready to load the historical orange juice sales data. We will load the CSV file into a plain pandas DataFrame; the time column in the CSV is called _WeekStarting_, so it will be specially parsed into the datetime type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time_column_name = 'WeekStarting'\n",
"data = pd.read_csv(\"dominicks_OJ.csv\", parse_dates=[time_column_name])\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each row in the DataFrame holds a quantity of weekly sales for an OJ brand at a single store. The data also includes the sales price, a flag indicating if the OJ brand was advertised in the store that week, and some customer demographic information based on the store location. For historical reasons, the data also include the logarithm of the sales quantity. The Dominick's grocery data is commonly used to illustrate econometric modeling techniques where logarithms of quantities are generally preferred. \n",
|
|
"\n",
|
|
"The task is now to build a time-series model for the _Quantity_ column. It is important to note that this dataset is comprised of many individual time-series - one for each unique combination of _Store_ and _Brand_. To distinguish the individual time-series, we thus define the **grain** - the columns whose values determine the boundaries between time-series: "
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"grain_column_names = ['Store', 'Brand']\n",
"nseries = data.groupby(grain_column_names).ngroups\n",
"print('Data contains {0} individual time-series.'.format(nseries))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For demonstration purposes, we extract sales time-series for just a few of the stores:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"use_stores = [2, 5, 8]\n",
"data_subset = data[data.Store.isin(use_stores)]\n",
"nseries = data_subset.groupby(grain_column_names).ngroups\n",
"print('Data subset contains {0} individual time-series.'.format(nseries))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Splitting\n",
"We now split the data into a training and a testing set for later forecast evaluation. The test set will contain the final 20 weeks of observed sales for each time-series. The splits should be stratified by series, so we use a group-by statement on the grain columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n_test_periods = 20\n",
"\n",
"def split_last_n_by_grain(df, n):\n",
"    \"\"\"Group df by grain and split on last n rows for each group.\"\"\"\n",
"    df_grouped = (df.sort_values(time_column_name) # Sort by ascending time\n",
"                  .groupby(grain_column_names, group_keys=False))\n",
"    df_head = df_grouped.apply(lambda dfg: dfg.iloc[:-n])\n",
"    df_tail = df_grouped.apply(lambda dfg: dfg.iloc[-n:])\n",
"    return df_head, df_tail\n",
"\n",
"X_train, X_test = split_last_n_by_grain(data_subset, n_test_periods)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modeling\n",
"\n",
"For forecasting tasks, AutoML uses pre-processing and estimation steps that are specific to time-series. AutoML will undertake the following pre-processing steps:\n",
|
|
"* Detect time-series sample frequency (e.g. hourly, daily, weekly) and create new records for absent time points to make the series regular. A regular time series has a well-defined frequency and has a value at every sample point in a contiguous time span \n",
|
|
"* Impute missing values in the target (via forward-fill) and feature columns (using median column values) \n",
|
|
"* Create grain-based features to enable fixed effects across different series\n",
|
|
"* Create time-based features to assist in learning seasonal patterns\n",
|
|
"* Encode categorical variables to numeric quantities\n",
|
|
"\n",
|
|
"AutoML will currently train a single, regression-type model across **all** time-series in a given training set. This allows the model to generalize across related series.\n",
|
|
"\n",
|
|
"You are almost ready to start an AutoML training job. First, we need to separate the target column from the rest of the DataFrame: "
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"target_column_name = 'Quantity'\n",
"y_train = X_train.pop(target_column_name).values"
]
},
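{
"cell_type": "markdown",
"metadata": {},
"source": [
"As promised above, here is a toy pandas sketch of the frequency-regularization and imputation steps. The `toy` DataFrame and its missing week are made up for this illustration; this is not AutoML's internal implementation, which runs these steps automatically during training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustration only: regularize a weekly series and impute missing values,\n",
"# roughly as described in the pre-processing list above.\n",
"# The 'toy' frame and its absent week are hypothetical.\n",
"toy = pd.DataFrame({\n",
"    'WeekStarting': pd.to_datetime(['1990-06-14', '1990-06-21', '1990-07-05']),  # 1990-06-28 is absent\n",
"    'Quantity': [10560.0, 8000.0, 6400.0],\n",
"    'Price': [2.59, np.nan, 2.69]\n",
"})\n",
"\n",
"# Detect the weekly (Thursday) frequency and create a record for the absent week\n",
"toy = toy.set_index('WeekStarting').asfreq('W-THU')\n",
"\n",
"# Impute: forward-fill the target, median-fill the feature column\n",
"toy['Quantity'] = toy['Quantity'].ffill()\n",
"toy['Price'] = toy['Price'].fillna(toy['Price'].median())\n",
"toy"
]
},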
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train\n",
"\n",
"The AutoMLConfig object defines the settings and data for an AutoML training job. Here, we set necessary inputs like the task type, the number of AutoML iterations to try, the training data, and cross-validation parameters. \n",
"\n",
"For forecasting tasks, there are some additional parameters that can be set: the name of the column holding the date/time, the grain column names, and the maximum forecast horizon. A time column is required for forecasting, while the grain is optional. If a grain is not given, AutoML assumes that the whole dataset is a single time-series. We also pass a list of columns to drop prior to modeling. The _logQuantity_ column is completely correlated with the target quantity, so it must be removed to prevent target leakage.\n",
"\n",
"The forecast horizon is given in units of the time-series frequency; for instance, the OJ series frequency is weekly, so a horizon of 20 means that a trained model will estimate sales up to 20 weeks beyond the latest date in the training data for each series. In this example, we set the maximum horizon to the number of samples per series in the test set (n_test_periods). Generally, the value of this parameter will be dictated by business needs. For example, a demand planning organization that needs to estimate the next month of sales would set the horizon accordingly. Please see the [energy_demand notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand) for more discussion of forecast horizon.\n",
"\n",
"Finally, a note about the cross-validation (CV) procedure for time-series data. AutoML uses out-of-sample error estimates to select a best pipeline/model, so it is important that the CV fold splitting is done correctly. Time-series can violate the basic statistical assumptions of the canonical K-Fold CV strategy, so AutoML implements a [rolling origin validation](https://robjhyndman.com/hyndsight/tscv/) procedure to create CV folds for time-series data. To use this procedure, you just need to specify the desired number of CV folds in the AutoMLConfig object. It is also possible to bypass CV and use your own validation set by setting the *X_valid* and *y_valid* parameters of AutoMLConfig. A toy sketch of rolling origin folds appears just before the configuration cell below.\n",
"\n",
"Here is a summary of AutoMLConfig parameters used for training the OJ model:\n",
"\n",
"|Property|Description|\n",
"|-|-|\n",
"|**task**|forecasting|\n",
"|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|\n",
"|**iterations**|Number of iterations. In each iteration, AutoML trains a specific pipeline on the given data|\n",
"|**X**|Training matrix of features as a pandas DataFrame, shape = [n_training_samples, n_features]|\n",
"|**y**|Target values as a numpy.ndarray, shape = [n_training_samples, ]|\n",
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection|\n",
"|**enable_ensembling**|Allow AutoML to create ensembles of the best performing models|\n",
"|**debug_log**|Log file path for writing debugging information|\n",
"|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|\n",
"|**time_column_name**|Name of the datetime column in the input data|\n",
"|**grain_column_names**|Name(s) of the columns defining individual series in the input data|\n",
"|**drop_column_names**|Name(s) of columns to drop prior to modeling|\n",
"|**max_horizon**|Maximum desired forecast horizon in units of time-series frequency|"
]
},
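{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is a toy sketch of how rolling origin validation folds can be constructed for a single series: each fold trains on an expanding history and validates on the window that immediately follows. The fold arithmetic here (sample count, horizon, fold count) is made up for illustration; it is not AutoML's internal fold logic, and with AutoML you only set `n_cross_validations`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy illustration of rolling origin validation on one time-series.\n",
"# The numbers below are hypothetical; AutoML builds its own folds internally.\n",
"n_samples = 12   # observations in one series, oldest to newest\n",
"horizon = 2      # validation window per fold\n",
"n_folds = 3\n",
"\n",
"for fold in range(n_folds):\n",
"    valid_end = n_samples - fold * horizon   # origin rolls back one horizon per fold\n",
"    valid_start = valid_end - horizon\n",
"    print('Fold {}: train on samples [0, {}), validate on [{}, {})'.format(\n",
"        fold, valid_start, valid_start, valid_end))"
]
},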
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time_series_settings = {\n",
"    'time_column_name': time_column_name,\n",
"    'grain_column_names': grain_column_names,\n",
"    'drop_column_names': ['logQuantity'],\n",
"    'max_horizon': n_test_periods\n",
"}\n",
"\n",
"automl_config = AutoMLConfig(task='forecasting',\n",
"                             debug_log='automl_oj_sales_errors.log',\n",
"                             primary_metric='normalized_mean_absolute_error',\n",
"                             iterations=10,\n",
"                             X=X_train,\n",
"                             y=y_train,\n",
"                             n_cross_validations=3,\n",
"                             enable_ensembling=False,\n",
"                             path=project_folder,\n",
"                             verbosity=logging.INFO,\n",
"                             **time_series_settings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can now submit a new training run. For local runs, the execution is synchronous. Depending on the data and number of iterations this operation may take several minutes.\n",
|
|
"Information from each iteration will be printed to the console."
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_run = experiment.submit(automl_config, show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve the Best Model\n",
"Each run within an Experiment stores serialized (i.e. pickled) pipelines from the AutoML iterations. We can now retrieve the pipeline with the best performance on the validation dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run, fitted_pipeline = local_run.get_output()\n",
"fitted_pipeline.steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Forecasting\n",
"\n",
"Now that we have retrieved the best pipeline/model, it can be used to make predictions on test data. First, we remove the target values from the test set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_test = X_test.pop(target_column_name).values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_test.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To produce predictions on the test set, we need to know the feature values at all dates in the test set. This requirement is somewhat reasonable for the OJ sales data since the features mainly consist of price, which is usually set in advance, and customer demographics which are approximately constant for each store over the 20 week forecast horizon in the testing data. \n",
|
|
"\n",
|
|
"We will first create a query `y_query`, which is aligned index-for-index to `X_test`. This is a vector of target values where each `NaN` serves the function of the question mark to be replaced by forecast. Passing definite values in the `y` argument allows the `forecast` function to make predictions on data that does not immediately follow the train data which contains `y`. In each grain, the last time point where the model sees a definite value of `y` is that grain's _forecast origin_."
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Replace ALL values in y_query by NaN.\n",
"# The forecast origin will be at the beginning of the first forecast period.\n",
"# (Which is the same time as the end of the last training period.)\n",
"y_query = y_test.copy().astype(float)\n",
"y_query.fill(np.nan)\n",
"# The featurized data, aligned to y, will also be returned.\n",
"# This contains the assumptions that were made in the forecast\n",
"# and helps align the forecast to the original data.\n",
"y_pred, X_trans = fitted_pipeline.forecast(X_test, y_query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are used to scikit-learn pipelines, perhaps you expected `predict(X_test)`. However, forecasting requires a more general interface that also supplies the past target `y` values. Please use `forecast(X, y)`, as `predict(X)` is reserved for internal purposes on forecasting models. A short illustration of this interface follows below.\n",
"\n",
"The [energy demand forecasting notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand) demonstrates the use of the forecast function in more detail in the context of using lags and rolling window features. "
]
},
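{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the `forecast(X, y)` contract concrete, here is a hypothetical variation on the query built above: suppose the first few test periods per series had already been observed. Filling those actuals into the query moves each grain's forecast origin forward, so the model only forecasts the remaining `NaN` periods. The value of `n_known` is made up for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch: a query with partially-known actuals.\n",
"# Assumes rows within each series are in ascending time order,\n",
"# as produced by split_last_n_by_grain above.\n",
"n_known = 5  # assumed number of already-observed test periods (illustration only)\n",
"\n",
"y_query_partial = y_test.copy().astype(float)\n",
"y_query_partial.fill(np.nan)\n",
"\n",
"# Mark the first n_known periods of each series as known actuals\n",
"known_mask = (X_test.groupby(grain_column_names).cumcount() < n_known).values\n",
"y_query_partial[known_mask] = y_test[known_mask]\n",
"\n",
"# The forecast origin of each grain now sits n_known periods into the test set\n",
"y_pred_partial, X_trans_partial = fitted_pipeline.forecast(X_test, y_query_partial)"
]
},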
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate\n",
"\n",
"To evaluate the accuracy of the forecast, we'll compare against the actual sales quantities using selected metrics, including the mean absolute percentage error (MAPE). \n",
"\n",
"It is a good practice to always align the output explicitly to the input, as the count and order of the rows may have changed during transformations that span multiple rows."
]
},
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def align_outputs(y_predicted, X_trans, X_test, y_test, predicted_column_name = 'predicted'):\n",
"    \"\"\"\n",
"    Demonstrates how to get the output aligned to the inputs\n",
"    using pandas indexes. Helps understand what happened if\n",
"    the output's shape differs from the input shape, or if\n",
"    the data got re-sorted by time and grain during forecasting.\n",
"    \n",
"    Typical causes of misalignment are:\n",
"    * we predicted some periods that were missing in actuals -> drop from eval\n",
"    * model was asked to predict past max_horizon -> increase max horizon\n",
"    * data at start of X_test was needed for lags -> provide previous periods in y\n",
"    \"\"\"\n",
"    \n",
"    df_fcst = pd.DataFrame({predicted_column_name : y_predicted})\n",
"    # y and X outputs are aligned by forecast() function contract\n",
"    df_fcst.index = X_trans.index\n",
"    \n",
"    # align original X_test to y_test \n",
"    X_test_full = X_test.copy()\n",
"    X_test_full[target_column_name] = y_test\n",
"\n",
"    # X_test_full's index does not include origin, so reset for merge\n",
"    df_fcst.reset_index(inplace=True)\n",
"    X_test_full = X_test_full.reset_index().drop(columns='index')\n",
"    together = df_fcst.merge(X_test_full, how='right')\n",
"    \n",
"    # drop rows where prediction or actuals are nan\n",
"    # happens because of missing actuals\n",
"    # or at edges of time due to lags/rolling windows\n",
"    clean = together[together[[target_column_name, predicted_column_name]].notnull().all(axis=1)]\n",
"    return clean\n",
"\n",
"df_all = align_outputs(y_pred, X_trans, X_test, y_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def MAPE(actual, pred):\n",
"    \"\"\"\n",
"    Calculate mean absolute percentage error.\n",
"    Remove NA and values where actual is close to zero.\n",
"    \"\"\"\n",
"    not_na = ~(np.isnan(actual) | np.isnan(pred))\n",
"    not_zero = ~np.isclose(actual, 0.0)\n",
"    actual_safe = actual[not_na & not_zero]\n",
"    pred_safe = pred[not_na & not_zero]\n",
"    APE = 100*np.abs((actual_safe - pred_safe)/actual_safe)\n",
"    return np.mean(APE)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Simple forecasting model\")\n",
"rmse = np.sqrt(mean_squared_error(df_all[target_column_name], df_all['predicted']))\n",
"print(\"[Test Data] \\nRoot Mean squared error: %.2f\" % rmse)\n",
"mae = mean_absolute_error(df_all[target_column_name], df_all['predicted'])\n",
"print('mean_absolute_error score: %.2f' % mae)\n",
"print('MAPE: %.2f' % MAPE(df_all[target_column_name], df_all['predicted']))\n",
"\n",
"# Plot predicted vs. actual quantities; the green diagonal marks a perfect forecast\n",
"import matplotlib.pyplot as plt\n",
"\n",
"%matplotlib notebook\n",
"test_pred = plt.scatter(df_all[target_column_name], df_all['predicted'], color='b')\n",
"test_test = plt.scatter(y_test, y_test, color='g')\n",
"plt.legend((test_pred, test_test), ('prediction', 'truth'), loc='upper left', fontsize=8)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Operationalize"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Operationalization_ means getting the model into the cloud so that other can run it after you close the notebook. We will create a docker running on Azure Container Instances with the model."
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"description = 'AutoML OJ forecaster'\n",
"tags = None\n",
"model = local_run.register_model(description = description, tags = tags)\n",
"\n",
"print(local_run.model_id)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Develop the scoring script\n",
"\n",
"Serializing and deserializing complex data frames may be tricky. We first develop the `run()` function of the scoring script locally, then write it into a scoring script. It is much easier to debug any quirks of the scoring function without crossing two compute environments. For this exercise, we handle a common quirk of how pandas dataframes serialize time stamp values."
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# this is where we test the run function of the scoring script interactively\n",
"# before putting it in the scoring script\n",
"\n",
"timestamp_columns = ['WeekStarting']\n",
"\n",
"def run(rawdata, test_model = None):\n",
"    \"\"\"\n",
"    Intended to process the 'rawdata' string produced by\n",
"    \n",
"    {'X': X_test.to_json(), 'y' : y_test.to_json()}\n",
"    \n",
"    Don't convert the X payload to numpy.array, use it as pandas.DataFrame\n",
"    \"\"\"\n",
"    try:\n",
"        # unpack the data frame with timestamp \n",
"        rawobj = json.loads(rawdata) # rawobj is now a dict of strings \n",
"        X_pred = pd.read_json(rawobj['X'], convert_dates=False) # load the pandas DF from a json string\n",
"        for col in timestamp_columns: # fix timestamps\n",
"            X_pred[col] = pd.to_datetime(X_pred[col], unit='ms') \n",
"        \n",
"        y_pred = np.array(rawobj['y']) # reconstitute numpy array from serialized list\n",
"        \n",
"        if test_model is None:\n",
"            result = model.forecast(X_pred, y_pred) # use the global model from init function\n",
"        else:\n",
"            result = test_model.forecast(X_pred, y_pred) # use the model on which we are testing\n",
"        \n",
"    except Exception as e:\n",
"        result = str(e)\n",
"        return json.dumps({\"error\": result})\n",
"    \n",
"    forecast_as_list = result[0].tolist()\n",
"    index_as_df = result[1].index.to_frame().reset_index(drop=True)\n",
"    \n",
"    return json.dumps({\"forecast\": forecast_as_list, # return the minimum over the wire:\n",
"                       \"index\": index_as_df.to_json() # the forecast and its index, not the featurized data\n",
"                      })"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# test the run function here before putting in the scoring script\n",
"import json\n",
"\n",
"test_sample = json.dumps({'X': X_test.to_json(), 'y' : y_query.tolist()})\n",
"response = run(test_sample, fitted_pipeline)\n",
"\n",
"# unpack the response, dealing with the timestamp serialization again\n",
"res_dict = json.loads(response)\n",
"y_fcst_all = pd.read_json(res_dict['index'])\n",
"y_fcst_all[time_column_name] = pd.to_datetime(y_fcst_all[time_column_name], unit = 'ms')\n",
"y_fcst_all['forecast'] = res_dict['forecast']\n",
"y_fcst_all.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the function works locally in the notebook, let's write it into the scoring script. The scoring script is authored by the data scientist. Adjust it to taste, adding inputs, outputs and processing as needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile score_fcast.py\n",
"import pickle\n",
"import json\n",
"import numpy as np\n",
"import pandas as pd\n",
"import azureml.train.automl\n",
"from sklearn.externals import joblib\n",
"from azureml.core.model import Model\n",
"\n",
"\n",
"def init():\n",
"    global model\n",
"    model_path = Model.get_model_path(model_name = '<<modelid>>') # this name is the model.id of the model we want to deploy\n",
"    # deserialize the model file back into a sklearn model\n",
"    model = joblib.load(model_path)\n",
"\n",
"timestamp_columns = ['WeekStarting']\n",
"\n",
"def run(rawdata, test_model = None):\n",
"    \"\"\"\n",
"    Intended to process the 'rawdata' string produced by\n",
"    \n",
"    {'X': X_test.to_json(), 'y' : y_test.to_json()}\n",
"    \n",
"    Don't convert the X payload to numpy.array, use it as pandas.DataFrame\n",
"    \"\"\"\n",
"    try:\n",
"        # unpack the data frame with timestamp \n",
"        rawobj = json.loads(rawdata) # rawobj is now a dict of strings \n",
"        X_pred = pd.read_json(rawobj['X'], convert_dates=False) # load the pandas DF from a json string\n",
"        for col in timestamp_columns: # fix timestamps\n",
"            X_pred[col] = pd.to_datetime(X_pred[col], unit='ms') \n",
"        \n",
"        y_pred = np.array(rawobj['y']) # reconstitute numpy array from serialized list\n",
"        \n",
"        if test_model is None:\n",
"            result = model.forecast(X_pred, y_pred) # use the global model from init function\n",
"        else:\n",
"            result = test_model.forecast(X_pred, y_pred) # use the model on which we are testing\n",
"        \n",
"    except Exception as e:\n",
"        result = str(e)\n",
"        return json.dumps({\"error\": result})\n",
"    \n",
"    # prepare to send over wire as json\n",
"    forecast_as_list = result[0].tolist()\n",
"    index_as_df = result[1].index.to_frame().reset_index(drop=True)\n",
"    \n",
"    return json.dumps({\"forecast\": forecast_as_list, # return the minimum over the wire:\n",
"                       \"index\": index_as_df.to_json() # the forecast and its index, not the featurized data\n",
"                      })"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get the model\n",
"from azureml.train.automl.run import AutoMLRun\n",
"\n",
"experiment = Experiment(ws, experiment_name)\n",
"ml_run = AutoMLRun(experiment = experiment, run_id = local_run.id)\n",
"best_iteration = int(str.split(best_run.id,'_')[-1]) # the iteration number is a postfix of the run ID."
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get the best model's dependencies and write them into this file\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"conda_env_file_name = 'fcast_env.yml'\n",
"\n",
"dependencies = ml_run.get_run_sdk_dependencies(iteration = best_iteration)\n",
"for p in ['azureml-train-automl', 'azureml-sdk', 'azureml-core']:\n",
"    print('{}\\t{}'.format(p, dependencies[p]))\n",
"\n",
"myenv = CondaDependencies.create(conda_packages=['numpy','scikit-learn'], pip_packages=['azureml-sdk[automl]'])\n",
"\n",
"myenv.save_to_file('.', conda_env_file_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# this is the script file name we wrote a few cells above\n",
"script_file_name = 'score_fcast.py'\n",
"\n",
"# Substitute the actual version number in the environment file.\n",
"# This is not strictly needed in this notebook because the model should have been generated using the current SDK version.\n",
"# However, we include this in case this code is used on an experiment from a previous SDK version.\n",
"\n",
"with open(conda_env_file_name, 'r') as cefr:\n",
"    content = cefr.read()\n",
"\n",
"with open(conda_env_file_name, 'w') as cefw:\n",
"    cefw.write(content.replace(azureml.core.VERSION, dependencies['azureml-sdk']))\n",
"\n",
"# Substitute the actual model id in the script file.\n",
"\n",
"with open(script_file_name, 'r') as cefr:\n",
"    content = cefr.read()\n",
"\n",
"with open(script_file_name, 'w') as cefw:\n",
"    cefw.write(content.replace('<<modelid>>', local_run.model_id))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a Container Image"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.image import Image, ContainerImage\n",
"\n",
"image_config = ContainerImage.image_configuration(runtime= \"python\",\n",
"                                 execution_script = script_file_name,\n",
"                                 conda_file = conda_env_file_name,\n",
"                                 tags = {'type': \"automl-forecasting\"},\n",
"                                 description = \"Image for automl forecasting sample\")\n",
"\n",
"image = Image.create(name = \"automl-fcast-image\",\n",
"                     # this is the model object \n",
"                     models = [model],\n",
"                     image_config = image_config, \n",
"                     workspace = ws)\n",
"\n",
"image.wait_for_creation(show_output = True)\n",
"\n",
"if image.creation_state == 'Failed':\n",
"    print(\"Image build log at: \" + image.image_build_log_uri)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Deploy the Image as a Web Service on Azure Container Instance"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.webservice import AciWebservice\n",
"\n",
"aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1, \n",
"                                               memory_gb = 2, \n",
"                                               tags = {'type': \"automl-forecasting\"},\n",
"                                               description = \"Automl forecasting sample service\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.webservice import Webservice\n",
"\n",
"aci_service_name = 'automl-forecast-01'\n",
"print(aci_service_name)\n",
"\n",
"aci_service = Webservice.deploy_from_image(deployment_config = aciconfig,\n",
"                                           image = image,\n",
"                                           name = aci_service_name,\n",
"                                           workspace = ws)\n",
"aci_service.wait_for_deployment(True)\n",
"print(aci_service.state)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Call the service"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# we send the data to the service serialized into a json string\n",
"test_sample = json.dumps({'X': X_test.to_json(), 'y' : y_query.tolist()})\n",
"response = aci_service.run(input_data = test_sample)\n",
"\n",
"# translate from networkese to datascientese\n",
"try: \n",
|
|
" res_dict = json.loads(response)\n",
|
|
" y_fcst_all = pd.read_json(res_dict['index'])\n",
|
|
" y_fcst_all[time_column_name] = pd.to_datetime(y_fcst_all[time_column_name], unit = 'ms')\n",
|
|
" y_fcst_all['forecast'] = res_dict['forecast'] \n",
|
|
"except:\n",
|
|
" print(res_dict)"
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_fcst_all.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Delete the web service if desired"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"serv = Webservice(ws, 'automl-forecast-01')\n",
"# serv.delete() # don't do it accidentally"
]
}
],
"metadata": {
"authors": [
{
"name": "erwright, tosingli"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}