mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-25 01:00:11 -05:00
834 lines
37 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
|
|
"\n",
|
|
"Licensed under the MIT License."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Automated Machine Learning\n",
|
|
"_**Orange Juice Sales Forecasting**_\n",
|
|
"\n",
|
|
"## Contents\n",
|
|
"1. [Introduction](#introduction)\n",
|
|
"1. [Setup](#setup)\n",
|
|
"1. [Compute](#compute)\n",
|
|
"1. [Data](#data)\n",
|
|
"1. [Train](#train)\n",
|
|
"1. [Forecast](#forecast)\n",
|
|
"1. [Operationalize](#operationalize)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Introduction<a id=\"introduction\"></a>\n",
|
|
"In this example, we use AutoML to train, select, and operationalize a time-series forecasting model for multiple time-series.\n",
|
|
"\n",
|
|
"Make sure you have executed the [configuration notebook](../../../configuration.ipynb) before running this notebook.\n",
|
|
"\n",
"The code samples that follow use the University of Chicago's Dominick's Finer Foods dataset to forecast orange juice sales. Dominick's was a grocery chain in the Chicago metropolitan area."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Setup<a id=\"setup\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import json\n",
|
|
"import logging\n",
|
|
"\n",
|
|
"import azureml.core\n",
|
|
"import pandas as pd\n",
|
|
"from azureml.automl.core.featurization import FeaturizationConfig\n",
|
|
"from azureml.core.experiment import Experiment\n",
|
|
"from azureml.core.workspace import Workspace\n",
|
|
"from azureml.train.automl import AutoMLConfig\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"This sample notebook may use features that are not available in previous versions of the Azure ML SDK."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(\"This notebook was created using version 1.39.0 of the Azure ML SDK\")\n",
|
|
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"As part of the setup you have already created a <b>Workspace</b>. To run AutoML, you also need to create an <b>Experiment</b>. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"ws = Workspace.from_config()\n",
|
|
"\n",
|
|
"# choose a name for the run history container in the workspace\n",
|
|
"experiment_name = \"automl-ojforecasting\"\n",
|
|
"\n",
|
|
"experiment = Experiment(ws, experiment_name)\n",
|
|
"\n",
|
|
"output = {}\n",
|
|
"output[\"Subscription ID\"] = ws.subscription_id\n",
|
|
"output[\"Workspace\"] = ws.name\n",
|
|
"output[\"SKU\"] = ws.sku\n",
|
|
"output[\"Resource Group\"] = ws.resource_group\n",
|
|
"output[\"Location\"] = ws.location\n",
|
|
"output[\"Run History Name\"] = experiment_name\n",
|
|
"pd.set_option(\"display.max_colwidth\", None)\n",
|
|
"outputDf = pd.DataFrame(data=output, index=[\"\"])\n",
|
|
"outputDf.T"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Compute<a id=\"compute\"></a>\n",
|
|
"You will need to create a [compute target](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.\n",
|
|
"\n",
|
|
"> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.\n",
|
|
"\n",
|
|
"#### Creation of AmlCompute takes approximately 5 minutes. \n",
"If an AmlCompute cluster with that name already exists in your workspace, this code skips the creation process.\n",
|
|
"As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
|
|
"from azureml.core.compute_target import ComputeTargetException\n",
|
|
"\n",
|
|
"# Choose a name for your CPU cluster\n",
|
|
"amlcompute_cluster_name = \"oj-cluster\"\n",
|
|
"\n",
|
|
"# Verify that cluster does not exist already\n",
"try:\n",
"    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)\n",
"    print(\"Found existing cluster, use it.\")\n",
"except ComputeTargetException:\n",
"    compute_config = AmlCompute.provisioning_configuration(\n",
"        vm_size=\"STANDARD_D12_V2\", max_nodes=6\n",
"    )\n",
"    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)\n",
|
|
"\n",
|
|
"compute_target.wait_for_completion(show_output=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Data<a id=\"data\"></a>\n",
|
|
"You are now ready to load the historical orange juice sales data. We will load the CSV file into a plain pandas DataFrame; the time column in the CSV is called _WeekStarting_, so it will be specially parsed into the datetime type."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"time_column_name = \"WeekStarting\"\n",
|
|
"data = pd.read_csv(\"dominicks_OJ.csv\", parse_dates=[time_column_name])\n",
|
|
"\n",
"# Drop the 'logQuantity' column because it is a leaky feature.\n",
|
|
"data.drop(\"logQuantity\", axis=1, inplace=True)\n",
|
|
"\n",
|
|
"data.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"Each row in the DataFrame holds the weekly sales quantity for an OJ brand at a single store. The data also includes the sales price, a flag indicating whether the OJ brand was advertised in the store that week, and some customer demographic information based on the store location. For historical reasons, the raw file also includes the logarithm of the sales quantity (dropped in the previous cell). The Dominick's grocery data is commonly used to illustrate econometric modeling techniques, where logarithms of quantities are generally preferred. \n",
|
|
"\n",
"The task is now to build a time-series model for the _Quantity_ column. Note that this dataset is composed of many individual time-series, one for each unique combination of _Store_ and _Brand_. To distinguish the individual time-series, we define **time_series_id_column_names**, the columns whose values determine the boundaries between time-series: "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"time_series_id_column_names = [\"Store\", \"Brand\"]\n",
|
|
"nseries = data.groupby(time_series_id_column_names).ngroups\n",
|
|
"print(\"Data contains {0} individual time-series.\".format(nseries))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"For demonstration purposes, we extract sales time-series for just a few of the stores:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"use_stores = [2, 5, 8]\n",
|
|
"data_subset = data[data.Store.isin(use_stores)]\n",
|
|
"nseries = data_subset.groupby(time_series_id_column_names).ngroups\n",
|
|
"print(\"Data subset contains {0} individual time-series.\".format(nseries))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Data Splitting\n",
|
|
"We now split the data into a training and a testing set for later forecast evaluation. The test set will contain the final 20 weeks of observed sales for each time-series. The splits should be stratified by series, so we use a group-by statement on the time series identifier columns."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"n_test_periods = 20\n",
|
|
"\n",
|
|
"\n",
"def split_last_n_by_series_id(df, n):\n",
"    \"\"\"Group df by series identifiers and split on last n rows for each group.\"\"\"\n",
"    df_grouped = df.sort_values(time_column_name).groupby(  # Sort by ascending time\n",
"        time_series_id_column_names, group_keys=False\n",
"    )\n",
"    df_head = df_grouped.apply(lambda dfg: dfg.iloc[:-n])\n",
"    df_tail = df_grouped.apply(lambda dfg: dfg.iloc[-n:])\n",
"    return df_head, df_tail\n",
|
|
"\n",
|
|
"\n",
|
|
"train, test = split_last_n_by_series_id(data_subset, n_test_periods)"
|
|
]
|
|
},
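To make the per-series split concrete without the OJ data, here is a minimal, dependency-free sketch of the same idea: group rows by series identifier, sort each group by time, and hold out the last `n` rows. The toy rows and the helper name `split_last_n` are illustrative assumptions and are not part of the notebook or the Azure ML SDK.

```python
from collections import defaultdict


def split_last_n(rows, id_keys, time_key, n):
    """Hold out the last n rows (by time) of every series.

    rows is a list of dicts; id_keys names the columns that identify a series.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in id_keys)].append(row)

    head, tail = [], []
    for series_rows in groups.values():
        ordered = sorted(series_rows, key=lambda r: r[time_key])
        head.extend(ordered[:-n])
        tail.extend(ordered[-n:])
    return head, tail


# Toy data: two series (hypothetical store/brand values), five weeks each.
toy_rows = [
    {"Store": s, "Brand": b, "Week": w, "Quantity": 100 + w}
    for (s, b) in [(2, "a"), (5, "b")]
    for w in range(5)
]
toy_train, toy_test = split_last_n(toy_rows, ["Store", "Brand"], "Week", n=2)

# Every held-out row is later than every training row of the same series.
for row in toy_test:
    earlier = [
        r["Week"]
        for r in toy_train
        if (r["Store"], r["Brand"]) == (row["Store"], row["Brand"])
    ]
    assert max(earlier) < row["Week"]
```

Note that `n` must be positive in this sketch: with `n=0`, the slice `ordered[:-0]` is empty, so every row would land in the held-out set.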
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Upload data to datastore\n",
"The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace) is paired with a storage account, which contains the default datastore. We will use it to upload the train and test data and create [tabular datasets](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training and testing. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into a tabular representation."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.data.dataset_factory import TabularDatasetFactory\n",
|
|
"\n",
|
|
"datastore = ws.get_default_datastore()\n",
|
|
"train_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
|
|
" train, target=(datastore, \"dataset/\"), name=\"dominicks_OJ_train\"\n",
|
|
")\n",
|
|
"test_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
|
|
" test, target=(datastore, \"dataset/\"), name=\"dominicks_OJ_test\"\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Create dataset for training"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"train_dataset.to_pandas_dataframe().tail()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Modeling\n",
|
|
"\n",
|
|
"For forecasting tasks, AutoML uses pre-processing and estimation steps that are specific to time-series. AutoML will undertake the following pre-processing steps:\n",
|
|
"* Detect time-series sample frequency (e.g. hourly, daily, weekly) and create new records for absent time points to make the series regular. A regular time series has a well-defined frequency and has a value at every sample point in a contiguous time span \n",
|
|
"* Impute missing values in the target (via forward-fill) and feature columns (using median column values) \n",
|
|
"* Create features based on time series identifiers to enable fixed effects across different series\n",
|
|
"* Create time-based features to assist in learning seasonal patterns\n",
|
|
"* Encode categorical variables to numeric quantities\n",
|
|
"\n",
"In this notebook, AutoML will train a single, regression-type model across **all** time-series in a given training set. This allows the model to generalize across related series. If you want to train multiple models for different time-series, please see the many-models notebook.\n",
|
|
"\n",
|
|
"You are almost ready to start an AutoML training job. First, we need to separate the target column from the rest of the DataFrame: "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"target_column_name = \"Quantity\""
|
|
]
|
|
},
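The first two pre-processing steps listed above (making the series regular, then forward-filling the target) can be sketched in plain Python. This is only an illustration of the idea under simple assumptions: the frequency is known to be weekly, every observed date already lies on the weekly grid, and the helper name `regularize_weekly` and the sample values are hypothetical. It is not how AutoML implements these steps internally.

```python
from datetime import date, timedelta


def regularize_weekly(observations):
    """Fill gaps in a weekly series and forward-fill the missing targets.

    observations maps a week date to a target value. The result holds one
    entry per week between the first and last observed dates.
    """
    weeks = sorted(observations)
    step = timedelta(weeks=1)
    regular = {}
    current, last_value = weeks[0], None
    while current <= weeks[-1]:
        if current in observations:
            last_value = observations[current]
        regular[current] = last_value  # forward-fill when the week is absent
        current += step
    return regular


# Hypothetical series in which the week of 1990-06-28 is missing.
sales = {
    date(1990, 6, 14): 120,
    date(1990, 6, 21): 95,
    date(1990, 7, 5): 130,
}
filled = regularize_weekly(sales)
for week, quantity in filled.items():
    print(week, quantity)
```

The missing week is created and receives the last observed value, so the regular series has a value at every sample point in the contiguous time span.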
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Customization\n",
|
|
"\n",
|
|
"The featurization customization in forecasting is an advanced feature in AutoML which allows our customers to change the default forecasting featurization behaviors and column types through `FeaturizationConfig`. The supported scenarios include:\n",
|
|
"\n",
"1. Column purposes update: Override the feature type for the specified column. Currently supports DateTime, Categorical and Numeric. Use this customization when the type of a column does not correctly reflect its purpose. For instance, some numerical columns should be treated as categorical, while others hold epoch timestamps that need to be converted to datetime. To tell the SDK to preprocess these columns correctly, add a configuration entry with the columns and their desired types.\n",
"2. Transformer parameters update: Currently supports parameter changes for the Imputer only, which lets you customize the imputation method. The supported imputation methods for the target column are constant and ffill (forward fill). The supported imputation methods for feature columns are mean, median, most frequent, constant and ffill (forward fill). Use this customization when you know which imputation method best fits the input data. For instance, some datasets use NaN to represent 0, in which case the correct behavior is to impute all missing values with 0. To achieve this, configure these columns for constant imputation with `fill_value` 0.\n",
"3. Drop columns: Columns to exclude from featurization. These are usually columns that are leaky or that contain no useful data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": [
|
|
"sample-featurizationconfig-remarks"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"featurization_config = FeaturizationConfig()\n",
|
|
"# Force the CPWVOL5 feature to be numeric type.\n",
|
|
"featurization_config.add_column_purpose(\"CPWVOL5\", \"Numeric\")\n",
|
|
"# Fill missing values in the target column, Quantity, with zeros.\n",
|
|
"featurization_config.add_transformer_params(\n",
|
|
" \"Imputer\", [\"Quantity\"], {\"strategy\": \"constant\", \"fill_value\": 0}\n",
|
|
")\n",
|
|
"# Fill missing values in the INCOME column with median value.\n",
|
|
"featurization_config.add_transformer_params(\n",
|
|
" \"Imputer\", [\"INCOME\"], {\"strategy\": \"median\"}\n",
|
|
")\n",
|
|
"# Fill missing values in the Price column with forward fill (last value carried forward).\n",
|
|
"featurization_config.add_transformer_params(\"Imputer\", [\"Price\"], {\"strategy\": \"ffill\"})"
|
|
]
|
|
},
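To show what the three imputation strategies configured above do to a column, here is a small standalone sketch. It mimics the behavior on a plain Python list with `None` marking missing values; the function `impute` is a hypothetical helper written for this illustration and is not the Imputer used by AutoML.

```python
from statistics import median


def impute(values, strategy, fill_value=None):
    """Replace None entries using "constant", "median", or "ffill"."""
    if strategy == "constant":
        return [fill_value if v is None else v for v in values]
    if strategy == "median":
        center = median(v for v in values if v is not None)
        return [center if v is None else v for v in values]
    if strategy == "ffill":
        filled, last = [], None
        for v in values:
            last = last if v is None else v  # carry the last seen value forward
            filled.append(last)
        return filled
    raise ValueError(f"unsupported strategy: {strategy}")


column = [4.0, None, 10.0, None, 6.0]
print(impute(column, "constant", fill_value=0))  # [4.0, 0, 10.0, 0, 6.0]
print(impute(column, "median"))                  # [4.0, 6.0, 10.0, 6.0, 6.0]
print(impute(column, "ffill"))                   # [4.0, 4.0, 10.0, 10.0, 6.0]
```

In this sketch a forward fill leaves a leading missing value untouched, because there is no earlier observation to carry forward.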
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Forecasting Parameters\n",
"To define forecasting parameters for your experiment training, you can use the ForecastingParameters class. The table below details the forecasting parameters we will be passing into our experiment.\n",
|
|
"\n",
|
|
"\n",
|
|
"|Property|Description|\n",
|
|
"|-|-|\n",
|
|
"|**time_column_name**|The name of your time column.|\n",
|
|
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
|
|
"|**time_series_id_column_names**|This optional parameter represents the column names used to uniquely identify the time series in data that has multiple rows with the same timestamp. If the time series identifiers are not defined or incorrectly defined, time series identifiers will be created automatically if they exist.|\n",
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.|"
|
|
]
|
|
},
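As a rough illustration of how **forecast_horizon** and a weekly, Thursday-anchored **freq** interact, the sketch below lists the dates a forecast would cover after a given last training date. It uses only the standard library; the helper `forecast_dates` and the example date are assumptions made for this illustration, and AutoML derives these dates internally.

```python
from datetime import date, timedelta

THURSDAY = 3  # date.weekday(): Monday is 0, so Thursday is 3


def forecast_dates(last_train_date, horizon):
    """Return the `horizon` weekly dates that follow last_train_date."""
    if last_train_date.weekday() != THURSDAY:
        raise ValueError("expected a Thursday-anchored weekly series")
    return [last_train_date + timedelta(weeks=k) for k in range(1, horizon + 1)]


# Hypothetical last training week; a horizon of 20 covers the next 20 weeks.
dates = forecast_dates(date(1992, 5, 7), horizon=20)
print(dates[0], dates[-1], len(dates))
```

Because the horizon is counted in units of the series frequency, the same integer would span 20 days for a daily series and 20 weeks here.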
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Train<a id=\"train\"></a>\n",
|
|
"\n",
|
|
"The [AutoMLConfig](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py) object defines the settings and data for an AutoML training job. Here, we set necessary inputs like the task type, the number of AutoML iterations to try, the training data, and cross-validation parameters.\n",
|
|
"\n",
"For forecasting tasks, there are some additional parameters that can be set in the `ForecastingParameters` class: the name of the column holding the date/time, the timeseries id column names, and the maximum forecast horizon. A time column is required for forecasting, while the time_series_id is optional. If time_series_id columns are not given or incorrectly given, AutoML automatically creates time_series_id columns if they exist. The _logQuantity_ column is completely correlated with the target quantity, so it was dropped from the data when the CSV file was loaded to prevent a target leak.\n",
|
|
"\n",
|
|
"The forecast horizon is given in units of the time-series frequency; for instance, the OJ series frequency is weekly, so a horizon of 20 means that a trained model will estimate sales up to 20 weeks beyond the latest date in the training data for each series. In this example, we set the forecast horizon to the number of samples per series in the test set (n_test_periods). Generally, the value of this parameter will be dictated by business needs. For example, a demand planning application that estimates the next month of sales should set the horizon according to suitable planning time-scales. Please see the [energy_demand notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand) for more discussion of forecast horizon.\n",
|
|
"\n",
|
|
"We note here that AutoML can sweep over two types of time-series models:\n",
|
|
"* Models that are trained for each series such as ARIMA and Facebook's Prophet.\n",
|
|
"* Models trained across multiple time-series using a regression approach.\n",
|
|
"\n",
|
|
"In the first case, AutoML loops over all time-series in your dataset and trains one model (e.g. AutoArima or Prophet, as the case may be) for each series. This can result in long runtimes to train these models if there are a lot of series in the data. One way to mitigate this problem is to fit models for different series in parallel if you have multiple compute cores available. To enable this behavior, set the `max_cores_per_iteration` parameter in your AutoMLConfig as shown in the example in the next cell. \n",
|
|
"\n",
|
|
"\n",
|
|
"Finally, a note about the cross-validation (CV) procedure for time-series data. AutoML uses out-of-sample error estimates to select a best pipeline/model, so it is important that the CV fold splitting is done correctly. Time-series can violate the basic statistical assumptions of the canonical K-Fold CV strategy, so AutoML implements a [rolling origin validation](https://robjhyndman.com/hyndsight/tscv/) procedure to create CV folds for time-series data. To use this procedure, you just need to specify the desired number of CV folds in the AutoMLConfig object. It is also possible to bypass CV and use your own validation set by setting the *validation_data* parameter of AutoMLConfig.\n",
|
|
"\n",
|
|
"Here is a summary of AutoMLConfig parameters used for training the OJ model:\n",
|
|
"\n",
|
|
"|Property|Description|\n",
|
|
"|-|-|\n",
|
|
"|**task**|forecasting|\n",
"|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|\n",
|
|
"|**experiment_timeout_hours**|Experimentation timeout in hours.|\n",
|
|
"|**enable_early_stopping**|If early stopping is on, training will stop when the primary metric is no longer improving.|\n",
|
|
"|**training_data**|Input dataset, containing both features and label column.|\n",
|
|
"|**label_column_name**|The name of the label column.|\n",
|
|
"|**compute_target**|The remote compute for training.|\n",
|
|
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection|\n",
|
|
"|**enable_voting_ensemble**|Allow AutoML to create a Voting ensemble of the best performing models|\n",
|
|
"|**enable_stack_ensemble**|Allow AutoML to create a Stack ensemble of the best performing models|\n",
|
|
"|**debug_log**|Log file path for writing debugging information|\n",
|
|
"|**featurization**| 'auto' / 'off' / FeaturizationConfig Indicator for whether featurization step should be done automatically or not, or whether customized featurization should be used. Setting this enables AutoML to perform featurization on the input to handle *missing data*, and to perform some common *feature extraction*.|\n",
"|**max_cores_per_iteration**|Maximum number of cores to utilize per iteration. A value of -1 indicates all available cores should be used.|"
|
|
]
|
|
},
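The rolling origin validation idea mentioned above can be sketched with index arithmetic alone. In this illustration each fold validates on `horizon` consecutive points and trains on everything before them, and the origin advances by one horizon per fold. Both the step size and the helper name `rolling_origin_folds` are assumptions of this sketch; AutoML's actual fold construction may differ.

```python
def rolling_origin_folds(n_obs, n_folds, horizon):
    """Return (train_indices, validation_indices) pairs for rolling-origin CV.

    The training window always ends right before the validation window begins,
    so no fold ever trains on data from its own future.
    """
    folds = []
    for k in range(n_folds, 0, -1):
        val_end = n_obs - (k - 1) * horizon
        val_start = val_end - horizon
        if val_start <= 0:
            raise ValueError("series too short for the requested folds")
        folds.append((list(range(val_start)), list(range(val_start, val_end))))
    return folds


for train_idx, val_idx in rolling_origin_folds(n_obs=10, n_folds=3, horizon=2):
    print("train up to index", train_idx[-1], "validate on", val_idx)
```

Unlike shuffled K-Fold splits, every validation window here lies strictly after its training window, which is the property that makes the out-of-sample error estimate meaningful for time-series.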
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.automl.core.forecasting_parameters import ForecastingParameters\n",
|
|
"\n",
|
|
"forecasting_parameters = ForecastingParameters(\n",
|
|
" time_column_name=time_column_name,\n",
|
|
" forecast_horizon=n_test_periods,\n",
|
|
" freq=\"W-THU\", # Set the forecast frequency to be weekly (start on each Thursday)\n",
|
|
")\n",
|
|
"\n",
|
|
"automl_config = AutoMLConfig(\n",
|
|
" task=\"forecasting\",\n",
|
|
" debug_log=\"automl_oj_sales_errors.log\",\n",
|
|
" primary_metric=\"normalized_mean_absolute_error\",\n",
|
|
" experiment_timeout_hours=0.25,\n",
|
|
" training_data=train_dataset,\n",
|
|
" label_column_name=target_column_name,\n",
|
|
" compute_target=compute_target,\n",
|
|
" enable_early_stopping=True,\n",
|
|
" featurization=featurization_config,\n",
|
|
" n_cross_validations=3,\n",
|
|
" verbosity=logging.INFO,\n",
|
|
" max_cores_per_iteration=-1,\n",
|
|
" forecasting_parameters=forecasting_parameters,\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"You can now submit a new training run. Depending on the data and number of iterations this operation may take several minutes.\n",
"When `show_output=True` is set, information from each iteration, including validation errors and current status, is printed to the console and the execution is synchronous."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"remote_run = experiment.submit(automl_config, show_output=False)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"remote_run.wait_for_completion()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Retrieve the Best Run details\n",
|
|
"Below we retrieve the best Run object from among all the runs in the experiment."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"best_run = remote_run.get_best_child()\n",
|
|
"model_name = best_run.properties[\"model_name\"]\n",
|
|
"best_run"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Transparency\n",
|
|
"\n",
|
|
"View updated featurization summary"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Download the featurization summary JSON file locally\n",
|
|
"best_run.download_file(\"outputs/featurization_summary.json\", \"featurization_summary.json\")\n",
|
|
"\n",
|
|
"# Render the JSON as a pandas DataFrame\n",
"with open(\"featurization_summary.json\", \"r\") as f:\n",
"    records = json.load(f)\n",
|
|
"fs = pd.DataFrame.from_records(records)\n",
|
|
"\n",
|
|
"# View a summary of the featurization \n",
|
|
"fs[[\"RawFeatureName\", \"TypeDetected\", \"Dropped\", \"EngineeredFeatureCount\", \"Transformations\"]]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"## Forecast<a id=\"forecast\"></a>\n",
|
|
"\n",
"Now that we have retrieved the best pipeline/model, it can be used to make predictions on test data. We will do batch scoring on the test dataset, which should have the same schema as the training dataset.\n",
|
|
"\n",
|
|
"The inference will run on a remote compute. In this example, it will re-use the training compute."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"test_experiment = Experiment(ws, experiment_name + \"_inference\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"### Retrieving forecasts from the model\n",
"We have created a function called `run_remote_inference` (in the `run_forecast` module) that submits the test data to the best model determined during the training run and retrieves forecasts. This function uses a helper script `forecasting_script` which is uploaded and executed on the remote compute."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"To produce predictions on the test set, we need to know the feature values at all dates in the test set. This requirement is somewhat reasonable for the OJ sales data since the features mainly consist of price, which is usually set in advance, and customer demographics which are approximately constant for each store over the 20 week forecast horizon in the testing data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from run_forecast import run_remote_inference\n",
|
|
"\n",
|
|
"remote_run_infer = run_remote_inference(\n",
|
|
" test_experiment=test_experiment,\n",
|
|
" compute_target=compute_target,\n",
|
|
" train_run=best_run,\n",
|
|
" test_dataset=test_dataset,\n",
|
|
" target_column_name=target_column_name,\n",
|
|
")\n",
|
|
"remote_run_infer.wait_for_completion(show_output=False)\n",
|
|
"\n",
|
|
"# download the forecast file to the local machine\n",
|
|
"remote_run_infer.download_file(\"outputs/predictions.csv\", \"predictions.csv\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"## Evaluate\n",
|
|
"\n",
"To evaluate the accuracy of the forecast, we'll compare against the actual sales quantities for some select metrics, including the mean absolute percentage error (MAPE). For more metrics that can be used for evaluation after training, please see [supported metrics](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#regressionforecasting-metrics), and [how to calculate residuals](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#residuals).\n",
|
|
"\n",
|
|
"We'll add predictions and actuals into a single dataframe for convenience in calculating the metrics."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# load forecast data frame\n",
|
|
"fcst_df = pd.read_csv(\"predictions.csv\", parse_dates=[time_column_name])\n",
|
|
"fcst_df.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.automl.core.shared import constants\n",
|
|
"from azureml.automl.runtime.shared.score import scoring\n",
|
|
"from matplotlib import pyplot as plt\n",
|
|
"\n",
|
|
"# use automl scoring module\n",
|
|
"scores = scoring.score_regression(\n",
|
|
" y_test=fcst_df[target_column_name],\n",
|
|
" y_pred=fcst_df[\"predicted\"],\n",
|
|
" metrics=list(constants.Metric.SCALAR_REGRESSION_SET),\n",
|
|
")\n",
|
|
"\n",
|
|
"print(\"[Test data scores]\\n\")\n",
|
|
"for key, value in scores.items():\n",
"    print(\"{}: {:.3f}\".format(key, value))\n",
|
|
"\n",
|
|
"# Plot outputs\n",
|
|
"%matplotlib inline\n",
|
|
"test_pred = plt.scatter(fcst_df[target_column_name], fcst_df[\"predicted\"], color=\"b\")\n",
|
|
"test_test = plt.scatter(\n",
|
|
" fcst_df[target_column_name], fcst_df[target_column_name], color=\"g\"\n",
|
|
")\n",
|
|
"plt.legend(\n",
|
|
" (test_pred, test_test), (\"prediction\", \"truth\"), loc=\"upper left\", fontsize=8\n",
|
|
")\n",
|
|
"plt.show()"
|
|
]
|
|
},
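For readers who want to see what two of these metrics measure, here is a hand-rolled sketch of MAPE and of a normalized mean absolute error. It assumes that the normalized variant divides the mean absolute error by the range of the actual values; that normalization, the function names, and the sample numbers are assumptions of this sketch, and the `scoring` module used above may differ in detail (for example in how it treats zeros or multiple series).

```python
def mape(actual, predicted):
    """Mean absolute percentage error in percent; rows with actual == 0 are skipped."""
    terms = [abs(a - p) / abs(a) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(terms) / len(terms)


def normalized_mae(actual, predicted):
    """Mean absolute error divided by the range of the actual values."""
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    return mae / (max(actual) - min(actual))


actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(round(mape(actual, predicted), 2))            # 6.67
print(round(normalized_mae(actual, predicted), 4))  # 0.0333
```

MAPE weights every row by its own magnitude, so it is undefined when an actual value is zero, while the normalized MAE stays on a comparable scale across series with different sales volumes.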
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"## Operationalize<a id=\"operationalize\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"_Operationalization_ means getting the model into the cloud so that others can run it after you close the notebook. We will create a Docker container running on Azure Container Instances with the model."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"description = \"AutoML OJ forecaster\"\n",
|
|
"tags = None\n",
|
|
"model = remote_run.register_model(\n",
|
|
" model_name=model_name, description=description, tags=tags\n",
|
|
")\n",
|
|
"\n",
|
|
"print(remote_run.model_id)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Develop the scoring script\n",
|
|
"\n",
|
|
"For the deployment we need a function which will run the forecast on serialized data. It can be obtained from the best_run."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"script_file_name = \"score_fcast.py\"\n",
|
|
"best_run.download_file(\"outputs/scoring_file_v_1_0_0.py\", script_file_name)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Deploy the model as a Web Service on Azure Container Instance"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.model import InferenceConfig\n",
|
|
"from azureml.core.webservice import AciWebservice\n",
|
|
"from azureml.core.webservice import Webservice\n",
|
|
"from azureml.core.model import Model\n",
|
|
"\n",
|
|
"inference_config = InferenceConfig(\n",
|
|
" environment=best_run.get_environment(), entry_script=script_file_name\n",
|
|
")\n",
|
|
"\n",
|
|
"aciconfig = AciWebservice.deploy_configuration(\n",
|
|
" cpu_cores=2,\n",
|
|
" memory_gb=4,\n",
|
|
" tags={\"type\": \"automl-forecasting\"},\n",
|
|
" description=\"Automl forecasting sample service\",\n",
|
|
")\n",
|
|
"\n",
|
|
"aci_service_name = \"automl-oj-forecast-01\"\n",
|
|
"print(aci_service_name)\n",
|
|
"aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)\n",
|
|
"aci_service.wait_for_deployment(True)\n",
|
|
"print(aci_service.state)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"aci_service.get_logs()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Call the service"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import json\n",
|
|
"\n",
|
|
"X_query = test.copy()\n",
|
|
"X_query.pop(target_column_name)\n",
|
|
"# We have to convert datetime to string, because Timestamps cannot be serialized to JSON.\n",
|
|
"X_query[time_column_name] = X_query[time_column_name].astype(str)\n",
"# The Service object accepts a complex dictionary, which is internally converted to a JSON string.\n",
"# The 'data' section contains the data frame as a list of row dictionaries.\n",
|
|
"sample_quantiles = [0.025, 0.975]\n",
|
|
"test_sample = json.dumps(\n",
|
|
" {\"data\": X_query.to_dict(orient=\"records\"), \"quantiles\": sample_quantiles}\n",
|
|
")\n",
|
|
"response = aci_service.run(input_data=test_sample)\n",
|
|
"# translate from networkese to datascientese\n",
"try:\n",
"    res_dict = json.loads(response)\n",
"    y_fcst_all = pd.DataFrame(res_dict[\"index\"])\n",
"    y_fcst_all[time_column_name] = pd.to_datetime(\n",
"        y_fcst_all[time_column_name], unit=\"ms\"\n",
"    )\n",
"    y_fcst_all[\"forecast\"] = res_dict[\"forecast\"]\n",
"    y_fcst_all[\"prediction_interval\"] = res_dict[\"prediction_interval\"]\n",
"except Exception:\n",
"    # Show the raw response; res_dict is undefined if JSON parsing failed.\n",
"    print(response)"
|
|
]
|
|
},
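The request and response handling above can be tried offline with a mocked service reply. The sketch below builds the same kind of payload (timestamps converted to strings, plus a `quantiles` list) and then parses a hand-written stand-in response whose `index` entries carry the time column as milliseconds since the epoch, matching the `unit="ms"` conversion used above. The row values and the mocked response are hypothetical, were not captured from a real deployment, and omit the `prediction_interval` field because its exact layout is not shown in this notebook.

```python
import json
from datetime import datetime, timezone

TIME_COLUMN = "WeekStarting"

# Request side: convert timestamps to strings so json.dumps accepts them.
rows = [{TIME_COLUMN: datetime(1992, 5, 14), "Store": 2, "Brand": "tropicana"}]
payload = json.dumps(
    {
        "data": [{**row, TIME_COLUMN: str(row[TIME_COLUMN])} for row in rows],
        "quantiles": [0.025, 0.975],
    }
)

# Response side: a hand-written stand-in, NOT output from a real service.
epoch_ms = int(datetime(1992, 5, 14, tzinfo=timezone.utc).timestamp() * 1000)
mock_response = json.dumps(
    {
        "index": [{TIME_COLUMN: epoch_ms, "Store": 2, "Brand": "tropicana"}],
        "forecast": [9500.0],
    }
)

result = json.loads(mock_response)
parsed = []
for entry, value in zip(result["index"], result["forecast"]):
    # Millisecond epoch values are converted back to calendar dates.
    week = datetime.fromtimestamp(entry[TIME_COLUMN] / 1000, tz=timezone.utc)
    parsed.append((week.date().isoformat(), entry["Store"], entry["Brand"], value))

print(parsed)
```

Keeping the parsing step separate from the service call makes it easy to check the date handling before an endpoint is deployed.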
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"y_fcst_all.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Delete the web service if desired"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"serv = Webservice(ws, \"automl-oj-forecast-01\")\n",
|
|
"serv.delete() # don't do it accidentally"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "jialiu"
|
|
}
|
|
],
|
|
"category": "tutorial",
|
|
"celltoolbar": "Raw Cell Format",
|
|
"compute": [
|
|
"Remote"
|
|
],
|
|
"datasets": [
|
|
"Orange Juice Sales"
|
|
],
|
|
"deployment": [
|
|
"Azure Container Instance"
|
|
],
|
|
"exclude_from_index": false,
|
|
"framework": [
|
|
"Azure ML AutoML"
|
|
],
|
|
"friendly_name": "Forecasting orange juice sales with deployment",
|
|
"index_order": 1,
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.9"
|
|
},
|
|
"tags": [
|
|
"None"
|
|
],
|
|
"task": "Forecasting"
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
} |