{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-github-dau/auto-ml-forecasting-github-dau.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=\"red\" size=\"5\"><strong>!Important!</strong> </br>This notebook is outdated and is not supported by the AutoML Team. Please use the supported version ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/automl-standalone-jobs/automl-forecasting-github-dau)).</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"# Automated Machine Learning\n",
"**GitHub DAU Forecasting**\n",
"\n",
"## Contents\n",
"1. [Introduction](#Introduction)\n",
"1. [Setup](#Setup)\n",
"1. [Data](#Data)\n",
"1. [Train](#Train)\n",
"1. [Evaluate](#Evaluate-on-Test-Data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"## Introduction\n",
"This notebook demonstrates forecasting on the GitHub Daily Active Users (DAU) dataset using AutoML.\n",
"\n",
"AutoML highlights here include deep learning forecasts, ARIMA, Prophet, remote execution and remote inferencing, and working with the `forecast` function. Please also look at the additional forecasting notebooks, which document lagging, rolling windows, forecast quantiles, other ways to use the forecast function, and forecaster deployment.\n",
"\n",
"Make sure you have executed the [configuration](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) before running this notebook.\n",
"\n",
"Notebook synopsis:\n",
"\n",
"1. Creating an Experiment in an existing Workspace\n",
"2. Configuration and remote run of AutoML for a time-series model exploring DNNs\n",
"3. Evaluating the fitted model using a rolling test"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"## Setup\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"import os\n",
"import azureml.core\n",
"import pandas as pd\n",
"import numpy as np\n",
"import logging\n",
"import warnings\n",
"\n",
"from pandas.tseries.frequencies import to_offset\n",
"\n",
"# Squash warning messages for cleaner output in the notebook\n",
"warnings.showwarning = lambda *args, **kwargs: None\n",
"\n",
"from azureml.core import Workspace, Experiment, Dataset\n",
"from azureml.train.automl import AutoMLConfig\n",
"from matplotlib import pyplot as plt\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error\n",
"from azureml.train.estimator import Estimator"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook is compatible with Azure ML SDK version 1.35.0 or later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"As part of the setup you have already created a <b>Workspace</b>. To run AutoML, you also need to create an <b>Experiment</b>. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# choose a name for the run history container in the workspace\n",
"experiment_name = \"github-remote-cpu\"\n",
"\n",
"experiment = Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output[\"Subscription ID\"] = ws.subscription_id\n",
"output[\"Workspace\"] = ws.name\n",
"output[\"Resource Group\"] = ws.resource_group\n",
"output[\"Location\"] = ws.location\n",
"output[\"Run History Name\"] = experiment_name\n",
"output[\"SDK Version\"] = azureml.core.VERSION\n",
"pd.set_option(\"display.max_colwidth\", None)\n",
"outputDf = pd.DataFrame(data=output, index=[\"\"])\n",
"outputDf.T"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"### Using AmlCompute\n",
"You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you use `AmlCompute` as your training compute resource.\n",
"\n",
"> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# Choose a name for your CPU cluster\n",
"cpu_cluster_name = \"github-cluster\"\n",
"\n",
"# Reuse the cluster if it already exists; otherwise create it\n",
"try:\n",
"    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)\n",
"    print(\"Found existing cluster, use it.\")\n",
"except ComputeTargetException:\n",
"    compute_config = AmlCompute.provisioning_configuration(\n",
"        vm_size=\"STANDARD_DS12_V2\", max_nodes=4\n",
"    )\n",
"    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)\n",
"\n",
"compute_target.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"## Data\n",
"Read Github DAU data from file, and preview data."
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"Let's set up what we know about the dataset.\n",
"\n",
"**Target column** is what we want to forecast.\n",
"\n",
"**Time column** is the time axis along which to predict.\n",
"\n",
"**Time series identifier columns** are the columns listed in `time_series_id_column_names` whose values identify each series, for example \"store\" and \"item\" if your data has multiple time series of sales, one series for each combination of store and item sold.\n",
"\n",
"**Forecast frequency (freq)** This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n",
"\n",
"This dataset has only one time series. Please see the [orange juice notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-orange-juice-sales) for an example of a multi-time series dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from pandas import DataFrame\n",
"from pandas import Grouper\n",
"from pandas import concat\n",
"from pandas.plotting import register_matplotlib_converters\n",
"\n",
"register_matplotlib_converters()\n",
"plt.figure(figsize=(20, 10))\n",
"plt.tight_layout()\n",
"\n",
"plt.subplot(2, 1, 1)\n",
"plt.title(\"Github Daily Active User By Year\")\n",
"df = pd.read_csv(\"github_dau_2011-2018_train.csv\", parse_dates=True, index_col=\"date\")\n",
"test_df = pd.read_csv(\n",
" \"github_dau_2011-2018_test.csv\", parse_dates=True, index_col=\"date\"\n",
")\n",
"plt.plot(df)\n",
"\n",
"plt.subplot(2, 1, 2)\n",
"plt.title(\"Github Daily Active User By Month\")\n",
"groups = df.groupby(df.index.month)\n",
"months = concat([DataFrame(x[1].values) for x in groups], axis=1)\n",
"months = DataFrame(months)\n",
"months.columns = range(1, 49)\n",
"months.boxplot()\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"target_column_name = \"count\"\n",
"time_column_name = \"date\"\n",
"time_series_id_column_names = []\n",
"freq = \"D\" # Daily data"
]
},
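{
"cell_type": "markdown",
"metadata": {},
"source": [
"The frequency setting above can be sanity-checked with plain pandas before the data is handed to AutoML. The cell below is a minimal sketch on a small synthetic daily series (the dates and counts are made up for illustration and are not taken from the GitHub DAU files): it places the series on a regular daily grid using the `\"D\"` offset alias and lists the days that turn out to be missing. The `demo_` names are hypothetical and do not touch the variables used elsewhere in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Synthetic daily series with one missing day (2018-01-04); values are made up.\n",
"demo_dates = pd.to_datetime([\"2018-01-01\", \"2018-01-02\", \"2018-01-03\", \"2018-01-05\"])\n",
"demo_series = pd.Series([10, 12, 11, 15], index=demo_dates, name=\"count\")\n",
"\n",
"# Reindex onto a regular daily grid; days absent from the data become NaN.\n",
"demo_regular = demo_series.asfreq(\"D\")\n",
"demo_missing_days = demo_regular[demo_regular.isna()].index\n",
"\n",
"print(\"Rows before:\", len(demo_series), \"| rows on daily grid:\", len(demo_regular))\n",
"print(\"Missing days:\", [d.strftime(\"%Y-%m-%d\") for d in demo_missing_days])"
]
},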
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Split Training data into Train and Validation set and Upload to Datastores"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"from helper import split_fraction_by_grain\n",
"from helper import split_full_for_forecasting\n",
"\n",
"train, valid = split_full_for_forecasting(df, time_column_name)\n",
"\n",
"# Reset index to create a Tabular Dataset.\n",
"train.reset_index(inplace=True)\n",
"valid.reset_index(inplace=True)\n",
"test_df.reset_index(inplace=True)\n",
"\n",
"datastore = ws.get_default_datastore()\n",
"train_dataset = Dataset.Tabular.register_pandas_dataframe(\n",
" train, target=(datastore, \"dataset/\"), name=\"Github_DAU_train\"\n",
")\n",
"valid_dataset = Dataset.Tabular.register_pandas_dataframe(\n",
" valid, target=(datastore, \"dataset/\"), name=\"Github_DAU_valid\"\n",
")\n",
"test_dataset = Dataset.Tabular.register_pandas_dataframe(\n",
" test_df, target=(datastore, \"dataset/\"), name=\"Github_DAU_test\"\n",
")"
]
},
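{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `split_full_for_forecasting` helper lives in `helper.py`, which is not shown in this notebook. As a rough illustration of the general idea only (not that helper's actual implementation), the sketch below holds out the most recent rows of a time-indexed frame as a validation set, so that validation data never precedes training data. The function name `demo_split_by_time`, the 20% validation fraction, and the synthetic data are assumptions made for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"\n",
"def demo_split_by_time(frame, valid_fraction=0.2):\n",
"    \"\"\"Split a time-indexed frame into earlier (train) and later (valid) rows.\"\"\"\n",
"    ordered = frame.sort_index()\n",
"    n_valid = max(1, int(len(ordered) * valid_fraction))\n",
"    return ordered.iloc[:-n_valid], ordered.iloc[-n_valid:]\n",
"\n",
"\n",
"# Synthetic daily data; the counts are made up for illustration.\n",
"demo_frame = pd.DataFrame(\n",
"    {\"count\": range(10)}, index=pd.date_range(\"2018-01-01\", periods=10, freq=\"D\")\n",
")\n",
"demo_train, demo_valid = demo_split_by_time(demo_frame)\n",
"print(len(demo_train), \"training rows ending\", demo_train.index.max().date())\n",
"print(len(demo_valid), \"validation rows starting\", demo_valid.index.min().date())"
]
},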
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"### Setting forecaster maximum horizon\n",
"\n",
"The forecast horizon is the number of periods into the future that the model should predict. Here, we set the horizon to 14 periods (i.e. 14 days). Notice that this is much shorter than the length of the test set; we will need to use a rolling test to evaluate the performance on the whole test set. For more discussion of forecast horizons and guiding principles for setting them, please see the [energy demand notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"forecast_horizon = 14"
]
},
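{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the rolling evaluation concrete, the sketch below cuts a test period into consecutive windows of at most `forecast_horizon` days, one window per forecast origin. It only illustrates the windowing idea; the rolling forecast in this notebook is performed by `infer.py` and the helper functions, whose internals are not shown here. The function name `demo_rolling_windows` and the 30-day synthetic test period are assumptions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"\n",
"def demo_rolling_windows(index, horizon):\n",
"    \"\"\"Yield consecutive chunks of `index`, each at most `horizon` periods long.\"\"\"\n",
"    for start in range(0, len(index), horizon):\n",
"        yield index[start : start + horizon]\n",
"\n",
"\n",
"demo_horizon = 14\n",
"# Synthetic 30-day test period (not the real test set).\n",
"demo_test_index = pd.date_range(\"2018-06-01\", periods=30, freq=\"D\")\n",
"\n",
"for window in demo_rolling_windows(demo_test_index, demo_horizon):\n",
"    print(window[0].date(), \"to\", window[-1].date(), \"-\", len(window), \"days\")"
]
},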
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"## Train\n",
"\n",
"Instantiate an AutoMLConfig object. This defines the settings and data used to run the experiment.\n",
"\n",
"|Property|Description|\n",
"|-|-|\n",
"|**task**|forecasting|\n",
"|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|\n",
"|**experiment_timeout_hours**|Maximum amount of time in hours that all iterations combined can take before the experiment terminates.|\n",
"|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|\n",
"|**training_data**|Input dataset, containing both features and label column.|\n",
"|**label_column_name**|The name of the label column.|\n",
"|**enable_dnn**|Enable Forecasting DNNs|\n"
]
},
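{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough guide to what the `normalized_root_mean_squared_error` metric listed above measures, the following sketch computes a normalized RMSE with NumPy. It assumes the normalization divides the RMSE by the range (max minus min) of the actual values; check the Azure ML metric documentation for the exact definition AutoML applies. The function name `demo_normalized_rmse` and the sample numbers are made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def demo_normalized_rmse(y_true, y_pred):\n",
"    \"\"\"RMSE divided by the range of the actual values (assumed normalization).\"\"\"\n",
"    y_true = np.asarray(y_true, dtype=float)\n",
"    y_pred = np.asarray(y_pred, dtype=float)\n",
"    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))\n",
"    value_range = y_true.max() - y_true.min()\n",
"    if value_range == 0:\n",
"        raise ValueError(\"Actual values are constant; the range is zero.\")\n",
"    return rmse / value_range\n",
"\n",
"\n",
"# Made-up actuals and forecasts.\n",
"demo_actual = [100, 120, 140, 160]\n",
"demo_forecast = [110, 110, 150, 150]\n",
"print(demo_normalized_rmse(demo_actual, demo_forecast))"
]
},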
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"from azureml.automl.core.forecasting_parameters import ForecastingParameters\n",
"\n",
"forecasting_parameters = ForecastingParameters(\n",
" time_column_name=time_column_name,\n",
" forecast_horizon=forecast_horizon,\n",
" freq=\"D\", # Set the forecast frequency to be daily\n",
")\n",
"\n",
"# To only allow the TCNForecaster we set the allowed_models parameter to reflect this.\n",
"automl_config = AutoMLConfig(\n",
" task=\"forecasting\",\n",
" primary_metric=\"normalized_root_mean_squared_error\",\n",
" experiment_timeout_hours=1.5,\n",
" training_data=train_dataset,\n",
" label_column_name=target_column_name,\n",
" validation_data=valid_dataset,\n",
" verbosity=logging.INFO,\n",
" compute_target=compute_target,\n",
" max_concurrent_iterations=4,\n",
" max_cores_per_iteration=-1,\n",
" enable_dnn=True,\n",
" allowed_models=[\"TCNForecaster\"],\n",
" forecasting_parameters=forecasting_parameters,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"We will now run the experiment. The model search is bounded by the `experiment_timeout_hours` setting above (1.5 hours); a longer timeout allows more of the search to complete if more accurate results are required. With `show_output=True`, validation errors and the current status are shown and the execution is synchronous."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"remote_run = experiment.submit(automl_config, show_output=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"# If you need to retrieve a run that already started, use the following code\n",
"# from azureml.train.automl.run import AutoMLRun\n",
"# remote_run = AutoMLRun(experiment = experiment, run_id = '<replace with your run id>')"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"Displaying the run objects gives you links to the visual tools in the Azure Portal. Go try them!"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"### Retrieve the Best Model for Each Algorithm\n",
"Below we select the best pipeline from our iterations. The `get_result_df` helper from `helper.py` returns a summary table of the runs in `remote_run`, from which we pick the run with the lowest score for the DNN forecaster. Alternatively, the `get_output` method on the AutoML run returns the best run and the fitted model for the last fit invocation. There are overloads on `get_output` that allow you to retrieve the best run and fitted model for any logged metric or a particular iteration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"from helper import get_result_df\n",
"\n",
"summary_df = get_result_df(remote_run)\n",
"summary_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"from azureml.core.run import Run\n",
"from azureml.widgets import RunDetails\n",
"\n",
"forecast_model = \"TCNForecaster\"\n",
"if forecast_model not in summary_df[\"run_id\"]:\n",
"    forecast_model = \"ForecastTCN\"\n",
"\n",
"best_dnn_run_id = summary_df[summary_df[\"Score\"] == summary_df[\"Score\"].min()][\n",
"    \"run_id\"\n",
"][forecast_model]\n",
"best_dnn_run = Run(experiment, best_dnn_run_id)"
]
},
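{
"cell_type": "markdown",
"metadata": {},
"source": [
"The selection step above can be tried on a toy table. The sketch below builds a small DataFrame with the same `run_id` and `Score` column names, indexed by algorithm name as the lookups above imply; the algorithm names, run IDs, and scores are invented for illustration. It picks the row with the lowest score using `idxmin`, which is appropriate when the primary metric is an error measure such as normalized RMSE."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy summary table; names, run IDs, and scores are invented for illustration.\n",
"demo_summary = pd.DataFrame(\n",
"    {\"run_id\": [\"run_a\", \"run_b\", \"run_c\"], \"Score\": [0.12, 0.08, 0.10]},\n",
"    index=[\"ModelA\", \"ModelB\", \"ModelC\"],\n",
")\n",
"\n",
"# For an error metric such as normalized RMSE, lower is better.\n",
"demo_best_algorithm = demo_summary[\"Score\"].idxmin()\n",
"demo_best_run_id = demo_summary.loc[demo_best_algorithm, \"run_id\"]\n",
"print(demo_best_algorithm, demo_best_run_id)"
]
},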
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"best_dnn_run.parent\n",
"RunDetails(best_dnn_run.parent).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"best_dnn_run\n",
"RunDetails(best_dnn_run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"## Evaluate on Test Data"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"We now use the best fitted model from the AutoML Run to make forecasts for the test set. \n",
"\n",
"We always score on the original dataset whose schema matches the training set schema."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"# preview the first 5 rows of the dataset\n",
"test_dataset.take(5).to_pandas_dataframe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"compute_target = ws.compute_targets[\"github-cluster\"]\n",
"test_experiment = Experiment(ws, experiment_name + \"_test\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"import os\n",
"import shutil\n",
"\n",
"script_folder = os.path.join(os.getcwd(), \"inference\")\n",
"os.makedirs(script_folder, exist_ok=True)\n",
"shutil.copy(\"infer.py\", script_folder)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from helper import run_inference\n",
"\n",
"test_run = run_inference(\n",
" test_experiment,\n",
" compute_target,\n",
" script_folder,\n",
" best_dnn_run,\n",
" test_dataset,\n",
" valid_dataset,\n",
" forecast_horizon,\n",
" target_column_name,\n",
" time_column_name,\n",
" freq,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RunDetails(test_run).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from helper import run_multiple_inferences\n",
"\n",
"summary_df = run_multiple_inferences(\n",
" summary_df,\n",
" experiment,\n",
" test_experiment,\n",
" compute_target,\n",
" script_folder,\n",
" test_dataset,\n",
" valid_dataset,\n",
" forecast_horizon,\n",
" target_column_name,\n",
" time_column_name,\n",
" freq,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"for run_name, run_summary in summary_df.iterrows():\n",
"    print(run_name)\n",
"    print(run_summary)\n",
"    run_id = run_summary.run_id\n",
"    test_run_id = run_summary.test_run_id\n",
"    test_run = Run(test_experiment, test_run_id)\n",
"    test_run.wait_for_completion()\n",
"    test_score = test_run.get_metrics()[run_summary.primary_metric]\n",
"    summary_df.loc[summary_df.run_id == run_id, \"Test Score\"] = test_score\n",
"    print(\"Test Score: \", test_score)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"summary_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"authors": [
{
"name": "jialiu"
}
],
"hide_code_all_hidden": false,
"kernelspec": {
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}