mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-20 09:37:04 -05:00
Compare commits
46 Commits
azureml-sd
...
jeffshep/s
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
d0961b98bf | ||
|
|
302589b7f9 | ||
|
|
cc85949d6d | ||
|
|
3a1824e3ad | ||
|
|
579643326d | ||
|
|
14f76f227e | ||
|
|
25baf5203a | ||
|
|
1178fcb0ba | ||
|
|
e4d84c8e45 | ||
|
|
7a3ab1e44c | ||
|
|
598a293dfa | ||
|
|
40b3068462 | ||
|
|
0ecbbbce75 | ||
|
|
9b1e130d18 | ||
|
|
0e17b33d2a | ||
|
|
34d80abd26 | ||
|
|
249278ab77 | ||
|
|
25fdb17f80 | ||
|
|
3a02a27f1e | ||
|
|
4eed9d529f | ||
|
|
f344d410a2 | ||
|
|
9dc1228063 | ||
|
|
4404e62f58 | ||
|
|
38d5743bbb | ||
|
|
0814eee151 | ||
|
|
f45b815221 | ||
|
|
bd629ae454 | ||
|
|
41de75a584 | ||
|
|
96a426dc36 | ||
|
|
824dd40f7e | ||
|
|
fa2e649fe8 | ||
|
|
e25e8e3a41 | ||
|
|
aa3670a902 | ||
|
|
ef1f9205ac | ||
|
|
3228bbfc63 | ||
|
|
f18a0dfc4d | ||
|
|
badb620261 | ||
|
|
acf46100ae | ||
|
|
cf2e3804d5 | ||
|
|
b7be42357f | ||
|
|
3ac82c07ae | ||
|
|
9743c0a1fa | ||
|
|
ba4dac530e | ||
|
|
7f7f0040fd | ||
|
|
9ca567cd9c | ||
|
|
e0c9376aab |
@@ -1,6 +1,8 @@
|
||||
# Azure Machine Learning Python SDK notebooks
|
||||
|
||||
> a community-driven repository of examples using mlflow for tracking can be found at https://github.com/Azure/azureml-examples
|
||||
|
||||
** **With the introduction of AzureML SDK v2, this samples repository for the v1 SDK is now deprecated and will not be monitored or updated. Users are encouraged to visit the [v2 SDK samples repository](https://github.com/Azure/azureml-examples) instead for up-to-date and enhanced examples of how to build, train, and deploy machine learning models with AzureML's newest features.** **
|
||||
|
||||
|
||||
Welcome to the Azure Machine Learning Python SDK notebooks repository!
|
||||
|
||||
|
||||
41
SECURITY.md
Normal file
41
SECURITY.md
Normal file
@@ -0,0 +1,41 @@
|
||||
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.7 BLOCK -->
|
||||
|
||||
## Security
|
||||
|
||||
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
|
||||
|
||||
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/opensource/security/definition), please report it to us as described below.
|
||||
|
||||
## Reporting Security Issues
|
||||
|
||||
**Please do not report security vulnerabilities through public GitHub issues.**
|
||||
|
||||
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/opensource/security/create-report).
|
||||
|
||||
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/opensource/security/pgpkey).
|
||||
|
||||
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://aka.ms/opensource/security/msrc).
|
||||
|
||||
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
|
||||
|
||||
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
|
||||
* Full paths of source file(s) related to the manifestation of the issue
|
||||
* The location of the affected source code (tag/branch/commit or direct URL)
|
||||
* Any special configuration required to reproduce the issue
|
||||
* Step-by-step instructions to reproduce the issue
|
||||
* Proof-of-concept or exploit code (if possible)
|
||||
* Impact of the issue, including how an attacker might exploit the issue
|
||||
|
||||
This information will help us triage your report more quickly.
|
||||
|
||||
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/opensource/security/bounty) page for more details about our active programs.
|
||||
|
||||
## Preferred Languages
|
||||
|
||||
We prefer all communications to be in English.
|
||||
|
||||
## Policy
|
||||
|
||||
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/opensource/security/cvd).
|
||||
|
||||
<!-- END MICROSOFT SECURITY.MD BLOCK -->
|
||||
@@ -103,7 +103,7 @@
|
||||
"source": [
|
||||
"import azureml.core\n",
|
||||
"\n",
|
||||
"print(\"This notebook was created using version 1.44.0 of the Azure ML SDK\")\n",
|
||||
"print(\"This notebook was created using version 1.51.0 of the Azure ML SDK\")\n",
|
||||
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
|
||||
]
|
||||
},
|
||||
@@ -367,9 +367,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -398,7 +398,7 @@
|
||||
"# run_config.target = gpu_cluster_name\n",
|
||||
"# run_config.environment.docker.enabled = True\n",
|
||||
"# run_config.environment.docker.gpu_support = True\n",
|
||||
"# run_config.environment.docker.base_image = \"rapidsai/rapidsai:cuda9.2-runtime-ubuntu18.04\"\n",
|
||||
"# run_config.environment.docker.base_image = \"rapidsai/rapidsai:cuda9.2-runtime-ubuntu20.04\"\n",
|
||||
"# # run_config.environment.docker.base_image_registry.address = '<registry_url>' # not required if the base_image is in Docker hub\n",
|
||||
"# # run_config.environment.docker.base_image_registry.username = '<user_name>' # needed only for private images\n",
|
||||
"# # run_config.environment.docker.base_image_registry.password = '<password>' # needed only for private images\n",
|
||||
@@ -525,9 +525,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -599,9 +599,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -6,7 +6,8 @@ dependencies:
|
||||
- fairlearn>=0.6.2
|
||||
- joblib
|
||||
- liac-arff
|
||||
- raiwidgets~=0.19.0
|
||||
- raiwidgets~=0.26.0
|
||||
- itsdangerous==2.0.1
|
||||
- markupsafe<2.1.0
|
||||
- protobuf==3.20.0
|
||||
- numpy<1.24.0
|
||||
|
||||
@@ -523,9 +523,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -6,7 +6,8 @@ dependencies:
|
||||
- fairlearn>=0.6.2
|
||||
- joblib
|
||||
- liac-arff
|
||||
- raiwidgets~=0.19.0
|
||||
- raiwidgets~=0.26.0
|
||||
- itsdangerous==2.0.1
|
||||
- markupsafe<2.1.0
|
||||
- protobuf==3.20.0
|
||||
- numpy<1.24.0
|
||||
|
||||
@@ -5,29 +5,21 @@ channels:
|
||||
- main
|
||||
dependencies:
|
||||
# The python interpreter version.
|
||||
# Currently Azure ML only supports 3.6.0 and later.
|
||||
- pip==20.2.4
|
||||
- python>=3.6,<3.9
|
||||
- matplotlib==3.2.1
|
||||
- py-xgboost==1.3.3
|
||||
- pytorch::pytorch=1.4.0
|
||||
# Azure ML only supports 3.7.0 and later.
|
||||
- pip==22.3.1
|
||||
- python>=3.7,<3.9
|
||||
- conda-forge::fbprophet==0.7.1
|
||||
- cudatoolkit=10.1.243
|
||||
- pandas==1.1.5
|
||||
- scipy==1.5.3
|
||||
- notebook
|
||||
- pywin32==227
|
||||
- PySocks==1.7.1
|
||||
- conda-forge::pyqt==5.12.3
|
||||
- jsonschema==4.9.1
|
||||
- Pygments==2.12.0
|
||||
- Cython==0.29.14
|
||||
- tqdm==4.65.0
|
||||
|
||||
- pip:
|
||||
# Required packages for AzureML execution, history, and data preparation.
|
||||
- azureml-widgets~=1.44.0
|
||||
- pytorch-transformers==1.0.0
|
||||
- spacy==2.2.4
|
||||
- pystan==2.19.1.1
|
||||
- https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz
|
||||
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.44.0/validated_win32_requirements.txt [--no-deps]
|
||||
- arch==4.14
|
||||
- wasabi==0.9.1
|
||||
- azureml-widgets~=1.51.0
|
||||
- azureml-defaults~=1.51.0
|
||||
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.51.0/validated_win32_requirements.txt [--no-deps]
|
||||
- matplotlib==3.6.2
|
||||
- xgboost==1.3.3
|
||||
- cmdstanpy==0.9.5
|
||||
- setuptools-git==1.2
|
||||
|
||||
@@ -5,11 +5,9 @@ channels:
|
||||
- main
|
||||
dependencies:
|
||||
# The python interpreter version.
|
||||
# Currently Azure ML only supports 3.6.0 and later.
|
||||
- pip==20.2.4
|
||||
- python>=3.6,<3.9
|
||||
- boto3==1.20.19
|
||||
- botocore<=1.23.19
|
||||
# Azure ML only supports 3.7 and later.
|
||||
- pip==22.3.1
|
||||
- python>=3.7,<3.9
|
||||
- matplotlib==3.2.1
|
||||
- numpy>=1.21.6,<=1.22.3
|
||||
- cython==0.29.14
|
||||
@@ -19,15 +17,16 @@ dependencies:
|
||||
- py-xgboost<=1.3.3
|
||||
- holidays==0.10.3
|
||||
- conda-forge::fbprophet==0.7.1
|
||||
- pytorch::pytorch=1.4.0
|
||||
- pytorch::pytorch=1.11.0
|
||||
- cudatoolkit=10.1.243
|
||||
- notebook
|
||||
|
||||
- pip:
|
||||
# Required packages for AzureML execution, history, and data preparation.
|
||||
- azureml-widgets~=1.44.0
|
||||
- azureml-widgets~=1.51.0
|
||||
- azureml-defaults~=1.51.0
|
||||
- pytorch-transformers==1.0.0
|
||||
- spacy==2.2.4
|
||||
- pystan==2.19.1.1
|
||||
- https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz
|
||||
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.44.0/validated_linux_requirements.txt [--no-deps]
|
||||
- arch==4.14
|
||||
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.51.0/validated_linux_requirements.txt [--no-deps]
|
||||
|
||||
@@ -5,12 +5,9 @@ channels:
|
||||
- main
|
||||
dependencies:
|
||||
# The python interpreter version.
|
||||
# Currently Azure ML only supports 3.6.0 and later.
|
||||
- pip==20.2.4
|
||||
- nomkl
|
||||
- python>=3.6,<3.9
|
||||
- boto3==1.20.19
|
||||
- botocore<=1.23.19
|
||||
# Currently Azure ML only supports 3.7 and later.
|
||||
- pip==22.3.1
|
||||
- python>=3.7,<3.9
|
||||
- matplotlib==3.2.1
|
||||
- numpy>=1.21.6,<=1.22.3
|
||||
- cython==0.29.14
|
||||
@@ -19,16 +16,17 @@ dependencies:
|
||||
- scikit-learn==0.22.1
|
||||
- py-xgboost<=1.3.3
|
||||
- holidays==0.10.3
|
||||
- conda-forge::fbprophet==0.7.1
|
||||
- pytorch::pytorch=1.4.0
|
||||
- pytorch::pytorch=1.11.0
|
||||
- cudatoolkit=9.0
|
||||
- notebook
|
||||
|
||||
- pip:
|
||||
# Required packages for AzureML execution, history, and data preparation.
|
||||
- azureml-widgets~=1.44.0
|
||||
- azureml-widgets~=1.51.0
|
||||
- azureml-defaults~=1.51.0
|
||||
- pytorch-transformers==1.0.0
|
||||
- spacy==2.2.4
|
||||
- pystan==2.19.1.1
|
||||
- fbprophet==0.7.1
|
||||
- https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz
|
||||
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.44.0/validated_darwin_requirements.txt [--no-deps]
|
||||
- arch==4.14
|
||||
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.51.0/validated_darwin_requirements.txt [--no-deps]
|
||||
|
||||
@@ -33,6 +33,8 @@ if not errorlevel 1 (
|
||||
call conda env create -f %automl_env_file% -n %conda_env_name%
|
||||
)
|
||||
|
||||
python "%conda_prefix%\scripts\pywin32_postinstall.py" -install
|
||||
|
||||
call conda activate %conda_env_name% 2>nul:
|
||||
if errorlevel 1 goto ErrorExit
|
||||
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
from distutils.version import LooseVersion
|
||||
from setuptools._vendor.packaging import version
|
||||
import platform
|
||||
|
||||
try:
|
||||
@@ -17,7 +17,7 @@ if architecture != "64bit":
|
||||
|
||||
minimumVersion = "4.7.8"
|
||||
|
||||
versionInvalid = (LooseVersion(conda.__version__) < LooseVersion(minimumVersion))
|
||||
versionInvalid = (version.parse(conda.__version__) < version.parse(minimumVersion))
|
||||
|
||||
if versionInvalid:
|
||||
print('Setup requires conda version ' + minimumVersion + ' or higher.')
|
||||
|
||||
@@ -712,7 +712,9 @@
|
||||
"from azureml.core.model import Model\n",
|
||||
"from azureml.core.environment import Environment\n",
|
||||
"\n",
|
||||
"inference_config = InferenceConfig(entry_script=script_file_name)\n",
|
||||
"inference_config = InferenceConfig(\n",
|
||||
" environment=best_run.get_environment(), entry_script=script_file_name\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"aciconfig = AciWebservice.deploy_configuration(\n",
|
||||
" cpu_cores=2,\n",
|
||||
@@ -1060,9 +1062,9 @@
|
||||
"name": "python3-azureml"
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -456,9 +456,9 @@
|
||||
"friendly_name": "Classification of credit card fraudulent transactions using Automated ML",
|
||||
"index_order": 5,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -567,9 +567,9 @@
|
||||
"friendly_name": "DNN Text Featurization",
|
||||
"index_order": 2,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -564,9 +564,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -97,7 +97,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"This notebook was created using version 1.44.0 of the Azure ML SDK\")\n",
|
||||
"print(\"This notebook was created using version 1.51.0 of the Azure ML SDK\")\n",
|
||||
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
|
||||
]
|
||||
},
|
||||
@@ -324,9 +324,9 @@
|
||||
"hash": "adb464b67752e4577e3dc163235ced27038d19b7d88def00d75d1975bde5d9ab"
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -97,7 +97,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"This notebook was created using version 1.44.0 of the Azure ML SDK\")\n",
|
||||
"print(\"This notebook was created using version 1.51.0 of the Azure ML SDK\")\n",
|
||||
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
|
||||
]
|
||||
},
|
||||
@@ -454,10 +454,13 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**Note:** Not all datasets produce a y_transformer. The dataset used in the current notebook requires a transformer as the y column data is categorical."
|
||||
"**Note:** Not all datasets produce a y_transformer. The dataset used in the current notebook requires a transformer as the y column data is categorical. \n",
|
||||
"\n",
|
||||
"We will go ahead and download the mlflow transformer model and use it to transform test data that can be used for further experimentation below. To run the commented code, make sure the environment requirement is satisfied. You can go ahead and create the environment from the `conda.yaml` file under `/outputs/featurization/pipeline/` and run the given code in it."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -466,7 +469,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.automl.core.shared.constants import Transformers\n",
|
||||
"''' from azureml.automl.core.shared.constants import Transformers\n",
|
||||
"\n",
|
||||
"transformers = mlflow.sklearn.load_model(uri) # Using method 1\n",
|
||||
"data_transformers = transformers.get_transformers()\n",
|
||||
@@ -474,14 +477,15 @@
|
||||
"y_transformer = data_transformers[Transformers.Y_TRANSFORMER]\n",
|
||||
"\n",
|
||||
"X_test = x_transformer.transform(X_test_data)\n",
|
||||
"y_test = y_transformer.transform(y_test_data)"
|
||||
"y_test = y_transformer.transform(y_test_data) '''"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Run the following cell to see the featurization summary of X and y transformers. "
|
||||
"Run the following cell to see the featurization summary of X and y transformers. Uncomment to use. "
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -490,10 +494,10 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"X_data_summary = x_transformer.get_featurization_summary(is_user_friendly=False)\n",
|
||||
"''' X_data_summary = x_transformer.get_featurization_summary(is_user_friendly=False)\n",
|
||||
"\n",
|
||||
"summary_df = pd.DataFrame.from_records(X_data_summary)\n",
|
||||
"summary_df"
|
||||
"summary_df '''"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -544,10 +548,11 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Another way to load the data is to go to the above autofeaturization experiment and check for the featurized dataset ids under `Output datasets`. Uncomment and replace them accordingly below to use."
|
||||
"Another way to load the data is to go to the above autofeaturization experiment and check for the featurized dataset ids under `Output datasets`. Uncomment and replace them accordingly below, to use."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -597,10 +602,20 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here we are passing our training data to the lightgbm classifier, any custom model can be used with your data."
|
||||
"Here we are passing our training data to the lightgbm classifier, any custom model can be used with your data. Let us first install lightgbm."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"! pip install lightgbm"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -612,11 +627,27 @@
|
||||
"import lightgbm as lgb\n",
|
||||
"\n",
|
||||
"model = lgb.LGBMClassifier(learning_rate=0.08,max_depth=-5,random_state=42)\n",
|
||||
"model.fit(X_train, y_train, sample_weight=sample_weight, eval_set=[(X_test, y_test),(X_train, y_train)],\n",
|
||||
" verbose=20,eval_metric='logloss')\n",
|
||||
"\n",
|
||||
"model.fit(X_train, y_train, sample_weight=sample_weight)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once training is done, the test data obtained after transforming from the above downloaded transformer can be used to calculate the accuracy "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print('Training accuracy {:.4f}'.format(model.score(X_train, y_train)))\n",
|
||||
"print('Testing accuracy {:.4f}'.format(model.score(X_test, y_test)))"
|
||||
"\n",
|
||||
"# Uncomment below to test the model on test data \n",
|
||||
"# print('Testing accuracy {:.4f}'.format(model.score(X_test, y_test)))"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -654,45 +685,8 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"y_pred = model.predict(X_test)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Calculate metrics for the prediction\n",
|
||||
"\n",
|
||||
"Now visualize the data on a scatter plot to show what our truth (actual) values are compared to the predicted values \n",
|
||||
"from the trained model that was returned."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sklearn.metrics import confusion_matrix\n",
|
||||
"from matplotlib import pyplot as plt\n",
|
||||
"import numpy as np\n",
|
||||
"import itertools\n",
|
||||
"\n",
|
||||
"cf =confusion_matrix(y_test,y_pred)\n",
|
||||
"plt.imshow(cf,cmap=plt.cm.Blues,interpolation='nearest')\n",
|
||||
"plt.colorbar()\n",
|
||||
"plt.title('Confusion Matrix')\n",
|
||||
"plt.xlabel('Predicted')\n",
|
||||
"plt.ylabel('Actual')\n",
|
||||
"class_labels = ['False','True']\n",
|
||||
"tick_marks = np.arange(len(class_labels))\n",
|
||||
"plt.xticks(tick_marks,class_labels)\n",
|
||||
"plt.yticks([-0.5,0,1,1.5],['','False','True',''])\n",
|
||||
"# plotting text value inside cells\n",
|
||||
"thresh = cf.max() / 2.\n",
|
||||
"for i,j in itertools.product(range(cf.shape[0]),range(cf.shape[1])):\n",
|
||||
" plt.text(j,i,format(cf[i,j],'d'),horizontalalignment='center',color='white' if cf[i,j] >thresh else 'black')\n",
|
||||
"plt.show()"
|
||||
"# Uncomment below to test the model on test data\n",
|
||||
"# y_pred = model.predict(X_test)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -713,9 +707,9 @@
|
||||
"hash": "adb464b67752e4577e3dc163235ced27038d19b7d88def00d75d1975bde5d9ab"
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -1,21 +1,28 @@
|
||||
name: azure_automl_experimental
|
||||
dependencies:
|
||||
# The python interpreter version.
|
||||
<<<<<<< HEAD
|
||||
# Currently Azure ML only supports 3.6.0 and later.
|
||||
- pip<=20.2.4
|
||||
- python>=3.6.0,<3.9
|
||||
- python>=3.6.0,<3.10
|
||||
- cython==0.29.14
|
||||
- urllib3==1.26.7
|
||||
- PyJWT < 2.0.0
|
||||
- numpy==1.21.6
|
||||
- numpy==1.22.3
|
||||
- pywin32==227
|
||||
- cryptography<37.0.0
|
||||
=======
|
||||
# Currently Azure ML only supports 3.7.0 and later.
|
||||
- pip<=22.3.1
|
||||
- python>=3.7.0,<3.11
|
||||
>>>>>>> 4671acd451ce979c3cebcd3917804861a333b710
|
||||
|
||||
- pip:
|
||||
# Required packages for AzureML execution, history, and data preparation.
|
||||
- azure-core==1.24.1
|
||||
- azure-identity==1.7.0
|
||||
- azureml-defaults
|
||||
- azureml-sdk
|
||||
- azureml-widgets
|
||||
- azureml-mlflow
|
||||
- pandas
|
||||
- mlflow
|
||||
- docker<6.0.0
|
||||
|
||||
@@ -4,14 +4,13 @@ channels:
|
||||
- main
|
||||
dependencies:
|
||||
# The python interpreter version.
|
||||
# Currently Azure ML only supports 3.6.0 and later.
|
||||
# Currently Azure ML only supports 3.7.0 and later.
|
||||
- pip<=20.2.4
|
||||
- nomkl
|
||||
- python>=3.6.0,<3.9
|
||||
- python>=3.7.0,<3.11
|
||||
- urllib3==1.26.7
|
||||
- PyJWT < 2.0.0
|
||||
- numpy>=1.21.6,<=1.22.3
|
||||
- cryptography<37.0.0
|
||||
|
||||
- pip:
|
||||
# Required packages for AzureML execution, history, and data preparation.
|
||||
@@ -20,4 +19,6 @@ dependencies:
|
||||
- azureml-defaults
|
||||
- azureml-sdk
|
||||
- azureml-widgets
|
||||
- azureml-mlflow
|
||||
- pandas
|
||||
- mlflow
|
||||
|
||||
@@ -92,7 +92,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"This notebook was created using version 1.44.0 of the Azure ML SDK\")\n",
|
||||
"print(\"This notebook was created using version 1.51.0 of the Azure ML SDK\")\n",
|
||||
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
|
||||
]
|
||||
},
|
||||
@@ -354,7 +354,7 @@
|
||||
"This Credit Card fraud Detection dataset is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/ and is available at: https://www.kaggle.com/mlg-ulb/creditcardfraud\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Universit\u00c3\u0192\u00c2\u00a9 Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project\n",
|
||||
"The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Universit\u00c3\u0192\u00c2\u00a9 Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net and the page of the DefeatFraud project\n",
|
||||
"Please cite the following works: \n",
|
||||
"\u00c3\u00a2\u00e2\u201a\u00ac\u00c2\u00a2\tAndrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015\n",
|
||||
"\u00c3\u00a2\u00e2\u201a\u00ac\u00c2\u00a2\tDal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon\n",
|
||||
@@ -389,9 +389,9 @@
|
||||
"friendly_name": "Classification of credit card fraudulent transactions using Automated ML",
|
||||
"index_order": 5,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -91,7 +91,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"This notebook was created using version 1.44.0 of the Azure ML SDK\")\n",
|
||||
"print(\"This notebook was created using version 1.51.0 of the Azure ML SDK\")\n",
|
||||
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
|
||||
]
|
||||
},
|
||||
@@ -448,9 +448,9 @@
|
||||
"automated-machine-learning"
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -122,7 +122,10 @@ def calculate_scores_and_build_plots(
|
||||
input_dir: str, output_dir: str, automl_settings: Dict[str, Any]
|
||||
):
|
||||
os.makedirs(output_dir, exist_ok=True)
|
||||
grains = automl_settings.get(constants.TimeSeries.TIME_SERIES_ID_COLUMN_NAMES)
|
||||
grains = automl_settings.get(
|
||||
constants.TimeSeries.TIME_SERIES_ID_COLUMN_NAMES,
|
||||
automl_settings.get(constants.TimeSeries.GRAIN_COLUMN_NAMES, None),
|
||||
)
|
||||
time_column_name = automl_settings.get(constants.TimeSeries.TIME_COLUMN_NAME)
|
||||
if grains is None:
|
||||
grains = []
|
||||
|
||||
@@ -33,6 +33,7 @@
|
||||
"For this notebook we are using a synthetic dataset to demonstrate the back testing in many model scenario. This allows us to check historical performance of AutoML on a historical data. To do that we step back on the backtesting period by the data set several times and split the data to train and test sets. Then these data sets are used for training and evaluation of model.<br>\n",
|
||||
"\n",
|
||||
"Thus, it is a quick way of evaluating AutoML as if it was in production. Here, we do not test historical performance of a particular model, for this see the [notebook](../forecasting-backtest-single-model/auto-ml-forecasting-backtest-single-model.ipynb). Instead, the best model for every backtest iteration can be different since AutoML chooses the best model for a given training set.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**NOTE: There are limits on how many runs we can do in parallel per workspace, and we currently recommend to set the parallelism to maximum of 320 runs per experiment per workspace. If users want to have more parallelism and increase this limit they might encounter Too Many Requests errors (HTTP 429).**"
|
||||
@@ -43,7 +44,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Prerequisites\n",
|
||||
"You'll need to create a compute Instance by following the instructions in the [EnvironmentSetup.md](../Setup_Resources/EnvironmentSetup.md)."
|
||||
"You'll need to create a compute Instance by following [these](https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-create-manage-compute-instance?tabs=python) instructions."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -313,21 +314,37 @@
|
||||
"source": [
|
||||
"### Set up training parameters\n",
|
||||
"\n",
|
||||
"This dictionary defines the AutoML and many models settings. For this forecasting task we need to define several settings including the name of the time column, the maximum forecast horizon, and the partition column name definition. Please note, that in this case we are setting grain_column_names to be the time series ID column plus iteration, because we want to train a separate model for each time series and iteration.\n",
|
||||
"We need to provide ``ForecastingParameters``, ``AutoMLConfig`` and ``ManyModelsTrainParameters`` objects. For the forecasting task we also need to define several settings including the name of the time column, the maximum forecast horizon, and the partition column name(s) definition.\n",
|
||||
"\n",
|
||||
"#### ``ForecastingParameters`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
|
||||
"| **time_column_name** | The name of your time column. |\n",
|
||||
"| **time_series_id_column_names** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
|
||||
"| **cv_step_size** | Number of periods between two consecutive cross-validation folds. The default value is \\\"auto\\\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value. |\n",
|
||||
"\n",
|
||||
"#### ``AutoMLConfig`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **task** | forecasting |\n",
|
||||
"| **primary_metric** | This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>normalized_root_mean_squared_error</i><br><i>normalized_mean_absolute_error</i> |\n",
|
||||
"| **primary_metric** | This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i> |\n",
|
||||
"| **blocked_models** | Blocked models won't be used by AutoML. |\n",
|
||||
"| **iteration_timeout_minutes** | Maximum amount of time in minutes that the model can train. This is optional but provides customers with greater control on exit criteria. |\n",
|
||||
"| **iterations** | Number of models to train. This is optional but provides customers with greater control on exit criteria. |\n",
|
||||
"| **experiment_timeout_hours** | Maximum amount of time in hours that the experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |\n",
|
||||
"| **experiment_timeout_hours** | Maximum amount of time in hours that each experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. **It does not control the overall timeout for the pipeline run, instead controls the timeout for each training run per partitioned time series.** |\n",
|
||||
"| **label_column_name** | The name of the label column. |\n",
|
||||
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
|
||||
"| **n_cross_validations** | Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
|
||||
"| **time_column_name** | The name of your time column. |\n",
|
||||
"| **time_series_id_column_names** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
|
||||
"| **n_cross_validations** | Number of cross validation splits. The default value is \\\"auto\\\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
|
||||
"| **enable_early_stopping** | Flag to enable early termination if the primary metric is no longer improving. |\n",
|
||||
"| **enable_engineered_explanations** | Engineered feature explanations will be downloaded if enable_engineered_explanations flag is set to True. By default it is set to False to save storage space. |\n",
|
||||
"| **track_child_runs** | Flag to disable tracking of child runs. Only best run is tracked if the flag is set to False (this includes the model and metrics of the run). |\n",
|
||||
"| **pipeline_fetch_max_batch_size** | Determines how many pipelines (training algorithms) to fetch at a time for training, this helps reduce throttling when training at large scale. |\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"#### ``ManyModelsTrainParameters`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **automl_settings** | The ``AutoMLConfig`` object defined above. |\n",
|
||||
"| **partition_column_names** | The names of columns used to group your models. For timeseries, the groups must not split up individual time-series. That is, each group must contain one or more whole time-series. |"
|
||||
]
|
||||
},
|
||||
@@ -344,21 +361,30 @@
|
||||
"from azureml.train.automl.runtime._many_models.many_models_parameters import (\n",
|
||||
" ManyModelsTrainParameters,\n",
|
||||
")\n",
|
||||
"from azureml.automl.core.forecasting_parameters import ForecastingParameters\n",
|
||||
"from azureml.train.automl.automlconfig import AutoMLConfig\n",
|
||||
"\n",
|
||||
"partition_column_names = [TIME_SERIES_ID_COLNAME, \"backtest_iteration\"]\n",
|
||||
"automl_settings = {\n",
|
||||
" \"task\": \"forecasting\",\n",
|
||||
" \"primary_metric\": \"normalized_root_mean_squared_error\",\n",
|
||||
" \"iteration_timeout_minutes\": 10, # This needs to be changed based on the dataset. We ask customer to explore how long training is taking before settings this value\n",
|
||||
" \"iterations\": 15,\n",
|
||||
" \"experiment_timeout_hours\": 0.25, # This also needs to be changed based on the dataset. For larger data set this number needs to be bigger.\n",
|
||||
" \"label_column_name\": TARGET_COLNAME,\n",
|
||||
" \"n_cross_validations\": 3,\n",
|
||||
" \"time_column_name\": TIME_COLNAME,\n",
|
||||
" \"forecast_horizon\": 6,\n",
|
||||
" \"time_series_id_column_names\": partition_column_names,\n",
|
||||
" \"track_child_runs\": False,\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"forecasting_parameters = ForecastingParameters(\n",
|
||||
" time_column_name=TIME_COLNAME,\n",
|
||||
" forecast_horizon=6,\n",
|
||||
" time_series_id_column_names=partition_column_names,\n",
|
||||
" cv_step_size=\"auto\",\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"automl_settings = AutoMLConfig(\n",
|
||||
" task=\"forecasting\",\n",
|
||||
" primary_metric=\"normalized_root_mean_squared_error\",\n",
|
||||
" iteration_timeout_minutes=10,\n",
|
||||
" iterations=15,\n",
|
||||
" experiment_timeout_hours=0.25,\n",
|
||||
" label_column_name=TARGET_COLNAME,\n",
|
||||
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
|
||||
" track_child_runs=False,\n",
|
||||
" forecasting_parameters=forecasting_parameters,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"mm_paramters = ManyModelsTrainParameters(\n",
|
||||
" automl_settings=automl_settings, partition_column_names=partition_column_names\n",
|
||||
@@ -385,8 +411,16 @@
|
||||
"| **node_count** | The number of compute nodes to be used for running the user script. We recommend to start with 3 and increase the node_count if the training time is taking too long. |\n",
|
||||
"| **process_count_per_node** | Process count per node, we recommend 2:1 ratio for number of cores: number of processes per node. eg. If node has 16 cores then configure 8 or less process count per node or optimal performance. |\n",
|
||||
"| **train_pipeline_parameters** | The set of configuration parameters defined in the previous section. |\n",
|
||||
"| **run_invocation_timeout** | Maximum amount of time in seconds that the ``ParallelRunStep`` class is allowed. This is optional but provides customers with greater control on exit criteria. This must be greater than ``experiment_timeout_hours`` by at least 300 seconds. |\n",
|
||||
"\n",
|
||||
"Calling this method will create a new aggregated dataset which is generated dynamically on pipeline execution."
|
||||
"Calling this method will create a new aggregated dataset which is generated dynamically on pipeline execution.\n",
|
||||
"\n",
|
||||
"**Note**: Total time taken for the **training step** in the pipeline to complete = $ \\frac{t}{ p \\times n } \\times ts $\n",
|
||||
"where,\n",
|
||||
"- $ t $ is time taken for training one partition (can be viewed in the training logs)\n",
|
||||
"- $ p $ is ``process_count_per_node``\n",
|
||||
"- $ n $ is ``node_count``\n",
|
||||
"- $ ts $ is total number of partitions in time series based on ``partition_column_names``"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -404,7 +438,7 @@
|
||||
" compute_target=compute_target,\n",
|
||||
" node_count=2,\n",
|
||||
" process_count_per_node=2,\n",
|
||||
" run_invocation_timeout=920,\n",
|
||||
" run_invocation_timeout=1200,\n",
|
||||
" train_pipeline_parameters=mm_paramters,\n",
|
||||
")"
|
||||
]
|
||||
@@ -489,25 +523,31 @@
|
||||
"source": [
|
||||
"For many models we need to provide the ManyModelsInferenceParameters object.\n",
|
||||
"\n",
|
||||
"#### ManyModelsInferenceParameters arguments\n",
|
||||
"#### ``ManyModelsInferenceParameters`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **partition_column_names** | List of column names that identifies groups. |\n",
|
||||
"| **target_column_name** | \\[Optional\\] Column name only if the inference dataset has the target. |\n",
|
||||
"| **time_column_name** | Column name only if it is timeseries. |\n",
|
||||
"| **many_models_run_id** | \\[Optional\\] Many models pipeline run id where models were trained. |\n",
|
||||
"| **target_column_name** | \\[Optional] Column name only if the inference dataset has the target. |\n",
|
||||
"| **time_column_name** | \\[Optional] Time column name only if it is timeseries. |\n",
|
||||
"| **inference_type** | \\[Optional] Which inference method to use on the model. Possible values are 'forecast', 'predict_proba', and 'predict'. |\n",
|
||||
"| **forecast_mode** | \\[Optional] The type of forecast to be used, either 'rolling' or 'recursive'; defaults to 'recursive'. |\n",
|
||||
"| **step** | \\[Optional] Number of periods to advance the forecasting window in each iteration **(for rolling forecast only)**; defaults to 1. |\n",
|
||||
"\n",
|
||||
"#### get_many_models_batch_inference_steps arguments\n",
|
||||
"#### ``get_many_models_batch_inference_steps`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **experiment** | The experiment used for inference run. |\n",
|
||||
"| **inference_data** | The data to use for inferencing. It should be the same schema as used for training.\n",
|
||||
"| **compute_target** | The compute target that runs the inference pipeline. |\n",
|
||||
"| **node_count** | The number of compute nodes to be used for running the user script. We recommend to start with the number of cores per node (varies by compute sku). |\n",
|
||||
"| **process_count_per_node** | The number of processes per node.\n",
|
||||
"| **train_run_id** | \\[Optional\\] The run id of the hierarchy training, by default it is the latest successful training many model run in the experiment. |\n",
|
||||
"| **train_experiment_name** | \\[Optional\\] The train experiment that contains the train pipeline. This one is only needed when the train pipeline is not in the same experiement as the inference pipeline. |\n",
|
||||
"| **process_count_per_node** | \\[Optional\\] The number of processes per node, by default it's 4. |"
|
||||
"| **process_count_per_node** | \\[Optional] The number of processes per node. By default it's 2 (should be at most half of the number of cores in a single node of the compute cluster that will be used for the experiment).\n",
|
||||
"| **inference_pipeline_parameters** | \\[Optional] The ``ManyModelsInferenceParameters`` object defined above. |\n",
|
||||
"| **append_row_file_name** | \\[Optional] The name of the output file (optional, default value is 'parallel_run_step.txt'). Supports 'txt' and 'csv' file extension. A 'txt' file extension generates the output in 'txt' format with space as separator without column names. A 'csv' file extension generates the output in 'csv' format with comma as separator and with column names. |\n",
|
||||
"| **train_run_id** | \\[Optional] The run id of the **training pipeline**. By default it is the latest successful training pipeline run in the experiment. |\n",
|
||||
"| **train_experiment_name** | \\[Optional] The train experiment that contains the train pipeline. This one is only needed when the train pipeline is not in the same experiement as the inference pipeline. |\n",
|
||||
"| **run_invocation_timeout** | \\[Optional] Maximum amount of time in seconds that the ``ParallelRunStep`` class is allowed. This is optional but provides customers with greater control on exit criteria. |\n",
|
||||
"| **output_datastore** | \\[Optional] The ``Datastore`` or ``OutputDatasetConfig`` to be used for output. If specified any pipeline output will be written to that location. If unspecified the default datastore will be used. |\n",
|
||||
"| **arguments** | \\[Optional] Arguments to be passed to inference script. Possible argument is '--forecast_quantiles' followed by quantile values. |"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -527,6 +567,8 @@
|
||||
" target_column_name=TARGET_COLNAME,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"output_file_name = \"parallel_run_step.csv\"\n",
|
||||
"\n",
|
||||
"inference_steps = AutoMLPipelineBuilder.get_many_models_batch_inference_steps(\n",
|
||||
" experiment=experiment,\n",
|
||||
" inference_data=test_data,\n",
|
||||
@@ -538,6 +580,7 @@
|
||||
" train_run_id=training_run.id,\n",
|
||||
" train_experiment_name=training_run.experiment.name,\n",
|
||||
" inference_pipeline_parameters=mm_parameters,\n",
|
||||
" append_row_file_name=output_file_name,\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
@@ -585,18 +628,21 @@
|
||||
"source": [
|
||||
"from azureml.contrib.automl.pipeline.steps.utilities import get_output_from_mm_pipeline\n",
|
||||
"\n",
|
||||
"PREDICTION_COLNAME = \"Predictions\"\n",
|
||||
"forecasting_results_name = \"forecasting_results\"\n",
|
||||
"forecasting_output_name = \"many_models_inference_output\"\n",
|
||||
"forecast_file = get_output_from_mm_pipeline(\n",
|
||||
" inference_run, forecasting_results_name, forecasting_output_name\n",
|
||||
" inference_run, forecasting_results_name, forecasting_output_name, output_file_name\n",
|
||||
")\n",
|
||||
"df = pd.read_csv(forecast_file, delimiter=\" \", header=None, parse_dates=[0])\n",
|
||||
"df.columns = list(X_train.columns) + [\"predicted_level\"]\n",
|
||||
"df = pd.read_csv(forecast_file, parse_dates=[0])\n",
|
||||
"print(\n",
|
||||
" \"Prediction has \", df.shape[0], \" rows. Here the first 10 rows are being displayed.\"\n",
|
||||
")\n",
|
||||
"# Save the scv file with header to read it in the next step.\n",
|
||||
"df.rename(columns={TARGET_COLNAME: \"actual_level\"}, inplace=True)\n",
|
||||
"# Save the csv file to read it in the next step.\n",
|
||||
"df.rename(\n",
|
||||
" columns={TARGET_COLNAME: \"actual_level\", PREDICTION_COLNAME: \"predicted_level\"},\n",
|
||||
" inplace=True,\n",
|
||||
")\n",
|
||||
"df.to_csv(os.path.join(forecasting_results_name, \"forecast.csv\"), index=False)\n",
|
||||
"df.head(10)"
|
||||
]
|
||||
@@ -620,7 +666,9 @@
|
||||
"backtesting_results = \"backtesting_mm_results\"\n",
|
||||
"os.makedirs(backtesting_results, exist_ok=True)\n",
|
||||
"calculate_scores_and_build_plots(\n",
|
||||
" forecasting_results_name, backtesting_results, automl_settings\n",
|
||||
" forecasting_results_name,\n",
|
||||
" backtesting_results,\n",
|
||||
" automl_settings.as_serializable_dict(),\n",
|
||||
")\n",
|
||||
"pd.DataFrame({\"File\": os.listdir(backtesting_results)})"
|
||||
]
|
||||
@@ -704,9 +752,9 @@
|
||||
"automated-machine-learning"
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
@@ -718,7 +766,12 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.9"
|
||||
"version": "3.8.5"
|
||||
},
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@@ -43,11 +43,20 @@ def init():
|
||||
global output_dir
|
||||
global automl_settings
|
||||
global model_uid
|
||||
global forecast_quantiles
|
||||
|
||||
logger.info("Initialization of the run.")
|
||||
parser = argparse.ArgumentParser("Parsing input arguments.")
|
||||
parser.add_argument("--output-dir", dest="out", required=True)
|
||||
parser.add_argument("--model-name", dest="model", default=None)
|
||||
parser.add_argument("--model-uid", dest="model_uid", default=None)
|
||||
parser.add_argument(
|
||||
"--forecast_quantiles",
|
||||
nargs="*",
|
||||
type=float,
|
||||
help="forecast quantiles list",
|
||||
default=None,
|
||||
)
|
||||
|
||||
parsed_args, _ = parser.parse_known_args()
|
||||
model_name = parsed_args.model
|
||||
@@ -55,6 +64,7 @@ def init():
|
||||
target_column_name = automl_settings.get("label_column_name")
|
||||
output_dir = parsed_args.out
|
||||
model_uid = parsed_args.model_uid
|
||||
forecast_quantiles = parsed_args.forecast_quantiles
|
||||
os.makedirs(output_dir, exist_ok=True)
|
||||
os.environ["AUTOML_IGNORE_PACKAGE_VERSION_INCOMPATIBILITIES".lower()] = "True"
|
||||
|
||||
@@ -126,23 +136,18 @@ def run_backtest(data_input_name: str, file_name: str, experiment: Experiment):
|
||||
)
|
||||
print(f"The model {best_run.properties['model_name']} was registered.")
|
||||
|
||||
_, x_pred = fitted_model.forecast(X_test)
|
||||
x_pred.reset_index(inplace=True, drop=False)
|
||||
columns = [automl_settings[constants.TimeSeries.TIME_COLUMN_NAME]]
|
||||
if automl_settings.get(constants.TimeSeries.GRAIN_COLUMN_NAMES):
|
||||
# We know that fitted_model.grain_column_names is a list.
|
||||
columns.extend(fitted_model.grain_column_names)
|
||||
columns.append(constants.TimeSeriesInternal.DUMMY_TARGET_COLUMN)
|
||||
# Remove featurized columns.
|
||||
x_pred = x_pred[columns]
|
||||
x_pred.rename(
|
||||
{constants.TimeSeriesInternal.DUMMY_TARGET_COLUMN: "predicted_level"},
|
||||
axis=1,
|
||||
inplace=True,
|
||||
)
|
||||
# By default we will have forecast quantiles of 0.5, which is our target
|
||||
if forecast_quantiles:
|
||||
if 0.5 not in forecast_quantiles:
|
||||
forecast_quantiles.append(0.5)
|
||||
fitted_model.quantiles = forecast_quantiles
|
||||
|
||||
x_pred = fitted_model.forecast_quantiles(X_test)
|
||||
x_pred["actual_level"] = y_test
|
||||
x_pred["backtest_iteration"] = f"iteration_{last_training_date}"
|
||||
x_pred.rename({0.5: "predicted_level"}, axis=1, inplace=True)
|
||||
date_safe = RE_INVALID_SYMBOLS.sub("_", last_training_date)
|
||||
|
||||
x_pred.to_csv(os.path.join(output_dir, f"iteration_{date_safe}.csv"), index=False)
|
||||
return x_pred
|
||||
|
||||
|
||||
@@ -283,7 +283,8 @@
|
||||
"| **experiment_timeout_hours** | Maximum amount of time in hours that the experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |\n",
|
||||
"| **label_column_name** | The name of the label column. |\n",
|
||||
"| **max_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
|
||||
"| **n_cross_validations** | Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
|
||||
"| **n_cross_validations** | Number of cross validation splits. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
|
||||
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value.\n",
|
||||
"| **time_column_name** | The name of your time column. |\n",
|
||||
"| **grain_column_names** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |"
|
||||
]
|
||||
@@ -301,7 +302,8 @@
|
||||
" \"iterations\": 15,\n",
|
||||
" \"experiment_timeout_hours\": 1, # This also needs to be changed based on the dataset. For larger data set this number needs to be bigger.\n",
|
||||
" \"label_column_name\": LABEL_COLUMN_NAME,\n",
|
||||
" \"n_cross_validations\": 3,\n",
|
||||
" \"n_cross_validations\": \"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
|
||||
" \"cv_step_size\": \"auto\",\n",
|
||||
" \"time_column_name\": TIME_COLUMN_NAME,\n",
|
||||
" \"max_horizon\": FORECAST_HORIZON,\n",
|
||||
" \"track_child_runs\": False,\n",
|
||||
@@ -363,6 +365,7 @@
|
||||
" step_size=BACKTESTING_PERIOD,\n",
|
||||
" step_number=NUMBER_OF_BACKTESTS,\n",
|
||||
" model_uid=model_uid,\n",
|
||||
" forecast_quantiles=[0.025, 0.975], # Optional\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
@@ -588,6 +591,7 @@
|
||||
" step_size=BACKTESTING_PERIOD,\n",
|
||||
" step_number=NUMBER_OF_BACKTESTS,\n",
|
||||
" model_name=model_name,\n",
|
||||
" forecast_quantiles=[0.025, 0.975],\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
@@ -698,9 +702,9 @@
|
||||
"Azure ML AutoML"
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
@@ -712,7 +716,12 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.9"
|
||||
"version": "3.8.5"
|
||||
},
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@@ -31,6 +31,7 @@ def get_backtest_pipeline(
|
||||
step_number: int,
|
||||
model_name: Optional[str] = None,
|
||||
model_uid: Optional[str] = None,
|
||||
forecast_quantiles: Optional[list] = None,
|
||||
) -> Pipeline:
|
||||
"""
|
||||
:param experiment: The experiment used to run the pipeline.
|
||||
@@ -44,6 +45,7 @@ def get_backtest_pipeline(
|
||||
:param step_size: The number of periods to step back in backtesting.
|
||||
:param step_number: The number of backtesting iterations.
|
||||
:param model_uid: The uid to mark models from this run of the experiment.
|
||||
:param forecast_quantiles: The forecast quantiles that are required in the inference.
|
||||
:return: The pipeline to be used for model retraining.
|
||||
**Note:** The output will be uploaded in the pipeline output
|
||||
called 'score'.
|
||||
@@ -135,6 +137,9 @@ def get_backtest_pipeline(
|
||||
if model_uid is not None:
|
||||
prs_args.append("--model-uid")
|
||||
prs_args.append(model_uid)
|
||||
if forecast_quantiles:
|
||||
prs_args.append("--forecast_quantiles")
|
||||
prs_args.extend(forecast_quantiles)
|
||||
backtest_prs = ParallelRunStep(
|
||||
name=parallel_step_name,
|
||||
parallel_run_config=back_test_config,
|
||||
|
||||
@@ -1,6 +1,7 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -10,6 +11,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -17,6 +19,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -34,6 +37,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -42,7 +46,7 @@
|
||||
"\n",
|
||||
"AutoML highlights here include built-in holiday featurization, accessing engineered feature names, and working with the `forecast` function. Please also look at the additional forecasting notebooks, which document lagging, rolling windows, forecast quantiles, other ways to use the forecast function, and forecaster deployment.\n",
|
||||
"\n",
|
||||
"Make sure you have executed the [configuration notebook](../../../configuration.ipynb) before running this notebook.\n",
|
||||
"Make sure you have executed the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) before running this notebook.\n",
|
||||
"\n",
|
||||
"Notebook synopsis:\n",
|
||||
"1. Creating an Experiment in an existing Workspace\n",
|
||||
@@ -52,6 +56,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -61,7 +66,11 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"metadata": {
|
||||
"gather": {
|
||||
"logged": 1680248038565
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import json\n",
|
||||
@@ -77,6 +86,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -93,6 +103,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -126,6 +137,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -165,30 +177,12 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Data\n",
|
||||
"\n",
|
||||
"The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace) is paired with the storage account, which contains the default data store. We will use it to upload the bike share data and create [tabular dataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into tabular representation."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"datastore = ws.get_default_datastore()\n",
|
||||
"datastore.upload_files(\n",
|
||||
" files=[\"./bike-no.csv\"], target_path=\"dataset/\", overwrite=True, show_progress=True\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Let's set up what we know about the dataset. \n",
|
||||
"\n",
|
||||
"**Target column** is what we want to forecast.\n",
|
||||
@@ -207,24 +201,51 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"nteract": {
|
||||
"transient": {
|
||||
"deleting": false
|
||||
}
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"dataset = Dataset.Tabular.from_delimited_files(\n",
|
||||
" path=[(datastore, \"dataset/bike-no.csv\")]\n",
|
||||
").with_timestamp_columns(fine_grain_timestamp=time_column_name)\n",
|
||||
"\n",
|
||||
"# Drop the columns 'casual' and 'registered' as these columns are a breakdown of the total and therefore a leak.\n",
|
||||
"dataset = dataset.drop_columns(columns=[\"casual\", \"registered\"])\n",
|
||||
"\n",
|
||||
"dataset.take(5).to_pandas_dataframe().reset_index(drop=True)"
|
||||
"You are now ready to load the historical bike share data. We will load the CSV file into a plain pandas DataFrame."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"jupyter": {
|
||||
"outputs_hidden": false,
|
||||
"source_hidden": false
|
||||
},
|
||||
"nteract": {
|
||||
"transient": {
|
||||
"deleting": false
|
||||
}
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"all_data = pd.read_csv(\"bike-no.csv\", parse_dates=[time_column_name])\n",
|
||||
"\n",
|
||||
"# Drop the columns 'casual' and 'registered' as these columns are a breakdown of the total and therefore a leak.\n",
|
||||
"all_data.drop([\"casual\", \"registered\"], axis=1, inplace=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"metadata": {
|
||||
"nteract": {
|
||||
"transient": {
|
||||
"deleting": false
|
||||
}
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"### Split the data\n",
|
||||
"\n",
|
||||
@@ -234,25 +255,68 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"metadata": {
|
||||
"gather": {
|
||||
"logged": 1680247376789
|
||||
},
|
||||
"jupyter": {
|
||||
"outputs_hidden": false,
|
||||
"source_hidden": false
|
||||
},
|
||||
"nteract": {
|
||||
"transient": {
|
||||
"deleting": false
|
||||
}
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# select data that occurs before a specified date\n",
|
||||
"train = dataset.time_before(datetime(2012, 8, 31), include_boundary=True)\n",
|
||||
"train.to_pandas_dataframe().tail(5).reset_index(drop=True)"
|
||||
"train = all_data[all_data[time_column_name] <= pd.Timestamp(\"2012-08-31\")].copy()\n",
|
||||
"test = all_data[all_data[time_column_name] >= pd.Timestamp(\"2012-09-01\")].copy()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Upload data to datastore\n",
|
||||
"\n",
|
||||
"The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace) is paired with the storage account, which contains the default data store. We will use it to upload the bike share data and create [tabular dataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into tabular representation."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"metadata": {
|
||||
"jupyter": {
|
||||
"outputs_hidden": false,
|
||||
"source_hidden": false
|
||||
},
|
||||
"nteract": {
|
||||
"transient": {
|
||||
"deleting": false
|
||||
}
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"test = dataset.time_after(datetime(2012, 9, 1), include_boundary=True)\n",
|
||||
"test.to_pandas_dataframe().head(5).reset_index(drop=True)"
|
||||
"from azureml.data.dataset_factory import TabularDatasetFactory\n",
|
||||
"\n",
|
||||
"datastore = ws.get_default_datastore()\n",
|
||||
"\n",
|
||||
"train_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
|
||||
" train, target=(datastore, \"dataset/\"), name=\"bike_no_train\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"test_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
|
||||
" test, target=(datastore, \"dataset/\"), name=\"bike_no_test\"\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -265,10 +329,12 @@
|
||||
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
|
||||
"|**country_or_region_for_holidays**|The country/region used to generate holiday features. These should be ISO 3166 two-letter country/region codes (i.e. 'US', 'GB').|\n",
|
||||
"|**target_lags**|The target_lags specifies how far back we will construct the lags of the target variable.|\n",
|
||||
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
|
||||
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n",
|
||||
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value."
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -285,7 +351,7 @@
|
||||
"|**training_data**|Input dataset, containing both features and label column.|\n",
|
||||
"|**label_column_name**|The name of the label column.|\n",
|
||||
"|**compute_target**|The remote compute for training.|\n",
|
||||
"|**n_cross_validations**|Number of cross validation splits.|\n",
|
||||
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value.\n",
|
||||
"|**enable_early_stopping**|If early stopping is on, training will stop when the primary metric is no longer improving.|\n",
|
||||
"|**forecasting_parameters**|A class that holds all the forecasting related parameters.|\n",
|
||||
"\n",
|
||||
@@ -293,6 +359,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -311,6 +378,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -330,6 +398,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -350,6 +419,7 @@
|
||||
" country_or_region_for_holidays=\"US\", # set country_or_region will trigger holiday featurizer\n",
|
||||
" target_lags=\"auto\", # use heuristic based lag setting\n",
|
||||
" freq=\"D\", # Set the forecast frequency to be daily\n",
|
||||
" cv_step_size=\"auto\",\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"automl_config = AutoMLConfig(\n",
|
||||
@@ -358,11 +428,11 @@
|
||||
" featurization=featurization_config,\n",
|
||||
" blocked_models=[\"ExtremeRandomTrees\"],\n",
|
||||
" experiment_timeout_hours=0.3,\n",
|
||||
" training_data=train,\n",
|
||||
" training_data=train_dataset,\n",
|
||||
" label_column_name=target_column_name,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" enable_early_stopping=True,\n",
|
||||
" n_cross_validations=3,\n",
|
||||
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
|
||||
" max_concurrent_iterations=4,\n",
|
||||
" max_cores_per_iteration=-1,\n",
|
||||
" verbosity=logging.INFO,\n",
|
||||
@@ -371,6 +441,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -396,6 +467,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -414,6 +486,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -439,6 +512,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -482,6 +556,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -489,6 +564,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -507,6 +583,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -529,6 +606,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -544,7 +622,7 @@
|
||||
"from run_forecast import run_rolling_forecast\n",
|
||||
"\n",
|
||||
"remote_run = run_rolling_forecast(\n",
|
||||
" test_experiment, compute_target, best_run, test, target_column_name\n",
|
||||
" test_experiment, compute_target, best_run, test_dataset, target_column_name\n",
|
||||
")\n",
|
||||
"remote_run"
|
||||
]
|
||||
@@ -559,6 +637,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -573,7 +652,34 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"remote_run.download_file(\"outputs/predictions.csv\", \"predictions.csv\")\n",
|
||||
"df_all = pd.read_csv(\"predictions.csv\")"
|
||||
"fcst_df = pd.read_csv(\"predictions.csv\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Note that the rolling forecast can contain multiple predictions for each date, each from a different forecast origin. For example, consider 2012-09-05:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fcst_df[fcst_df.date == \"2012-09-05\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here, the forecast origin refers to the latest date of actuals available for a given forecast. The earliest origin in the rolling forecast, 2012-08-31, is the last day in the training data. For origin date 2012-09-01, the forecasts use actual recorded counts from the training data *and* the actual count recorded on 2012-09-01. Note that the model is not retrained for origin dates later than 2012-08-31, but the values for model features, such as lagged values of daily count, are updated.\n",
|
||||
"\n",
|
||||
"Let's calculate the metrics over all rolling forecasts:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -585,67 +691,36 @@
|
||||
"from azureml.automl.core.shared import constants\n",
|
||||
"from azureml.automl.runtime.shared.score import scoring\n",
|
||||
"from sklearn.metrics import mean_absolute_error, mean_squared_error\n",
|
||||
"from matplotlib import pyplot as plt\n",
|
||||
"\n",
|
||||
"# use automl metrics module\n",
|
||||
"scores = scoring.score_regression(\n",
|
||||
" y_test=df_all[target_column_name],\n",
|
||||
" y_pred=df_all[\"predicted\"],\n",
|
||||
" y_test=fcst_df[target_column_name],\n",
|
||||
" y_pred=fcst_df[\"predicted\"],\n",
|
||||
" metrics=list(constants.Metric.SCALAR_REGRESSION_SET),\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print(\"[Test data scores]\\n\")\n",
|
||||
"for key, value in scores.items():\n",
|
||||
" print(\"{}: {:.3f}\".format(key, value))\n",
|
||||
"\n",
|
||||
"# Plot outputs\n",
|
||||
"%matplotlib inline\n",
|
||||
"test_pred = plt.scatter(df_all[target_column_name], df_all[\"predicted\"], color=\"b\")\n",
|
||||
"test_test = plt.scatter(\n",
|
||||
" df_all[target_column_name], df_all[target_column_name], color=\"g\"\n",
|
||||
")\n",
|
||||
"plt.legend(\n",
|
||||
" (test_pred, test_test), (\"prediction\", \"truth\"), loc=\"upper left\", fontsize=8\n",
|
||||
")\n",
|
||||
"plt.show()"
|
||||
" print(\"{}: {:.3f}\".format(key, value))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For more details on what metrics are included and how they are calculated, please refer to [supported metrics](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#regressionforecasting-metrics). You could also calculate residuals, like described [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#residuals).\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Since we did a rolling evaluation on the test set, we can analyze the predictions by their forecast horizon relative to the rolling origin. The model was initially trained at a forecast horizon of 14, so each prediction from the model is associated with a horizon value from 1 to 14. The horizon values are in a column named, \"horizon_origin,\" in the prediction set. For example, we can calculate some of the error metrics grouped by the horizon:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from metrics_helper import MAPE, APE\n",
|
||||
"\n",
|
||||
"df_all.groupby(\"horizon_origin\").apply(\n",
|
||||
" lambda df: pd.Series(\n",
|
||||
" {\n",
|
||||
" \"MAPE\": MAPE(df[target_column_name], df[\"predicted\"]),\n",
|
||||
" \"RMSE\": np.sqrt(\n",
|
||||
" mean_squared_error(df[target_column_name], df[\"predicted\"])\n",
|
||||
" ),\n",
|
||||
" \"MAE\": mean_absolute_error(df[target_column_name], df[\"predicted\"]),\n",
|
||||
" }\n",
|
||||
" )\n",
|
||||
")"
|
||||
"The rolling forecast metric values are very high in comparison to the validation metrics reported by the AutoML job. What's going on here? We will investigate in the following cells!"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To drill down more, we can look at the distributions of APE (absolute percentage error) by horizon. From the chart, it is clear that the overall MAPE is being skewed by one particular point where the actual value is of small absolute value."
|
||||
"### Forecast versus actuals plot\n",
|
||||
"We will plot predictions and actuals on a time series plot. Since there are many forecasts for each date, we select the 14-day-ahead forecast from each forecast origin for our comparison."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -654,21 +729,56 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df_all_APE = df_all.assign(APE=APE(df_all[target_column_name], df_all[\"predicted\"]))\n",
|
||||
"APEs = [\n",
|
||||
" df_all_APE[df_all[\"horizon_origin\"] == h].APE.values\n",
|
||||
" for h in range(1, forecast_horizon + 1)\n",
|
||||
"]\n",
|
||||
"from matplotlib import pyplot as plt\n",
|
||||
"\n",
|
||||
"%matplotlib inline\n",
|
||||
"plt.boxplot(APEs)\n",
|
||||
"plt.yscale(\"log\")\n",
|
||||
"plt.xlabel(\"horizon\")\n",
|
||||
"plt.ylabel(\"APE (%)\")\n",
|
||||
"plt.title(\"Absolute Percentage Errors by Forecast Horizon\")\n",
|
||||
"\n",
|
||||
"fcst_df_h14 = (\n",
|
||||
" fcst_df.groupby(\"forecast_origin\", as_index=False)\n",
|
||||
" .last()\n",
|
||||
" .drop(columns=[\"forecast_origin\"])\n",
|
||||
")\n",
|
||||
"fcst_df_h14.set_index(time_column_name, inplace=True)\n",
|
||||
"plt.plot(fcst_df_h14[[target_column_name, \"predicted\"]])\n",
|
||||
"plt.xticks(rotation=45)\n",
|
||||
"plt.title(f\"Predicted vs. Actuals\")\n",
|
||||
"plt.legend([\"actual\", \"14-day-ahead forecast\"])\n",
|
||||
"plt.show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Looking at the plot, there are two clear issues:\n",
|
||||
"1. An anomalously low count value on October 29th, 2012.\n",
|
||||
"2. End-of-year holidays (Thanksgiving and Christmas) in late November and late December.\n",
|
||||
"\n",
|
||||
"What happened on Oct. 29th, 2012? That day, Hurricane Sandy brought severe storm surge flooding to the east coast of the United States, particularly around New York City. This is certainly an anomalous event that the model did not account for!\n",
|
||||
"\n",
|
||||
"As for the late year holidays, the model apparently did not learn to account for the full reduction of bike share rentals on these major holidays. The training data covers 2011 and early 2012, so the model fit only had access to a single occurrence of these holidays. This makes it challenging to resolve holiday effects; however, a larger AutoML model search may result in a better model that is more holiday-aware.\n",
|
||||
"\n",
|
||||
"If we filter the predictions prior to the Thanksgiving holiday and remove the anomalous day of 2012-10-29, the metrics are closer to validation levels:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"date_filter = (fcst_df.date != \"2012-10-29\") & (fcst_df.date < \"2012-11-22\")\n",
|
||||
"scores = scoring.score_regression(\n",
|
||||
" y_test=fcst_df[date_filter][target_column_name],\n",
|
||||
" y_pred=fcst_df[date_filter][\"predicted\"],\n",
|
||||
" metrics=list(constants.Metric.SCALAR_REGRESSION_SET),\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print(\"[Test data scores (filtered)]\\n\")\n",
|
||||
"for key, value in scores.items():\n",
|
||||
" print(\"{}: {:.3f}\".format(key, value))"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
@@ -694,10 +804,13 @@
|
||||
],
|
||||
"friendly_name": "Forecasting BikeShare Demand",
|
||||
"index_order": 1,
|
||||
"kernel_info": {
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
@@ -709,17 +822,30 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.7"
|
||||
"version": "3.10.9"
|
||||
},
|
||||
"microsoft": {
|
||||
"ms_spell_check": {
|
||||
"ms_spell_check_language": "en"
|
||||
}
|
||||
},
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"npconvert_exporter": "python",
|
||||
"nteract": {
|
||||
"version": "nteract-front-end@1.0.0"
|
||||
},
|
||||
"pygments_lexer": "ipython3",
|
||||
"tags": [
|
||||
"Forecasting"
|
||||
],
|
||||
"task": "Forecasting",
|
||||
"version": 3
|
||||
"version": 3,
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
|
||||
@@ -36,18 +36,18 @@ y_test_df = (
|
||||
|
||||
fitted_model = joblib.load("model.pkl")
|
||||
|
||||
y_pred, X_trans = fitted_model.rolling_evaluation(X_test_df, y_test_df.values)
|
||||
X_rf = fitted_model.rolling_forecast(X_test_df, y_test_df.values, step=1)
|
||||
|
||||
# Add predictions, actuals, and horizon relative to rolling origin to the test feature data
|
||||
assign_dict = {
|
||||
"horizon_origin": X_trans["horizon_origin"].values,
|
||||
"predicted": y_pred,
|
||||
target_column_name: y_test_df[target_column_name].values,
|
||||
fitted_model.forecast_origin_column_name: "forecast_origin",
|
||||
fitted_model.forecast_column_name: "predicted",
|
||||
fitted_model.actual_column_name: target_column_name,
|
||||
}
|
||||
df_all = X_test_df.assign(**assign_dict)
|
||||
X_rf.rename(columns=assign_dict, inplace=True)
|
||||
|
||||
file_name = "outputs/predictions.csv"
|
||||
export_csv = df_all.to_csv(file_name, header=True)
|
||||
export_csv = X_rf.to_csv(file_name, header=True)
|
||||
|
||||
# Upload the predictions into artifacts
|
||||
run.upload_file(name=file_name, path_or_stream=file_name)
|
||||
|
||||
@@ -2,23 +2,22 @@
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
|
||||
"\n",
|
||||
"Licensed under the MIT License."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Automated Machine Learning\n",
|
||||
"_**Forecasting using the Energy Demand Dataset**_\n",
|
||||
@@ -33,17 +32,17 @@
|
||||
"Advanced Forecasting\n",
|
||||
"1. [Advanced Training](#advanced_training)\n",
|
||||
"1. [Advanced Results](#advanced_results)"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Introduction<a id=\"introduction\"></a>\n",
|
||||
"\n",
|
||||
"In this example we use the associated New York City energy demand dataset to showcase how you can use AutoML for a simple forecasting problem and explore the results. The goal is predict the energy demand for the next 48 hours based on historic time-series data.\n",
|
||||
"\n",
|
||||
"If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) first, if you haven't already, to establish your connection to the AzureML Workspace.\n",
|
||||
"If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) first, if you haven't already, to establish your connection to the AzureML Workspace.\n",
|
||||
"\n",
|
||||
"In this notebook you will learn how to:\n",
|
||||
"1. Creating an Experiment using an existing Workspace\n",
|
||||
@@ -53,20 +52,18 @@
|
||||
"1. Generate the forecast and compute the out-of-sample accuracy metrics\n",
|
||||
"1. Configuration and remote run of AutoML for a time-series model with lag and rolling window features\n",
|
||||
"1. Run and explore the forecast with lagging features"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Setup<a id=\"setup\"></a>"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"import logging\n",
|
||||
@@ -85,36 +82,36 @@
|
||||
"from azureml.core import Experiment, Workspace, Dataset\n",
|
||||
"from azureml.train.automl import AutoMLConfig\n",
|
||||
"from datetime import datetime"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This notebook is compatible with Azure ML SDK version 1.35.0 or later."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ws = Workspace.from_config()\n",
|
||||
"\n",
|
||||
@@ -136,11 +133,13 @@
|
||||
"pd.set_option(\"display.max_colwidth\", None)\n",
|
||||
"outputDf = pd.DataFrame(data=output, index=[\"\"])\n",
|
||||
"outputDf.T"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Create or Attach existing AmlCompute\n",
|
||||
"A compute target is required to execute a remote Automated ML run. \n",
|
||||
@@ -150,13 +149,11 @@
|
||||
"#### Creation of AmlCompute takes approximately 5 minutes. \n",
|
||||
"If the AmlCompute with that name is already in your workspace this code will skip the creation process.\n",
|
||||
"As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
|
||||
"from azureml.core.compute_target import ComputeTargetException\n",
|
||||
@@ -175,22 +172,24 @@
|
||||
" compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)\n",
|
||||
"\n",
|
||||
"compute_target.wait_for_completion(show_output=True)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Data<a id=\"data\"></a>\n",
|
||||
"\n",
|
||||
"We will use energy consumption [data from New York City](http://mis.nyiso.com/public/P-58Blist.htm) for model training. The data is stored in a tabular format and includes energy demand and basic weather data at an hourly frequency. \n",
|
||||
"\n",
|
||||
"With Azure Machine Learning datasets you can keep a single copy of data in your storage, easily access data during model training, share data and collaborate with other users. Below, we will upload the datatset and create a [tabular dataset](https://docs.microsoft.com/bs-latn-ba/azure/machine-learning/service/how-to-create-register-datasets#dataset-types) to be used training and prediction."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Let's set up what we know about the dataset.\n",
|
||||
"\n",
|
||||
@@ -198,86 +197,122 @@
|
||||
"<b>Time column</b> is the time axis along which to predict.\n",
|
||||
"\n",
|
||||
"The other columns, \"temp\" and \"precip\", are implicitly designated as features."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"target_column_name = \"demand\"\n",
|
||||
"time_column_name = \"timeStamp\""
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dataset = Dataset.Tabular.from_delimited_files(\n",
|
||||
" path=\"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/nyc_energy.csv\"\n",
|
||||
").with_timestamp_columns(fine_grain_timestamp=time_column_name)\n",
|
||||
"dataset.take(5).to_pandas_dataframe().reset_index(drop=True)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The NYC Energy dataset is missing energy demand values for all datetimes later than August 10th, 2017 5AM. Below, we trim the rows containing these missing values from the end of the dataset."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Cut off the end of the dataset due to large number of nan values\n",
|
||||
"dataset = dataset.time_before(datetime(2017, 10, 10, 5))"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Split the data into train and test sets"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The first split we make is into train and test sets. Note that we are splitting on time. Data before and including August 8th, 2017 5AM will be used for training, and data after will be used for testing."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# split into train based on time\n",
|
||||
"train = dataset.time_before(datetime(2017, 8, 8, 5), include_boundary=True)\n",
|
||||
"train.to_pandas_dataframe().reset_index(drop=True).sort_values(time_column_name).tail(5)"
|
||||
]
|
||||
"train = (\n",
|
||||
" dataset.time_before(datetime(2017, 8, 8, 5), include_boundary=True)\n",
|
||||
" .to_pandas_dataframe()\n",
|
||||
" .reset_index(drop=True)\n",
|
||||
")\n",
|
||||
"train.sort_values(time_column_name).tail(5)"
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# split into test based on time\n",
|
||||
"test = dataset.time_between(datetime(2017, 8, 8, 6), datetime(2017, 8, 10, 5))\n",
|
||||
"test.to_pandas_dataframe().reset_index(drop=True).head(5)"
|
||||
]
|
||||
"test = (\n",
|
||||
" dataset.time_between(datetime(2017, 8, 8, 6), datetime(2017, 8, 10, 5))\n",
|
||||
" .to_pandas_dataframe()\n",
|
||||
" .reset_index(drop=True)\n",
|
||||
")\n",
|
||||
"test.head(5)"
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"# register the splitted train and test data in workspace storage\n",
|
||||
"from azureml.data.dataset_factory import TabularDatasetFactory\n",
|
||||
"\n",
|
||||
"datastore = ws.get_default_datastore()\n",
|
||||
"train_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
|
||||
" train, target=(datastore, \"dataset/\"), name=\"nyc_energy_train\"\n",
|
||||
")\n",
|
||||
"test_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
|
||||
" test, target=(datastore, \"dataset/\"), name=\"nyc_energy_test\"\n",
|
||||
")"
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"jupyter": {
|
||||
"source_hidden": false,
|
||||
"outputs_hidden": false
|
||||
},
|
||||
"nteract": {
|
||||
"transient": {
|
||||
"deleting": false
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Setting the maximum forecast horizon\n",
|
||||
"\n",
|
||||
@@ -286,20 +321,20 @@
|
||||
"Learn more about forecast horizons in our [Auto-train a time-series forecast model](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-auto-train-forecast#configure-and-run-experiment) guide.\n",
|
||||
"\n",
|
||||
"In this example, we set the horizon to 48 hours."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"forecast_horizon = 48"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Forecasting Parameters\n",
|
||||
"To define forecasting parameters for your experiment training, you can leverage the ForecastingParameters class. The table below details the forecasting parameter we will be passing into our experiment.\n",
|
||||
@@ -308,12 +343,13 @@
|
||||
"|-|-|\n",
|
||||
"|**time_column_name**|The name of your time column.|\n",
|
||||
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
|
||||
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
|
||||
]
|
||||
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n",
|
||||
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value."
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Train<a id=\"train\"></a>\n",
|
||||
"\n",
|
||||
@@ -328,23 +364,21 @@
|
||||
"|**training_data**|The training data to be used within the experiment.|\n",
|
||||
"|**label_column_name**|The name of the label column.|\n",
|
||||
"|**compute_target**|The remote compute for training.|\n",
|
||||
"|**n_cross_validations**|Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way.|\n",
|
||||
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value.\n",
|
||||
"|**enable_early_stopping**|Flag to enble early termination if the score is not improving in the short term.|\n",
|
||||
"|**forecasting_parameters**|A class holds all the forecasting related parameters.|\n"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This notebook uses the blocked_models parameter to exclude some models that take a longer time to train on this dataset. You can choose to remove models from the blocked_models list but you may need to increase the experiment_timeout_hours parameter value to get results."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.automl.core.forecasting_parameters import ForecastingParameters\n",
|
||||
"\n",
|
||||
@@ -352,6 +386,7 @@
|
||||
" time_column_name=time_column_name,\n",
|
||||
" forecast_horizon=forecast_horizon,\n",
|
||||
" freq=\"H\", # Set the forecast frequency to be hourly\n",
|
||||
" cv_step_size=\"auto\",\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"automl_config = AutoMLConfig(\n",
|
||||
@@ -359,73 +394,73 @@
|
||||
" primary_metric=\"normalized_root_mean_squared_error\",\n",
|
||||
" blocked_models=[\"ExtremeRandomTrees\", \"AutoArima\", \"Prophet\"],\n",
|
||||
" experiment_timeout_hours=0.3,\n",
|
||||
" training_data=train,\n",
|
||||
" training_data=train_dataset,\n",
|
||||
" label_column_name=target_column_name,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" enable_early_stopping=True,\n",
|
||||
" n_cross_validations=3,\n",
|
||||
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
|
||||
" verbosity=logging.INFO,\n",
|
||||
" forecasting_parameters=forecasting_parameters,\n",
|
||||
")"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Call the `submit` method on the experiment object and pass the run configuration. Depending on the data and the number of iterations this can run for a while.\n",
|
||||
"One may specify `show_output = True` to print currently running iterations to the console."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"remote_run = experiment.submit(automl_config, show_output=False)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"remote_run.wait_for_completion()"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Retrieve the Best Run details\n",
|
||||
"Below we retrieve the best Run object from among all the runs in the experiment."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"best_run = remote_run.get_best_child()\n",
|
||||
"best_run"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Featurization\n",
|
||||
"We can look at the engineered feature names generated in time-series featurization via. the JSON file named 'engineered_feature_names.json' under the run outputs."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Download the JSON file locally\n",
|
||||
"best_run.download_file(\n",
|
||||
@@ -435,11 +470,13 @@
|
||||
" records = json.load(f)\n",
|
||||
"\n",
|
||||
"records"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### View featurization summary\n",
|
||||
"You can also see what featurization steps were performed on different raw features in the user data. For each raw feature in the user data, the following information is displayed:\n",
|
||||
@@ -449,13 +486,11 @@
|
||||
"+ Type detected\n",
|
||||
"+ If feature was dropped\n",
|
||||
"+ List of feature transformations for the raw feature"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Download the featurization summary JSON file locally\n",
|
||||
"best_run.download_file(\n",
|
||||
@@ -477,41 +512,41 @@
|
||||
" \"Transformations\",\n",
|
||||
" ]\n",
|
||||
"]"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Forecasting<a id=\"forecast\"></a>\n",
|
||||
"\n",
|
||||
"Now that we have retrieved the best pipeline/model, it can be used to make predictions on test data. We will do batch scoring on the test dataset which should have the same schema as training dataset.\n",
|
||||
"\n",
|
||||
"The inference will run on a remote compute. In this example, it will re-use the training compute."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"test_experiment = Experiment(ws, experiment_name + \"_inference\")"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Retrieving forecasts from the model\n",
|
||||
"We have created a function called `run_forecast` that submits the test data to the best model determined during the training run and retrieves forecasts. This function uses a helper script `forecasting_script` which is uploaded and expecuted on the remote compute."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from run_forecast import run_remote_inference\n",
|
||||
"\n",
|
||||
@@ -519,39 +554,39 @@
|
||||
" test_experiment=test_experiment,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" train_run=best_run,\n",
|
||||
" test_dataset=test,\n",
|
||||
" test_dataset=test_dataset,\n",
|
||||
" target_column_name=target_column_name,\n",
|
||||
")\n",
|
||||
"remote_run_infer.wait_for_completion(show_output=False)\n",
|
||||
"\n",
|
||||
"# download the inference output file to the local machine\n",
|
||||
"remote_run_infer.download_file(\"outputs/predictions.csv\", \"predictions.csv\")"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Evaluate\n",
|
||||
"To evaluate the accuracy of the forecast, we'll compare against the actual sales quantities for some select metrics, included the mean absolute percentage error (MAPE). For more metrics that can be used for evaluation after training, please see [supported metrics](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#regressionforecasting-metrics), and [how to calculate residuals](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#residuals)."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# load forecast data frame\n",
|
||||
"fcst_df = pd.read_csv(\"predictions.csv\", parse_dates=[time_column_name])\n",
|
||||
"fcst_df.head()"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.automl.core.shared import constants\n",
|
||||
"from azureml.automl.runtime.shared.score import scoring\n",
|
||||
@@ -578,37 +613,38 @@
|
||||
" (test_pred, test_test), (\"prediction\", \"truth\"), loc=\"upper left\", fontsize=8\n",
|
||||
")\n",
|
||||
"plt.show()"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Advanced Training <a id=\"advanced_training\"></a>\n",
|
||||
"We did not use lags in the previous model specification. In effect, the prediction was the result of a simple regression on date, time series identifier columns and any additional features. This is often a very good prediction as common time series patterns like seasonality and trends can be captured in this manner. Such simple regression is horizon-less: it doesn't matter how far into the future we are predicting, because we are not using past data. In the previous example, the horizon was only used to split the data for cross-validation."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Using lags and rolling window features\n",
|
||||
"Now we will configure the target lags, that is the previous values of the target variables, meaning the prediction is no longer horizon-less. We therefore must still specify the `forecast_horizon` that the model will learn to forecast. The `target_lags` keyword specifies how far back we will construct the lags of the target variable, and the `target_rolling_window_size` specifies the size of the rolling window over which we will generate the `max`, `min` and `sum` features.\n",
|
||||
"\n",
|
||||
"This notebook uses the blocked_models parameter to exclude some models that take a longer time to train on this dataset. You can choose to remove models from the blocked_models list but you may need to increase the iteration_timeout_minutes parameter value to get results."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"advanced_forecasting_parameters = ForecastingParameters(\n",
|
||||
" time_column_name=time_column_name,\n",
|
||||
" forecast_horizon=forecast_horizon,\n",
|
||||
" target_lags=12,\n",
|
||||
" target_rolling_window_size=4,\n",
|
||||
" cv_step_size=\"auto\",\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"automl_config = AutoMLConfig(\n",
|
||||
@@ -624,78 +660,78 @@
|
||||
" \"Prophet\",\n",
|
||||
" ], # These models are blocked for tutorial purposes, remove this for real use cases.\n",
|
||||
" experiment_timeout_hours=0.3,\n",
|
||||
" training_data=train,\n",
|
||||
" training_data=train_dataset,\n",
|
||||
" label_column_name=target_column_name,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" enable_early_stopping=True,\n",
|
||||
" n_cross_validations=3,\n",
|
||||
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
|
||||
" verbosity=logging.INFO,\n",
|
||||
" forecasting_parameters=advanced_forecasting_parameters,\n",
|
||||
")"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We now start a new remote run, this time with lag and rolling window featurization. AutoML applies featurizations in the setup stage, prior to iterating over ML models. The full training set is featurized first, followed by featurization of each of the CV splits. Lag and rolling window features introduce additional complexity, so the run will take longer than in the previous example that lacked these featurizations."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"advanced_remote_run = experiment.submit(automl_config, show_output=False)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"advanced_remote_run.wait_for_completion()"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Retrieve the Best Run details"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"best_run_lags = remote_run.get_best_child()\n",
|
||||
"best_run_lags"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Advanced Results<a id=\"advanced_results\"></a>\n",
|
||||
"We did not use lags in the previous model specification. In effect, the prediction was the result of a simple regression on date, time series identifier columns and any additional features. This is often a very good prediction as common time series patterns like seasonality and trends can be captured in this manner. Such simple regression is horizon-less: it doesn't matter how far into the future we are predicting, because we are not using past data. In the previous example, the horizon was only used to split the data for cross-validation."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"test_experiment_advanced = Experiment(ws, experiment_name + \"_inference_advanced\")\n",
|
||||
"advanced_remote_run_infer = run_remote_inference(\n",
|
||||
" test_experiment=test_experiment_advanced,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" train_run=best_run_lags,\n",
|
||||
" test_dataset=test,\n",
|
||||
" test_dataset=test_dataset,\n",
|
||||
" target_column_name=target_column_name,\n",
|
||||
" inference_folder=\"./forecast_advanced\",\n",
|
||||
")\n",
|
||||
@@ -705,23 +741,23 @@
|
||||
"advanced_remote_run_infer.download_file(\n",
|
||||
" \"outputs/predictions.csv\", \"predictions_advanced.csv\"\n",
|
||||
")"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fcst_adv_df = pd.read_csv(\"predictions_advanced.csv\", parse_dates=[time_column_name])\n",
|
||||
"fcst_adv_df.head()"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.automl.core.shared import constants\n",
|
||||
"from azureml.automl.runtime.shared.score import scoring\n",
|
||||
@@ -750,7 +786,10 @@
|
||||
" (test_pred, test_test), (\"prediction\", \"truth\"), loc=\"upper left\", fontsize=8\n",
|
||||
")\n",
|
||||
"plt.show()"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
@@ -764,21 +803,37 @@
|
||||
"automated-machine-learning"
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"name": "python38-azureml",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"display_name": "Python 3.8 - AzureML"
|
||||
},
|
||||
"language_info": {
|
||||
"name": "python",
|
||||
"version": "3.8.5",
|
||||
"mimetype": "text/x-python",
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.9"
|
||||
"nbconvert_exporter": "python",
|
||||
"file_extension": ".py"
|
||||
},
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
|
||||
}
|
||||
},
|
||||
"microsoft": {
|
||||
"ms_spell_check": {
|
||||
"ms_spell_check_language": "en"
|
||||
}
|
||||
},
|
||||
"kernel_info": {
|
||||
"name": "python3"
|
||||
},
|
||||
"nteract": {
|
||||
"version": "nteract-front-end@1.0.0"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@@ -52,7 +52,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Please make sure you have followed the `configuration.ipynb` notebook so that your ML workspace information is saved in the config file."
|
||||
"Please make sure you have followed the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) so that your ML workspace information is saved in the config file."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -335,7 +335,8 @@
|
||||
" forecast_horizon=forecast_horizon,\n",
|
||||
" time_series_id_column_names=[TIME_SERIES_ID_COLUMN_NAME],\n",
|
||||
" target_lags=lags,\n",
|
||||
" freq=\"H\", # Set the forecast frequency to be hourly\n",
|
||||
" freq=\"H\", # Set the forecast frequency to be hourly,\n",
|
||||
" cv_step_size=\"auto\",\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
@@ -365,7 +366,7 @@
|
||||
" enable_early_stopping=True,\n",
|
||||
" training_data=train_data,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" n_cross_validations=3,\n",
|
||||
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
|
||||
" verbosity=logging.INFO,\n",
|
||||
" max_concurrent_iterations=4,\n",
|
||||
" max_cores_per_iteration=-1,\n",
|
||||
@@ -757,7 +758,15 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Forecasting farther than the forecast horizon <a id=\"recursive forecasting\"></a>\n",
|
||||
"When the forecast destination, or the latest date in the prediction data frame, is farther into the future than the specified forecast horizon, the `forecast()` function will still make point predictions out to the later date using a recursive operation mode. Internally, the method recursively applies the regular forecaster to generate context so that we can forecast further into the future. \n",
|
||||
"When the forecast destination, or the latest date in the prediction data frame, is farther into the future than the specified forecast horizon, the forecaster must be iteratively applied. Here, we advance the forecast origin on each iteration over the prediction window, predicting `max_horizon` periods ahead on each iteration. There are two choices for the context data to use as the forecaster advances into the prediction window:\n",
|
||||
"\n",
|
||||
"1. We can use forecasted values from previous iterations (recursive forecast),\n",
|
||||
"2. We can use known, actual values of the target if they are available (rolling forecast).\n",
|
||||
"\n",
|
||||
"The first method is useful in a true forecasting scenario when we do not yet know the actual target values while the second is useful in an evaluation scenario where we want to compute accuracy metrics for the `max_horizon`-period-ahead forecaster over a long test set. We refer to the first as a **recursive forecast** since we apply the forecaster recursively over the prediction window and the second as a **rolling forecast** since we roll forward over known actuals.\n",
|
||||
"\n",
|
||||
"### Recursive forecasting\n",
|
||||
"By default, the `forecast()` function will make point predictions out to the later date using a recursive operation mode. Internally, the method recursively applies the regular forecaster to generate context so that we can forecast further into the future. \n",
|
||||
"\n",
|
||||
"To illustrate the use-case and operation of recursive forecasting, we'll consider an example with a single time-series where the forecasting period directly follows the training period and is twice as long as the forecasting horizon given at training time.\n",
|
||||
"\n",
|
||||
@@ -817,6 +826,35 @@
|
||||
"np.array_equal(y_pred_all, y_pred_long)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Rolling forecasts\n",
|
||||
"A rolling forecast is a similar concept to the recursive forecasts described above except that we use known actual values of the target for our context data. We have provided a different, public method for this called `rolling_forecast`. In addition to test data and actuals (`X_test` and `y_test`), `rolling_forecast` also accepts an optional `step` parameter that controls how far the origin advances on each iteration. The recursive forecast mode uses a fixed step of `max_horizon` while `rolling_forecast` defaults to a step size of 1, but can be set to any integer from 1 to `max_horizon`, inclusive.\n",
|
||||
"\n",
|
||||
"Let's see what the rolling forecast looks like on the long test set with the step set to 1:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"X_rf = fitted_model.rolling_forecast(X_test_long, y_test_long, step=1)\n",
|
||||
"X_rf.head(n=12)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Notice that `rolling_forecast` has returned a single DataFrame containing all results and has generated some new columns: `_automl_forecast_origin`, `_automl_forecast_y`, and `_automl_actual_y`. These are the origin date for each forecast, the forecasted value and the actual value, respectively. Note that \"y\" in the forecast and actual column names will generally be replaced by the target column name supplied to AutoML.\n",
|
||||
"\n",
|
||||
"The output above shows forecasts for two prediction windows, the first with origin at the end of the training set and the second including the first observation in the test set (2000-01-01 06:00:00). Since the forecast windows overlap, there are multiple forecasts for most dates which are associated with different origin dates."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
@@ -865,9 +903,9 @@
|
||||
"friendly_name": "Forecasting away from training data",
|
||||
"index_order": 3,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
@@ -879,14 +917,19 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.8"
|
||||
"version": "3.7.13"
|
||||
},
|
||||
"tags": [
|
||||
"Forecasting",
|
||||
"Confidence Intervals"
|
||||
],
|
||||
"task": "Forecasting"
|
||||
"task": "Forecasting",
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
"nbformat_minor": 4
|
||||
}
|
||||
@@ -52,7 +52,7 @@
|
||||
"\n",
|
||||
"AutoML highlights here include using Deep Learning forecasts, Arima, Prophet, Remote Execution and Remote Inferencing, and working with the `forecast` function. Please also look at the additional forecasting notebooks, which document lagging, rolling windows, forecast quantiles, other ways to use the forecast function, and forecaster deployment.\n",
|
||||
"\n",
|
||||
"Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n",
|
||||
"Make sure you have executed the [configuration](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) before running this notebook.\n",
|
||||
"\n",
|
||||
"Notebook synopsis:\n",
|
||||
"\n",
|
||||
@@ -325,7 +325,7 @@
|
||||
"source": [
|
||||
"### Setting forecaster maximum horizon \n",
|
||||
"\n",
|
||||
"The forecast horizon is the number of periods into the future that the model should predict. Here, we set the horizon to 12 periods (i.e. 12 months). Notice that this is much shorter than the number of months in the test set; we will need to use a rolling test to evaluate the performance on the whole test set. For more discussion of forecast horizons and guiding principles for setting them, please see the [energy demand notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand). "
|
||||
"The forecast horizon is the number of periods into the future that the model should predict. Here, we set the horizon to 14 periods (i.e. 14 days). Notice that this is much shorter than the number of months in the test set; we will need to use a rolling test to evaluate the performance on the whole test set. For more discussion of forecast horizons and guiding principles for setting them, please see the [energy demand notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand). "
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -337,7 +337,7 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"forecast_horizon = 12"
|
||||
"forecast_horizon = 14"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -681,9 +681,9 @@
|
||||
],
|
||||
"hide_code_all_hidden": false,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
@@ -699,5 +699,5 @@
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
"nbformat_minor": 4
|
||||
}
|
||||
@@ -4,7 +4,6 @@ import os
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
|
||||
from pandas.tseries.frequencies import to_offset
|
||||
from sklearn.externals import joblib
|
||||
from sklearn.metrics import mean_absolute_error, mean_squared_error
|
||||
|
||||
@@ -19,219 +18,8 @@ except ImportError:
|
||||
_torch_present = False
|
||||
|
||||
|
||||
def align_outputs(
|
||||
y_predicted,
|
||||
X_trans,
|
||||
X_test,
|
||||
y_test,
|
||||
predicted_column_name="predicted",
|
||||
horizon_colname="horizon_origin",
|
||||
):
|
||||
"""
|
||||
Demonstrates how to get the output aligned to the inputs
|
||||
using pandas indexes. Helps understand what happened if
|
||||
the output's shape differs from the input shape, or if
|
||||
the data got re-sorted by time and grain during forecasting.
|
||||
|
||||
Typical causes of misalignment are:
|
||||
* we predicted some periods that were missing in actuals -> drop from eval
|
||||
* model was asked to predict past max_horizon -> increase max horizon
|
||||
* data at start of X_test was needed for lags -> provide previous periods
|
||||
"""
|
||||
if horizon_colname in X_trans:
|
||||
df_fcst = pd.DataFrame(
|
||||
{
|
||||
predicted_column_name: y_predicted,
|
||||
horizon_colname: X_trans[horizon_colname],
|
||||
}
|
||||
)
|
||||
else:
|
||||
df_fcst = pd.DataFrame({predicted_column_name: y_predicted})
|
||||
|
||||
# y and X outputs are aligned by forecast() function contract
|
||||
df_fcst.index = X_trans.index
|
||||
|
||||
# align original X_test to y_test
|
||||
X_test_full = X_test.copy()
|
||||
X_test_full[target_column_name] = y_test
|
||||
|
||||
# X_test_full's index does not include origin, so reset for merge
|
||||
df_fcst.reset_index(inplace=True)
|
||||
X_test_full = X_test_full.reset_index().drop(columns="index")
|
||||
together = df_fcst.merge(X_test_full, how="right")
|
||||
|
||||
# drop rows where prediction or actuals are nan
|
||||
# happens because of missing actuals
|
||||
# or at edges of time due to lags/rolling windows
|
||||
clean = together[
|
||||
together[[target_column_name, predicted_column_name]].notnull().all(axis=1)
|
||||
]
|
||||
return clean
|
||||
|
||||
|
||||
def do_rolling_forecast_with_lookback(
|
||||
fitted_model, X_test, y_test, max_horizon, X_lookback, y_lookback, freq="D"
|
||||
):
|
||||
"""
|
||||
Produce forecasts on a rolling origin over the given test set.
|
||||
|
||||
Each iteration makes a forecast for the next 'max_horizon' periods
|
||||
with respect to the current origin, then advances the origin by the
|
||||
horizon time duration. The prediction context for each forecast is set so
|
||||
that the forecaster uses the actual target values prior to the current
|
||||
origin time for constructing lag features.
|
||||
|
||||
This function returns a concatenated DataFrame of rolling forecasts.
|
||||
"""
|
||||
print("Using lookback of size: ", y_lookback.size)
|
||||
df_list = []
|
||||
origin_time = X_test[time_column_name].min()
|
||||
X = X_lookback.append(X_test)
|
||||
y = np.concatenate((y_lookback, y_test), axis=0)
|
||||
while origin_time <= X_test[time_column_name].max():
|
||||
# Set the horizon time - end date of the forecast
|
||||
horizon_time = origin_time + max_horizon * to_offset(freq)
|
||||
|
||||
# Extract test data from an expanding window up-to the horizon
|
||||
expand_wind = X[time_column_name] < horizon_time
|
||||
X_test_expand = X[expand_wind]
|
||||
y_query_expand = np.zeros(len(X_test_expand)).astype(float)
|
||||
y_query_expand.fill(np.NaN)
|
||||
|
||||
if origin_time != X[time_column_name].min():
|
||||
# Set the context by including actuals up-to the origin time
|
||||
test_context_expand_wind = X[time_column_name] < origin_time
|
||||
context_expand_wind = X_test_expand[time_column_name] < origin_time
|
||||
y_query_expand[context_expand_wind] = y[test_context_expand_wind]
|
||||
|
||||
# Print some debug info
|
||||
print(
|
||||
"Horizon_time:",
|
||||
horizon_time,
|
||||
" origin_time: ",
|
||||
origin_time,
|
||||
" max_horizon: ",
|
||||
max_horizon,
|
||||
" freq: ",
|
||||
freq,
|
||||
)
|
||||
print("expand_wind: ", expand_wind)
|
||||
print("y_query_expand")
|
||||
print(y_query_expand)
|
||||
print("X_test")
|
||||
print(X)
|
||||
print("X_test_expand")
|
||||
print(X_test_expand)
|
||||
print("Type of X_test_expand: ", type(X_test_expand))
|
||||
print("Type of y_query_expand: ", type(y_query_expand))
|
||||
|
||||
print("y_query_expand")
|
||||
print(y_query_expand)
|
||||
|
||||
# Make a forecast out to the maximum horizon
|
||||
# y_fcst, X_trans = y_query_expand, X_test_expand
|
||||
y_fcst, X_trans = fitted_model.forecast(X_test_expand, y_query_expand)
|
||||
|
||||
print("y_fcst")
|
||||
print(y_fcst)
|
||||
|
||||
# Align forecast with test set for dates within
|
||||
# the current rolling window
|
||||
trans_tindex = X_trans.index.get_level_values(time_column_name)
|
||||
trans_roll_wind = (trans_tindex >= origin_time) & (trans_tindex < horizon_time)
|
||||
test_roll_wind = expand_wind & (X[time_column_name] >= origin_time)
|
||||
df_list.append(
|
||||
align_outputs(
|
||||
y_fcst[trans_roll_wind],
|
||||
X_trans[trans_roll_wind],
|
||||
X[test_roll_wind],
|
||||
y[test_roll_wind],
|
||||
)
|
||||
)
|
||||
|
||||
# Advance the origin time
|
||||
origin_time = horizon_time
|
||||
|
||||
return pd.concat(df_list, ignore_index=True)
|
||||
|
||||
|
||||
def do_rolling_forecast(fitted_model, X_test, y_test, max_horizon, freq="D"):
|
||||
"""
|
||||
Produce forecasts on a rolling origin over the given test set.
|
||||
|
||||
Each iteration makes a forecast for the next 'max_horizon' periods
|
||||
with respect to the current origin, then advances the origin by the
|
||||
horizon time duration. The prediction context for each forecast is set so
|
||||
that the forecaster uses the actual target values prior to the current
|
||||
origin time for constructing lag features.
|
||||
|
||||
This function returns a concatenated DataFrame of rolling forecasts.
|
||||
"""
|
||||
df_list = []
|
||||
origin_time = X_test[time_column_name].min()
|
||||
while origin_time <= X_test[time_column_name].max():
|
||||
# Set the horizon time - end date of the forecast
|
||||
horizon_time = origin_time + max_horizon * to_offset(freq)
|
||||
|
||||
# Extract test data from an expanding window up-to the horizon
|
||||
expand_wind = X_test[time_column_name] < horizon_time
|
||||
X_test_expand = X_test[expand_wind]
|
||||
y_query_expand = np.zeros(len(X_test_expand)).astype(float)
|
||||
y_query_expand.fill(np.NaN)
|
||||
|
||||
if origin_time != X_test[time_column_name].min():
|
||||
# Set the context by including actuals up-to the origin time
|
||||
test_context_expand_wind = X_test[time_column_name] < origin_time
|
||||
context_expand_wind = X_test_expand[time_column_name] < origin_time
|
||||
y_query_expand[context_expand_wind] = y_test[test_context_expand_wind]
|
||||
|
||||
# Print some debug info
|
||||
print(
|
||||
"Horizon_time:",
|
||||
horizon_time,
|
||||
" origin_time: ",
|
||||
origin_time,
|
||||
" max_horizon: ",
|
||||
max_horizon,
|
||||
" freq: ",
|
||||
freq,
|
||||
)
|
||||
print("expand_wind: ", expand_wind)
|
||||
print("y_query_expand")
|
||||
print(y_query_expand)
|
||||
print("X_test")
|
||||
print(X_test)
|
||||
print("X_test_expand")
|
||||
print(X_test_expand)
|
||||
print("Type of X_test_expand: ", type(X_test_expand))
|
||||
print("Type of y_query_expand: ", type(y_query_expand))
|
||||
print("y_query_expand")
|
||||
print(y_query_expand)
|
||||
|
||||
# Make a forecast out to the maximum horizon
|
||||
y_fcst, X_trans = fitted_model.forecast(X_test_expand, y_query_expand)
|
||||
|
||||
print("y_fcst")
|
||||
print(y_fcst)
|
||||
|
||||
# Align forecast with test set for dates within the
|
||||
# current rolling window
|
||||
trans_tindex = X_trans.index.get_level_values(time_column_name)
|
||||
trans_roll_wind = (trans_tindex >= origin_time) & (trans_tindex < horizon_time)
|
||||
test_roll_wind = expand_wind & (X_test[time_column_name] >= origin_time)
|
||||
df_list.append(
|
||||
align_outputs(
|
||||
y_fcst[trans_roll_wind],
|
||||
X_trans[trans_roll_wind],
|
||||
X_test[test_roll_wind],
|
||||
y_test[test_roll_wind],
|
||||
)
|
||||
)
|
||||
|
||||
# Advance the origin time
|
||||
origin_time = horizon_time
|
||||
|
||||
return pd.concat(df_list, ignore_index=True)
|
||||
def map_location_cuda(storage, loc):
|
||||
return storage.cuda()
|
||||
|
||||
|
||||
def APE(actual, pred):
|
||||
@@ -254,10 +42,6 @@ def MAPE(actual, pred):
|
||||
return np.mean(APE(actual_safe, pred_safe))
|
||||
|
||||
|
||||
def map_location_cuda(storage, loc):
|
||||
return storage.cuda()
|
||||
|
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument(
|
||||
"--max_horizon",
|
||||
@@ -303,7 +87,6 @@ print(model_path)
|
||||
run = Run.get_context()
|
||||
# get input dataset by name
|
||||
test_dataset = run.input_datasets["test_data"]
|
||||
lookback_dataset = run.input_datasets["lookback_data"]
|
||||
|
||||
grain_column_names = []
|
||||
|
||||
@@ -312,15 +95,8 @@ df = test_dataset.to_pandas_dataframe()
|
||||
print("Read df")
|
||||
print(df)
|
||||
|
||||
X_test_df = test_dataset.drop_columns(columns=[target_column_name])
|
||||
y_test_df = test_dataset.with_timestamp_columns(None).keep_columns(
|
||||
columns=[target_column_name]
|
||||
)
|
||||
|
||||
X_lookback_df = lookback_dataset.drop_columns(columns=[target_column_name])
|
||||
y_lookback_df = lookback_dataset.with_timestamp_columns(None).keep_columns(
|
||||
columns=[target_column_name]
|
||||
)
|
||||
X_test_df = df
|
||||
y_test = df.pop(target_column_name).to_numpy()
|
||||
|
||||
_, ext = os.path.splitext(model_path)
|
||||
if ext == ".pt":
|
||||
@@ -336,37 +112,20 @@ else:
|
||||
# Load the sklearn pipeline.
|
||||
fitted_model = joblib.load(model_path)
|
||||
|
||||
if hasattr(fitted_model, "get_lookback"):
|
||||
lookback = fitted_model.get_lookback()
|
||||
df_all = do_rolling_forecast_with_lookback(
|
||||
fitted_model,
|
||||
X_test_df.to_pandas_dataframe(),
|
||||
y_test_df.to_pandas_dataframe().values.T[0],
|
||||
max_horizon,
|
||||
X_lookback_df.to_pandas_dataframe()[-lookback:],
|
||||
y_lookback_df.to_pandas_dataframe().values.T[0][-lookback:],
|
||||
freq,
|
||||
)
|
||||
else:
|
||||
df_all = do_rolling_forecast(
|
||||
fitted_model,
|
||||
X_test_df.to_pandas_dataframe(),
|
||||
y_test_df.to_pandas_dataframe().values.T[0],
|
||||
max_horizon,
|
||||
freq,
|
||||
)
|
||||
X_rf = fitted_model.rolling_forecast(X_test_df, y_test, step=1)
|
||||
assign_dict = {
|
||||
fitted_model.forecast_origin_column_name: "forecast_origin",
|
||||
fitted_model.forecast_column_name: "predicted",
|
||||
fitted_model.actual_column_name: target_column_name,
|
||||
}
|
||||
X_rf.rename(columns=assign_dict, inplace=True)
|
||||
|
||||
print(df_all)
|
||||
|
||||
print("target values:::")
|
||||
print(df_all[target_column_name])
|
||||
print("predicted values:::")
|
||||
print(df_all["predicted"])
|
||||
print(X_rf.head())
|
||||
|
||||
# Use the AutoML scoring module
|
||||
regression_metrics = list(constants.REGRESSION_SCALAR_SET)
|
||||
y_test = np.array(df_all[target_column_name])
|
||||
y_pred = np.array(df_all["predicted"])
|
||||
y_test = np.array(X_rf[target_column_name])
|
||||
y_pred = np.array(X_rf["predicted"])
|
||||
scores = scoring.score_regression(y_test, y_pred, regression_metrics)
|
||||
|
||||
print("scores:")
|
||||
@@ -376,11 +135,11 @@ for key, value in scores.items():
|
||||
run.log(key, value)
|
||||
|
||||
print("Simple forecasting model")
|
||||
rmse = np.sqrt(mean_squared_error(df_all[target_column_name], df_all["predicted"]))
|
||||
rmse = np.sqrt(mean_squared_error(X_rf[target_column_name], X_rf["predicted"]))
|
||||
print("[Test Data] \nRoot Mean squared error: %.2f" % rmse)
|
||||
mae = mean_absolute_error(df_all[target_column_name], df_all["predicted"])
|
||||
mae = mean_absolute_error(X_rf[target_column_name], X_rf["predicted"])
|
||||
print("mean_absolute_error score: %.2f" % mae)
|
||||
print("MAPE: %.2f" % MAPE(df_all[target_column_name], df_all["predicted"]))
|
||||
print("MAPE: %.2f" % MAPE(X_rf[target_column_name], X_rf["predicted"]))
|
||||
|
||||
run.log("rmse", rmse)
|
||||
run.log("mae", mae)
|
||||
|
||||
@@ -40,7 +40,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Prerequisites\n",
|
||||
"You'll need to create a compute Instance by following the instructions in the [EnvironmentSetup.md](../Setup_Resources/EnvironmentSetup.md)."
|
||||
"You'll need to create a compute Instance by following [these](https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-create-manage-compute-instance?tabs=python) instructions."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -251,8 +251,17 @@
|
||||
"source": [
|
||||
"### Set up training parameters\n",
|
||||
"\n",
|
||||
"This dictionary defines the AutoML and hierarchy settings. For this forecasting task we need to define several settings inncluding the name of the time column, the maximum forecast horizon, the hierarchy definition, and the level of the hierarchy at which to train.\n",
|
||||
"We need to provide ``ForecastingParameters``, ``AutoMLConfig`` and ``HTSTrainParameters`` objects. For the forecasting task we need to define several settings including the name of the time column, the maximum forecast horizon, the hierarchy definition, and the level of the hierarchy at which to train.\n",
|
||||
"\n",
|
||||
"#### ``ForecastingParameters`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
|
||||
"| **time_column_name** | The name of your time column. |\n",
|
||||
"| **time_series_id_column_names** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
|
||||
"| **cv_step_size** | Number of periods between two consecutive cross-validation folds. The default value is \\\"auto\\\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value. |\n",
|
||||
"\n",
|
||||
"#### ``AutoMLConfig`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **task** | forecasting |\n",
|
||||
@@ -260,19 +269,22 @@
|
||||
"| **blocked_models** | Blocked models won't be used by AutoML. |\n",
|
||||
"| **iteration_timeout_minutes** | Maximum amount of time in minutes that the model can train. This is optional but provides customers with greater control on exit criteria. |\n",
|
||||
"| **iterations** | Number of models to train. This is optional but provides customers with greater control on exit criteria. |\n",
|
||||
"| **experiment_timeout_hours** | Maximum amount of time in hours that the experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |\n",
|
||||
"| **experiment_timeout_hours** | Maximum amount of time in hours that each experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. **It does not control the overall timeout for the pipeline run, instead controls the timeout for each training run per partitioned time series.** |\n",
|
||||
"| **label_column_name** | The name of the label column. |\n",
|
||||
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
|
||||
"| **n_cross_validations** | Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
|
||||
"| **enable_early_stopping** | Flag to enable early termination if the score is not improving in the short term. |\n",
|
||||
"| **time_column_name** | The name of your time column. |\n",
|
||||
"| **hierarchy_column_names** | The names of columns that define the hierarchical structure of the data from highest level to most granular. |\n",
|
||||
"| **training_level** | The level of the hierarchy to be used for training models. |\n",
|
||||
"| **n_cross_validations** | Number of cross validation splits. The default value is \\\"auto\\\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
|
||||
"| **enable_early_stopping** | Flag to enable early termination if the primary metric is no longer improving. |\n",
|
||||
"| **enable_engineered_explanations** | Engineered feature explanations will be downloaded if enable_engineered_explanations flag is set to True. By default it is set to False to save storage space. |\n",
|
||||
"| **time_series_id_column_name** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
|
||||
"| **track_child_runs** | Flag to disable tracking of child runs. Only best run is tracked if the flag is set to False (this includes the model and metrics of the run). |\n",
|
||||
"| **pipeline_fetch_max_batch_size** | Determines how many pipelines (training algorithms) to fetch at a time for training, this helps reduce throttling when training at large scale. |\n",
|
||||
"| **model_explainability** | Flag to disable explaining the best automated ML model at the end of all training iterations. The default is True and will block non-explainable models which may impact the forecast accuracy. For more information, see [Interpretability: model explanations in automated machine learning](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-machine-learning-interpretability-automl). |"
|
||||
"| **model_explainability** | Flag to disable explaining the best automated ML model at the end of all training iterations. The default is True and will block non-explainable models which may impact the forecast accuracy. For more information, see [Interpretability: model explanations in automated machine learning](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-machine-learning-interpretability-automl). |\n",
|
||||
"\n",
|
||||
"#### ``HTSTrainParameters`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **automl_settings** | The ``AutoMLConfig`` object defined above. |\n",
|
||||
"| **hierarchy_column_names** | The names of columns that define the hierarchical structure of the data from highest level to most granular. |\n",
|
||||
"| **training_level** | The level of the hierarchy to be used for training models. |\n",
|
||||
"| **enable_engineered_explanations** | The switch controls engineered explanations. |"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -286,6 +298,9 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.train.automl.runtime._hts.hts_parameters import HTSTrainParameters\n",
|
||||
"from azureml.automl.core.forecasting_parameters import ForecastingParameters\n",
|
||||
"from azureml.train.automl.automlconfig import AutoMLConfig\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"model_explainability = True\n",
|
||||
"\n",
|
||||
@@ -299,23 +314,26 @@
|
||||
"label_column_name = \"quantity\"\n",
|
||||
"forecast_horizon = 7\n",
|
||||
"\n",
|
||||
"forecasting_parameters = ForecastingParameters(\n",
|
||||
" time_column_name=time_column_name,\n",
|
||||
" forecast_horizon=forecast_horizon,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"automl_settings = {\n",
|
||||
" \"task\": \"forecasting\",\n",
|
||||
" \"primary_metric\": \"normalized_root_mean_squared_error\",\n",
|
||||
" \"label_column_name\": label_column_name,\n",
|
||||
" \"time_column_name\": time_column_name,\n",
|
||||
" \"forecast_horizon\": forecast_horizon,\n",
|
||||
" \"hierarchy_column_names\": hierarchy,\n",
|
||||
" \"hierarchy_training_level\": training_level,\n",
|
||||
" \"track_child_runs\": False,\n",
|
||||
" \"pipeline_fetch_max_batch_size\": 15,\n",
|
||||
" \"model_explainability\": model_explainability,\n",
|
||||
"automl_settings = AutoMLConfig(\n",
|
||||
" task=\"forecasting\",\n",
|
||||
" primary_metric=\"normalized_root_mean_squared_error\",\n",
|
||||
" experiment_timeout_hours=1,\n",
|
||||
" label_column_name=label_column_name,\n",
|
||||
" track_child_runs=False,\n",
|
||||
" forecasting_parameters=forecasting_parameters,\n",
|
||||
" pipeline_fetch_max_batch_size=15,\n",
|
||||
" model_explainability=model_explainability,\n",
|
||||
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
|
||||
" cv_step_size=\"auto\",\n",
|
||||
" # The following settings are specific to this sample and should be adjusted according to your own needs.\n",
|
||||
" \"iteration_timeout_minutes\": 10,\n",
|
||||
" \"iterations\": 10,\n",
|
||||
" \"n_cross_validations\": 2,\n",
|
||||
"}\n",
|
||||
" iteration_timeout_minutes=10,\n",
|
||||
" iterations=15,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"hts_parameters = HTSTrainParameters(\n",
|
||||
" automl_settings=automl_settings,\n",
|
||||
@@ -336,15 +354,25 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Parallel run step is leveraged to train the hierarchy. To configure the ParallelRunConfig you will need to determine the appropriate number of workers and nodes for your use case. The `process_count_per_node` is based off the number of cores of the compute VM. The node_count will determine the number of master nodes to use, increasing the node count will speed up the training process.\n",
|
||||
"Parallel run step is leveraged to train multiple models at once. To configure the ParallelRunConfig you will need to determine the appropriate number of workers and nodes for your use case. The ``process_count_per_node`` is based off the number of cores of the compute VM. The node_count will determine the number of master nodes to use, increasing the node count will speed up the training process.\n",
|
||||
"\n",
|
||||
"* **experiment:** The experiment used for training.\n",
|
||||
"* **train_data:** The tabular dataset to be used as input to the training run.\n",
|
||||
"* **node_count:** The number of compute nodes to be used for running the user script. We recommend to start with 3 and increase the node_count if the training time is taking too long.\n",
|
||||
"* **process_count_per_node:** Process count per node, we recommend 2:1 ratio for number of cores: number of processes per node. eg. If node has 16 cores then configure 8 or less process count per node or optimal performance.\n",
|
||||
"* **train_pipeline_parameters:** The set of configuration parameters defined in the previous section. \n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **experiment** | The experiment used for training. |\n",
|
||||
"| **train_data** | The file dataset to be used as input to the training run. |\n",
|
||||
"| **node_count** | The number of compute nodes to be used for running the user script. We recommend to start with 3 and increase the node_count if the training time is taking too long. |\n",
|
||||
"| **process_count_per_node** | Process count per node, we recommend 2:1 ratio for number of cores: number of processes per node. eg. If node has 16 cores then configure 8 or less process count per node for optimal performance. |\n",
|
||||
"| **train_pipeline_parameters** | The set of configuration parameters defined in the previous section. |\n",
|
||||
"| **run_invocation_timeout** | Maximum amount of time in seconds that the ``ParallelRunStep`` class is allowed. This is optional but provides customers with greater control on exit criteria. This must be greater than ``experiment_timeout_hours`` by at least 300 seconds. |\n",
|
||||
"\n",
|
||||
"Calling this method will create a new aggregated dataset which is generated dynamically on pipeline execution."
|
||||
"Calling this method will create a new aggregated dataset which is generated dynamically on pipeline execution.\n",
|
||||
"\n",
|
||||
"**Note**: Total time taken for the **training step** in the pipeline to complete = $ \\frac{t}{ p \\times n } \\times ts $\n",
|
||||
"where,\n",
|
||||
"- $ t $ is time taken for training one partition (can be viewed in the training logs)\n",
|
||||
"- $ p $ is ``process_count_per_node``\n",
|
||||
"- $ n $ is ``node_count``\n",
|
||||
"- $ ts $ is total number of partitions in time series based on ``partition_column_names``"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -363,6 +391,7 @@
|
||||
" node_count=2,\n",
|
||||
" process_count_per_node=8,\n",
|
||||
" train_pipeline_parameters=hts_parameters,\n",
|
||||
" run_invocation_timeout=3900,\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
@@ -507,19 +536,24 @@
|
||||
"source": [
|
||||
"## 5.0 Forecasting\n",
|
||||
"For hierarchical forecasting we need to provide the HTSInferenceParameters object.\n",
|
||||
"#### HTSInferenceParameters arguments\n",
|
||||
"* **hierarchy_forecast_level:** The default level of the hierarchy to produce prediction/forecast on.\n",
|
||||
"* **allocation_method:** \\[Optional] The disaggregation method to use if the hierarchy forecast level specified is below the define hierarchy training level. <br><i>(average historical proportions) 'average_historical_proportions'</i><br><i>(proportions of the historical averages) 'proportions_of_historical_average'</i>\n",
|
||||
"#### ``HTSInferenceParameters`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **hierarchy_forecast_level:** | The default level of the hierarchy to produce prediction/forecast on. |\n",
|
||||
"| **allocation_method:** | \\[Optional] The disaggregation method to use if the hierarchy forecast level specified is below the define hierarchy training level. <br><i>(average historical proportions) 'average_historical_proportions'</i><br><i>(proportions of the historical averages) 'proportions_of_historical_average'</i> |\n",
|
||||
"\n",
|
||||
"#### get_many_models_batch_inference_steps arguments\n",
|
||||
"* **experiment:** The experiment used for inference run.\n",
|
||||
"* **inference_data:** The data to use for inferencing. It should be the same schema as used for training.\n",
|
||||
"* **compute_target:** The compute target that runs the inference pipeline.\n",
|
||||
"* **node_count:** The number of compute nodes to be used for running the user script. We recommend to start with the number of cores per node (varies by compute sku).\n",
|
||||
"* **process_count_per_node:** The number of processes per node.\n",
|
||||
"* **train_run_id:** \\[Optional] The run id of the hierarchy training, by default it is the latest successful training hts run in the experiment.\n",
|
||||
"* **train_experiment_name:** \\[Optional] The train experiment that contains the train pipeline. This one is only needed when the train pipeline is not in the same experiement as the inference pipeline.\n",
|
||||
"* **process_count_per_node:** \\[Optional] The number of processes per node, by default it's 4."
|
||||
"#### ``get_many_models_batch_inference_steps`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **experiment** | The experiment used for inference run. |\n",
|
||||
"| **inference_data** | The data to use for inferencing. It should be the same schema as used for training.\n",
|
||||
"| **compute_target** | The compute target that runs the inference pipeline. |\n",
|
||||
"| **node_count** | The number of compute nodes to be used for running the user script. We recommend to start with the number of cores per node (varies by compute sku). |\n",
|
||||
"| **process_count_per_node** | \\[Optional] The number of processes per node. By default it's 2 (should be at most half of the number of cores in a single node of the compute cluster that will be used for the experiment).\n",
|
||||
"| **inference_pipeline_parameters** | \\[Optional] The ``HTSInferenceParameters`` object defined above. |\n",
|
||||
"| **train_run_id** | \\[Optional] The run id of the **training pipeline**. By default it is the latest successful training pipeline run in the experiment. |\n",
|
||||
"| **train_experiment_name** | \\[Optional] The train experiment that contains the train pipeline. This one is only needed when the train pipeline is not in the same experiement as the inference pipeline. |\n",
|
||||
"| **run_invocation_timeout** | \\[Optional] Maximum amount of time in seconds that the ``ParallelRunStep`` class is allowed. This is optional but provides customers with greater control on exit criteria. |"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -618,9 +652,9 @@
|
||||
"automated-machine-learning"
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -40,7 +40,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Prerequisites\n",
|
||||
"You'll need to create a compute Instance by following the instructions in the [EnvironmentSetup.md](../Setup_Resources/EnvironmentSetup.md)."
|
||||
"You'll need to create a compute Instance by following [these](https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-create-manage-compute-instance?tabs=python) instructions."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -379,8 +379,17 @@
|
||||
"source": [
|
||||
"### Set up training parameters\n",
|
||||
"\n",
|
||||
"This dictionary defines the AutoML and many models settings. For this forecasting task we need to define several settings inncluding the name of the time column, the maximum forecast horizon, and the partition column name definition.\n",
|
||||
"We need to provide ``ForecastingParameters``, ``AutoMLConfig`` and ``ManyModelsTrainParameters`` objects. For the forecasting task we also need to define several settings including the name of the time column, the maximum forecast horizon, and the partition column name(s) definition.\n",
|
||||
"\n",
|
||||
"#### ``ForecastingParameters`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
|
||||
"| **time_column_name** | The name of your time column. |\n",
|
||||
"| **time_series_id_column_names** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
|
||||
"| **cv_step_size** | Number of periods between two consecutive cross-validation folds. The default value is \\\"auto\\\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value. |\n",
|
||||
"\n",
|
||||
"#### ``AutoMLConfig`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **task** | forecasting |\n",
|
||||
@@ -388,16 +397,19 @@
|
||||
"| **blocked_models** | Blocked models won't be used by AutoML. |\n",
|
||||
"| **iteration_timeout_minutes** | Maximum amount of time in minutes that the model can train. This is optional but provides customers with greater control on exit criteria. |\n",
|
||||
"| **iterations** | Number of models to train. This is optional but provides customers with greater control on exit criteria. |\n",
|
||||
"| **experiment_timeout_hours** | Maximum amount of time in hours that the experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |\n",
|
||||
"| **experiment_timeout_hours** | Maximum amount of time in hours that each experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. **It does not control the overall timeout for the pipeline run, instead controls the timeout for each training run per partitioned time series.** |\n",
|
||||
"| **label_column_name** | The name of the label column. |\n",
|
||||
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
|
||||
"| **n_cross_validations** | Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
|
||||
"| **enable_early_stopping** | Flag to enable early termination if the score is not improving in the short term. |\n",
|
||||
"| **time_column_name** | The name of your time column. |\n",
|
||||
"| **n_cross_validations** | Number of cross validation splits. The default value is \\\"auto\\\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
|
||||
"| **enable_early_stopping** | Flag to enable early termination if the primary metric is no longer improving. |\n",
|
||||
"| **enable_engineered_explanations** | Engineered feature explanations will be downloaded if enable_engineered_explanations flag is set to True. By default it is set to False to save storage space. |\n",
|
||||
"| **time_series_id_column_names** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
|
||||
"| **track_child_runs** | Flag to disable tracking of child runs. Only best run is tracked if the flag is set to False (this includes the model and metrics of the run). |\n",
|
||||
"| **pipeline_fetch_max_batch_size** | Determines how many pipelines (training algorithms) to fetch at a time for training, this helps reduce throttling when training at large scale. |\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"#### ``ManyModelsTrainParameters`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **automl_settings** | The ``AutoMLConfig`` object defined above. |\n",
|
||||
"| **partition_column_names** | The names of columns used to group your models. For timeseries, the groups must not split up individual time-series. That is, each group must contain one or more whole time-series. |"
|
||||
]
|
||||
},
|
||||
@@ -414,22 +426,29 @@
|
||||
"from azureml.train.automl.runtime._many_models.many_models_parameters import (\n",
|
||||
" ManyModelsTrainParameters,\n",
|
||||
")\n",
|
||||
"from azureml.automl.core.forecasting_parameters import ForecastingParameters\n",
|
||||
"from azureml.train.automl.automlconfig import AutoMLConfig\n",
|
||||
"\n",
|
||||
"partition_column_names = [\"Store\", \"Brand\"]\n",
|
||||
"automl_settings = {\n",
|
||||
" \"task\": \"forecasting\",\n",
|
||||
" \"primary_metric\": \"normalized_root_mean_squared_error\",\n",
|
||||
" \"iteration_timeout_minutes\": 10, # This needs to be changed based on the dataset. We ask customer to explore how long training is taking before settings this value\n",
|
||||
" \"iterations\": 15,\n",
|
||||
" \"experiment_timeout_hours\": 0.25,\n",
|
||||
" \"label_column_name\": \"Quantity\",\n",
|
||||
" \"n_cross_validations\": 3,\n",
|
||||
" \"time_column_name\": \"WeekStarting\",\n",
|
||||
" \"drop_column_names\": \"Revenue\",\n",
|
||||
" \"forecast_horizon\": 6,\n",
|
||||
" \"time_series_id_column_names\": partition_column_names,\n",
|
||||
" \"track_child_runs\": False,\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"forecasting_parameters = ForecastingParameters(\n",
|
||||
" time_column_name=\"WeekStarting\",\n",
|
||||
" forecast_horizon=6,\n",
|
||||
" time_series_id_column_names=partition_column_names,\n",
|
||||
" cv_step_size=\"auto\",\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"automl_settings = AutoMLConfig(\n",
|
||||
" task=\"forecasting\",\n",
|
||||
" primary_metric=\"normalized_root_mean_squared_error\",\n",
|
||||
" iteration_timeout_minutes=10,\n",
|
||||
" iterations=15,\n",
|
||||
" experiment_timeout_hours=0.25,\n",
|
||||
" label_column_name=\"Quantity\",\n",
|
||||
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
|
||||
" track_child_runs=False,\n",
|
||||
" forecasting_parameters=forecasting_parameters,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"mm_paramters = ManyModelsTrainParameters(\n",
|
||||
" automl_settings=automl_settings, partition_column_names=partition_column_names\n",
|
||||
@@ -449,7 +468,9 @@
|
||||
"\n",
|
||||
"Reuse of previous results (``allow_reuse``) is key when using pipelines in a collaborative environment since eliminating unnecessary reruns offers agility. Reuse is the default behavior when the ``script_name``, ``inputs``, and the parameters of a step remain the same. When reuse is allowed, results from the previous run are immediately sent to the next step. If ``allow_reuse`` is set to False, a new run will always be generated for this step during pipeline execution.\n",
|
||||
"\n",
|
||||
"> Note that we only support partitioned FileDataset and TabularDataset without partition when using such output as input."
|
||||
"> Note that we only support partitioned FileDataset and TabularDataset without partition when using such output as input.\n",
|
||||
"\n",
|
||||
"> Note that we **drop column** \"Revenue\" from the dataset in this step to avoid information leak as \"Quantity\" = \"Revenue\" / \"Price\". **Please modify the logic based on your data**."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -487,17 +508,25 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Parallel run step is leveraged to train multiple models at once. To configure the ParallelRunConfig you will need to determine the appropriate number of workers and nodes for your use case. The process_count_per_node is based off the number of cores of the compute VM. The node_count will determine the number of master nodes to use, increasing the node count will speed up the training process.\n",
|
||||
"Parallel run step is leveraged to train multiple models at once. To configure the ParallelRunConfig you will need to determine the appropriate number of workers and nodes for your use case. The ``process_count_per_node`` is based off the number of cores of the compute VM. The node_count will determine the number of master nodes to use, increasing the node count will speed up the training process.\n",
|
||||
"\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **experiment** | The experiment used for training. |\n",
|
||||
"| **train_data** | The file dataset to be used as input to the training run. |\n",
|
||||
"| **node_count** | The number of compute nodes to be used for running the user script. We recommend to start with 3 and increase the node_count if the training time is taking too long. |\n",
|
||||
"| **process_count_per_node** | Process count per node, we recommend 2:1 ratio for number of cores: number of processes per node. eg. If node has 16 cores then configure 8 or less process count per node or optimal performance. |\n",
|
||||
"| **process_count_per_node** | Process count per node, we recommend 2:1 ratio for number of cores: number of processes per node. eg. If node has 16 cores then configure 8 or less process count per node for optimal performance. |\n",
|
||||
"| **train_pipeline_parameters** | The set of configuration parameters defined in the previous section. |\n",
|
||||
"| **run_invocation_timeout** | Maximum amount of time in seconds that the ``ParallelRunStep`` class is allowed. This is optional but provides customers with greater control on exit criteria. This must be greater than ``experiment_timeout_hours`` by at least 300 seconds. |\n",
|
||||
"\n",
|
||||
"Calling this method will create a new aggregated dataset which is generated dynamically on pipeline execution."
|
||||
"Calling this method will create a new aggregated dataset which is generated dynamically on pipeline execution.\n",
|
||||
"\n",
|
||||
"**Note**: Total time taken for the **training step** in the pipeline to complete = $ \\frac{t}{ p \\times n } \\times ts $\n",
|
||||
"where,\n",
|
||||
"- $ t $ is time taken for training one partition (can be viewed in the training logs)\n",
|
||||
"- $ p $ is ``process_count_per_node``\n",
|
||||
"- $ n $ is ``node_count``\n",
|
||||
"- $ ts $ is total number of partitions in time series based on ``partition_column_names``"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -515,7 +544,7 @@
|
||||
" compute_target=compute_target,\n",
|
||||
" node_count=2,\n",
|
||||
" process_count_per_node=8,\n",
|
||||
" run_invocation_timeout=920,\n",
|
||||
" run_invocation_timeout=1200,\n",
|
||||
" train_pipeline_parameters=mm_paramters,\n",
|
||||
")"
|
||||
]
|
||||
@@ -596,7 +625,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 7.2 Schedule the pipeline\n",
|
||||
"### 5.2 Schedule the pipeline\n",
|
||||
"You can also [schedule the pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-schedule-pipelines) to run on a time-based or change-based schedule. This could be used to automatically retrain models every month or based on another trigger such as data drift."
|
||||
]
|
||||
},
|
||||
@@ -652,25 +681,31 @@
|
||||
"source": [
|
||||
"For many models we need to provide the ManyModelsInferenceParameters object.\n",
|
||||
"\n",
|
||||
"#### ManyModelsInferenceParameters arguments\n",
|
||||
"#### ``ManyModelsInferenceParameters`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **partition_column_names** | List of column names that identifies groups. |\n",
|
||||
"| **target_column_name** | \\[Optional] Column name only if the inference dataset has the target. |\n",
|
||||
"| **time_column_name** | \\[Optional] Column name only if it is timeseries. |\n",
|
||||
"| **many_models_run_id** | \\[Optional] Many models run id where models were trained. |\n",
|
||||
"| **time_column_name** | \\[Optional] Time column name only if it is timeseries. |\n",
|
||||
"| **inference_type** | \\[Optional] Which inference method to use on the model. Possible values are 'forecast', 'predict_proba', and 'predict'. |\n",
|
||||
"| **forecast_mode** | \\[Optional] The type of forecast to be used, either 'rolling' or 'recursive'; defaults to 'recursive'. |\n",
|
||||
"| **step** | \\[Optional] Number of periods to advance the forecasting window in each iteration **(for rolling forecast only)**; defaults to 1. |\n",
|
||||
"\n",
|
||||
"#### get_many_models_batch_inference_steps arguments\n",
|
||||
"#### ``get_many_models_batch_inference_steps`` arguments\n",
|
||||
"| Property | Description|\n",
|
||||
"| :--------------- | :------------------- |\n",
|
||||
"| **experiment** | The experiment used for inference run. |\n",
|
||||
"| **inference_data** | The data to use for inferencing. It should be the same schema as used for training.\n",
|
||||
"| **compute_target** The compute target that runs the inference pipeline.|\n",
|
||||
"| **compute_target** | The compute target that runs the inference pipeline. |\n",
|
||||
"| **node_count** | The number of compute nodes to be used for running the user script. We recommend to start with the number of cores per node (varies by compute sku). |\n",
|
||||
"| **process_count_per_node** The number of processes per node.\n",
|
||||
"| **train_run_id** | \\[Optional] The run id of the hierarchy training, by default it is the latest successful training many model run in the experiment. |\n",
|
||||
"| **process_count_per_node** | \\[Optional] The number of processes per node. By default it's 2 (should be at most half of the number of cores in a single node of the compute cluster that will be used for the experiment).\n",
|
||||
"| **inference_pipeline_parameters** | \\[Optional] The ``ManyModelsInferenceParameters`` object defined above. |\n",
|
||||
"| **append_row_file_name** | \\[Optional] The name of the output file (optional, default value is 'parallel_run_step.txt'). Supports 'txt' and 'csv' file extension. A 'txt' file extension generates the output in 'txt' format with space as separator without column names. A 'csv' file extension generates the output in 'csv' format with comma as separator and with column names. |\n",
|
||||
"| **train_run_id** | \\[Optional] The run id of the **training pipeline**. By default it is the latest successful training pipeline run in the experiment. |\n",
|
||||
"| **train_experiment_name** | \\[Optional] The train experiment that contains the train pipeline. This one is only needed when the train pipeline is not in the same experiement as the inference pipeline. |\n",
|
||||
"| **process_count_per_node** | \\[Optional] The number of processes per node, by default it's 4. |"
|
||||
"| **run_invocation_timeout** | \\[Optional] Maximum amount of time in seconds that the ``ParallelRunStep`` class is allowed. This is optional but provides customers with greater control on exit criteria. |\n",
|
||||
"| **output_datastore** | \\[Optional] The ``Datastore`` or ``OutputDatasetConfig`` to be used for output. If specified any pipeline output will be written to that location. If unspecified the default datastore will be used. |\n",
|
||||
"| **arguments** | \\[Optional] Arguments to be passed to inference script. Possible argument is '--forecast_quantiles' followed by quantile values. |"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -690,6 +725,8 @@
|
||||
" target_column_name=\"Quantity\",\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"output_file_name = \"parallel_run_step.csv\"\n",
|
||||
"\n",
|
||||
"inference_steps = AutoMLPipelineBuilder.get_many_models_batch_inference_steps(\n",
|
||||
" experiment=experiment,\n",
|
||||
" inference_data=inference_ds_small,\n",
|
||||
@@ -701,6 +738,8 @@
|
||||
" train_run_id=training_run.id,\n",
|
||||
" train_experiment_name=training_run.experiment.name,\n",
|
||||
" inference_pipeline_parameters=mm_parameters,\n",
|
||||
" append_row_file_name=output_file_name,\n",
|
||||
" arguments=[\"--forecast_quantiles\", 0.1, 0.9],\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
@@ -735,7 +774,7 @@
|
||||
"\n",
|
||||
"The following code snippet:\n",
|
||||
"1. Downloads the contents of the output folder that is passed in the parallel run step \n",
|
||||
"2. Reads the parallel_run_step.txt file that has the predictions as pandas dataframe and \n",
|
||||
"2. Reads the output file that has the predictions as pandas dataframe and \n",
|
||||
"3. Displays the top 10 rows of the predictions"
|
||||
]
|
||||
},
|
||||
@@ -750,19 +789,9 @@
|
||||
"forecasting_results_name = \"forecasting_results\"\n",
|
||||
"forecasting_output_name = \"many_models_inference_output\"\n",
|
||||
"forecast_file = get_output_from_mm_pipeline(\n",
|
||||
" inference_run, forecasting_results_name, forecasting_output_name\n",
|
||||
" inference_run, forecasting_results_name, forecasting_output_name, output_file_name\n",
|
||||
")\n",
|
||||
"df = pd.read_csv(forecast_file, delimiter=\" \", header=None)\n",
|
||||
"df.columns = [\n",
|
||||
" \"Week Starting\",\n",
|
||||
" \"Store\",\n",
|
||||
" \"Brand\",\n",
|
||||
" \"Quantity\",\n",
|
||||
" \"Advert\",\n",
|
||||
" \"Price\",\n",
|
||||
" \"Revenue\",\n",
|
||||
" \"Predicted\",\n",
|
||||
"]\n",
|
||||
"df = pd.read_csv(forecast_file)\n",
|
||||
"print(\n",
|
||||
" \"Prediction has \", df.shape[0], \" rows. Here the first 10 rows are being displayed.\"\n",
|
||||
")\n",
|
||||
@@ -835,9 +864,9 @@
|
||||
"automated-machine-learning"
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
@@ -849,7 +878,12 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.8"
|
||||
"version": "3.8.5"
|
||||
},
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
|
Before Width: | Height: | Size: 2.6 MiB After Width: | Height: | Size: 2.6 MiB |
@@ -11,6 +11,12 @@ def main(args):
|
||||
dataset = run_context.input_datasets["train_10_models"]
|
||||
df = dataset.to_pandas_dataframe()
|
||||
|
||||
# Drop the column "Revenue" from the dataset to avoid information leak as
|
||||
# "Quantity" = "Revenue" / "Price". Please modify the logic based on your data.
|
||||
drop_column_name = "Revenue"
|
||||
if drop_column_name in df.columns:
|
||||
df.drop(drop_column_name, axis=1, inplace=True)
|
||||
|
||||
# Apply any data pre-processing techniques here
|
||||
|
||||
df.to_parquet(output / "data_prepared_result.parquet", compression=None)
|
||||
|
||||
@@ -1,6 +1,7 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -10,6 +11,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -17,6 +19,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -34,18 +37,20 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Introduction<a id=\"introduction\"></a>\n",
|
||||
"In this example, we use AutoML to train, select, and operationalize a time-series forecasting model for multiple time-series.\n",
|
||||
"\n",
|
||||
"Make sure you have executed the [configuration notebook](../../../configuration.ipynb) before running this notebook.\n",
|
||||
"Make sure you have executed the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) before running this notebook.\n",
|
||||
"\n",
|
||||
"The examples in the follow code samples use the University of Chicago's Dominick's Finer Foods dataset to forecast orange juice sales. Dominick's was a grocery chain in the Chicago metropolitan area."
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -70,6 +75,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -86,6 +92,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -119,6 +126,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -158,6 +166,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -181,6 +190,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -201,6 +211,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -220,6 +231,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -250,6 +262,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -275,6 +288,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -291,6 +305,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -318,6 +333,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -356,6 +372,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -368,10 +385,12 @@
|
||||
"|**time_column_name**|The name of your time column.|\n",
|
||||
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
|
||||
"|**time_series_id_column_names**|The column names used to uniquely identify the time series in data that has multiple rows with the same timestamp. If the time series identifiers are not defined, the data set is assumed to be one time series.|\n",
|
||||
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
|
||||
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n",
|
||||
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value."
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -390,7 +409,7 @@
|
||||
"In the first case, AutoML loops over all time-series in your dataset and trains one model (e.g. AutoArima or Prophet, as the case may be) for each series. This can result in long runtimes to train these models if there are a lot of series in the data. One way to mitigate this problem is to fit models for different series in parallel if you have multiple compute cores available. To enable this behavior, set the `max_cores_per_iteration` parameter in your AutoMLConfig as shown in the example in the next cell. \n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Finally, a note about the cross-validation (CV) procedure for time-series data. AutoML uses out-of-sample error estimates to select a best pipeline/model, so it is important that the CV fold splitting is done correctly. Time-series can violate the basic statistical assumptions of the canonical K-Fold CV strategy, so AutoML implements a [rolling origin validation](https://robjhyndman.com/hyndsight/tscv/) procedure to create CV folds for time-series data. To use this procedure, you just need to specify the desired number of CV folds in the AutoMLConfig object. It is also possible to bypass CV and use your own validation set by setting the *validation_data* parameter of AutoMLConfig.\n",
|
||||
"Finally, a note about the cross-validation (CV) procedure for time-series data. AutoML uses out-of-sample error estimates to select a best pipeline/model, so it is important that the CV fold splitting is done correctly. Time-series can violate the basic statistical assumptions of the canonical K-Fold CV strategy, so AutoML implements a [rolling origin validation](https://robjhyndman.com/hyndsight/tscv/) procedure to create CV folds for time-series data. To use this procedure, you could specify the desired number of CV folds and the number of periods between two consecutive folds in the AutoMLConfig object, or AutoMl could set them automatically if you don't specify them. It is also possible to bypass CV and use your own validation set by setting the *validation_data* parameter of AutoMLConfig.\n",
|
||||
"\n",
|
||||
"Here is a summary of AutoMLConfig parameters used for training the OJ model:\n",
|
||||
"\n",
|
||||
@@ -403,7 +422,7 @@
|
||||
"|**training_data**|Input dataset, containing both features and label column.|\n",
|
||||
"|**label_column_name**|The name of the label column.|\n",
|
||||
"|**compute_target**|The remote compute for training.|\n",
|
||||
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection|\n",
|
||||
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value.\n",
|
||||
"|**enable_voting_ensemble**|Allow AutoML to create a Voting ensemble of the best performing models|\n",
|
||||
"|**enable_stack_ensemble**|Allow AutoML to create a Stack ensemble of the best performing models|\n",
|
||||
"|**debug_log**|Log file path for writing debugging information|\n",
|
||||
@@ -424,6 +443,7 @@
|
||||
" forecast_horizon=n_test_periods,\n",
|
||||
" time_series_id_column_names=time_series_id_column_names,\n",
|
||||
" freq=\"W-THU\", # Set the forecast frequency to be weekly (start on each Thursday)\n",
|
||||
" cv_step_size=\"auto\",\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"automl_config = AutoMLConfig(\n",
|
||||
@@ -436,7 +456,7 @@
|
||||
" compute_target=compute_target,\n",
|
||||
" enable_early_stopping=True,\n",
|
||||
" featurization=featurization_config,\n",
|
||||
" n_cross_validations=3,\n",
|
||||
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
|
||||
" verbosity=logging.INFO,\n",
|
||||
" max_cores_per_iteration=-1,\n",
|
||||
" forecasting_parameters=forecasting_parameters,\n",
|
||||
@@ -444,6 +464,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -470,6 +491,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -489,6 +511,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -526,6 +549,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -546,6 +570,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -554,6 +579,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -582,6 +608,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -637,6 +664,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -644,6 +672,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -666,6 +695,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -685,6 +715,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -713,7 +744,7 @@
|
||||
" description=\"Automl forecasting sample service\",\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"aci_service_name = \"automl-oj-forecast-01\"\n",
|
||||
"aci_service_name = \"automl-oj-forecast-03\"\n",
|
||||
"print(aci_service_name)\n",
|
||||
"aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)\n",
|
||||
"aci_service.wait_for_deployment(True)\n",
|
||||
@@ -730,6 +761,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -778,6 +810,7 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
@@ -790,7 +823,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"serv = Webservice(ws, \"automl-oj-forecast-01\")\n",
|
||||
"serv = Webservice(ws, \"automl-oj-forecast-03\")\n",
|
||||
"serv.delete() # don't do it accidentally"
|
||||
]
|
||||
}
|
||||
@@ -819,9 +852,9 @@
|
||||
"friendly_name": "Forecasting orange juice sales with deployment",
|
||||
"index_order": 1,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
@@ -833,12 +866,17 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.9"
|
||||
"version": "3.8.5"
|
||||
},
|
||||
"tags": [
|
||||
"None"
|
||||
],
|
||||
"task": "Forecasting"
|
||||
"task": "Forecasting",
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
|
||||
@@ -13,7 +13,7 @@
|
||||
"source": [
|
||||
"## Introduction\n",
|
||||
"\n",
|
||||
"In this notebook, we demonstrate how to use piplines to train and inference on AutoML Forecasting model. Two pipelines will be created: one for training AutoML model, and the other is for inference on AutoML model. We'll also demonstrate how to schedule the inference pipeline so you can get inference results periodically (with refreshed test dataset). Make sure you have executed the configuration notebook before running this notebook. In this notebook you will learn how to:\n",
|
||||
"In this notebook, we demonstrate how to use piplines to train and inference on AutoML Forecasting model. Two pipelines will be created: one for training AutoML model, and the other is for inference on AutoML model. We'll also demonstrate how to schedule the inference pipeline so you can get inference results periodically (with refreshed test dataset). Make sure you have executed the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) before running this notebook. In this notebook you will learn how to:\n",
|
||||
"\n",
|
||||
"- Configure AutoML using AutoMLConfig for forecasting tasks using pipeline AutoMLSteps.\n",
|
||||
"- Create and register an AutoML model using AzureML pipeline.\n",
|
||||
@@ -292,7 +292,8 @@
|
||||
"|**time_column_name**|The name of your time column.|\n",
|
||||
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
|
||||
"|**time_series_id_column_names**|The column names used to uniquely identify the time series in data that has multiple rows with the same timestamp. If the time series identifiers are not defined, the data set is assumed to be one time series.|\n",
|
||||
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
|
||||
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n",
|
||||
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -307,7 +308,8 @@
|
||||
" time_column_name=time_column_name,\n",
|
||||
" forecast_horizon=n_test_periods,\n",
|
||||
" time_series_id_column_names=time_series_id_column_names,\n",
|
||||
" freq=\"W-THU\", # Set the forecast frequency to be weekly (start on each Thursday)\n",
|
||||
" freq=\"W-THU\", # Set the forecast frequency to be weekly (start on each Thursday),\n",
|
||||
" cv_step_size=\"auto\",\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"automl_config = AutoMLConfig(\n",
|
||||
@@ -319,7 +321,7 @@
|
||||
" label_column_name=target_column_name,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" enable_early_stopping=True,\n",
|
||||
" n_cross_validations=5,\n",
|
||||
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
|
||||
" verbosity=logging.INFO,\n",
|
||||
" max_cores_per_iteration=-1,\n",
|
||||
" forecasting_parameters=forecasting_parameters,\n",
|
||||
@@ -797,9 +799,9 @@
|
||||
"friendly_name": "Forecasting orange juice sales with deployment",
|
||||
"index_order": 1,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
@@ -811,12 +813,17 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.9"
|
||||
"version": "3.8.5"
|
||||
},
|
||||
"tags": [
|
||||
"None"
|
||||
],
|
||||
"task": "Forecasting"
|
||||
"task": "Forecasting",
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
|
||||
@@ -2,25 +2,24 @@
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
|
||||
"\n",
|
||||
"Licensed under the MIT License."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this notebook we will explore the univaraite time-series data to determine the settings for an automated ML experiment. We will follow the thought process depicted in the following diagram:<br/>\n",
|
||||
"In this notebook we will explore the univariate time-series data to determine the settings for an automated ML experiment. We will follow the thought process depicted in the following diagram:<br/>\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"The objective is to answer the following questions:\n",
|
||||
@@ -32,22 +31,20 @@
|
||||
" </ul>\n",
|
||||
" <li>Is the data stationary? </li>\n",
|
||||
" <ul style=\"margin-top:-1px; list-style-type:none\"> \n",
|
||||
" <li> Importance: In the absense of features that capture trend behavior, ML models (regression and tree based) are not well equiped to predict stochastic trends. Working with stationary data solves this problem. </li>\n",
|
||||
" <li> Importance: In the absence of features that capture trend behavior, ML models (regression and tree based) are not well equipped to predict stochastic trends. Working with stationary data solves this problem. </li>\n",
|
||||
" </ul>\n",
|
||||
" <li>Is there a detectable auto-regressive pattern in the stationary data? </li>\n",
|
||||
" <ul style=\"margin-top:-1px; list-style-type:none\"> \n",
|
||||
" <li> Importance: The accuracy of ML models can be improved if serial correlation is modeled by including lags of the dependent/target varaible as features. Including target lags in every experiment by default will result in a regression in accuracy scores if such setting is not warranted. </li>\n",
|
||||
" <li> Importance: The accuracy of ML models can be improved if serial correlation is modeled by including lags of the dependent/target variable as features. Including target lags in every experiment by default will result in a regression in accuracy scores if such setting is not warranted. </li>\n",
|
||||
" </ul>\n",
|
||||
"</ol>\n",
|
||||
"\n",
|
||||
"The answers to these questions will help determine the appropriate settings for the automated ML experiment.\n"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import warnings\n",
|
||||
@@ -68,13 +65,13 @@
|
||||
"# set printing options\n",
|
||||
"pd.set_option(\"display.max_columns\", 500)\n",
|
||||
"pd.set_option(\"display.width\", 1000)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# load data\n",
|
||||
"main_data_loc = \"data\"\n",
|
||||
@@ -89,13 +86,13 @@
|
||||
"df.sort_values(by=TIME_COLNAME, inplace=True)\n",
|
||||
"df.set_index(TIME_COLNAME, inplace=True)\n",
|
||||
"df.head(2)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# plot the entire dataset\n",
|
||||
"fig, ax = plt.subplots(figsize=(6, 2), dpi=180)\n",
|
||||
@@ -103,20 +100,20 @@
|
||||
"ax.title.set_text(\"Original Data Series\")\n",
|
||||
"locs, labels = plt.xticks()\n",
|
||||
"plt.xticks(rotation=45)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The graph plots the alcohol sales in the United States. Because the data is trending, it can be difficult to see cycles, seasonality or other interestng behaviors due to the scaling issues. For example, if there is a seasonal pattern, which we will discuss later, we cannot see them on the trending data. In such case, it is worth plotting the same data in first differences."
|
||||
]
|
||||
"The graph plots the alcohol sales in the United States. Because the data is trending, it can be difficult to see cycles, seasonality or other interesting behaviors due to the scaling issues. For example, if there is a seasonal pattern, which we will discuss later, we cannot see them on the trending data. In such case, it is worth plotting the same data in first differences."
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# plot the entire dataset in first differences\n",
|
||||
"fig, ax = plt.subplots(figsize=(6, 2), dpi=180)\n",
|
||||
@@ -124,18 +121,20 @@
|
||||
"ax.title.set_text(\"Data in first differences\")\n",
|
||||
"locs, labels = plt.xticks()\n",
|
||||
"plt.xticks(rotation=45)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In the previous plot we observe that the data is more volatile towards the end of the series. This period coincides with the Covid-19 period, so we will exclude it from our experiment. Since in this example there are no user-provided features it is hard to make an argument that a model trained on the less volatile pre-covid data will be able to accurately predict the covid period."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 1. Seasonality\n",
|
||||
"\n",
|
||||
@@ -144,13 +143,11 @@
|
||||
"2. If it's seasonal, does the data exhibit a trend (up or down)?\n",
|
||||
"\n",
|
||||
"It is hard to visually detect seasonality when the data is trending. The reason being is scale of seasonal fluctuations is dwarfed by the range of the trend in the data. One way to deal with this is to de-trend the data by taking the first differences. We will discuss this in more detail in the next section."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# plot the entire dataset in first differences\n",
|
||||
"fig, ax = plt.subplots(figsize=(6, 2), dpi=180)\n",
|
||||
@@ -158,20 +155,20 @@
|
||||
"ax.title.set_text(\"Data in first differences\")\n",
|
||||
"locs, labels = plt.xticks()\n",
|
||||
"plt.xticks(rotation=45)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For the next plot, we will exclude the Covid period again. We will also shorten the length of data because plotting a very long time series may prevent us from seeing seasonal patterns, if there are any, because the plot may look like a random walk."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# remove COVID period\n",
|
||||
"df = df[:COVID_PERIOD_START]\n",
|
||||
@@ -182,11 +179,13 @@
|
||||
"ax.title.set_text(\"Data in first differences\")\n",
|
||||
"locs, labels = plt.xticks()\n",
|
||||
"plt.xticks(rotation=45)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<p style=\"font-size:150%; color:blue\"> Conclusion </p>\n",
|
||||
"\n",
|
||||
@@ -205,11 +204,11 @@
|
||||
" <li> In the first case, by taking first differences we are removing stochastic trend, but we do not remove seasonal patterns. In the second case, we do not remove the stochastic trend and it can be captured by the trend component of the STL decomposition. It is hard to say which option will work best in your case, hence you will need to run both options to see which one results in more accurate forecasts. </li>\n",
|
||||
" </ul>\n",
|
||||
"</ol>"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 2. Stationarity\n",
|
||||
"If the data does not exhibit seasonal patterns, we would like to see if the data is non-stationary. Particularly, we want to see if there is a clear trending behavior. If such behavior is observed, we would like to first difference the data and examine the plot of an auto-correlation function (ACF) known as correlogram. If the data is seasonal, differencing it will not get rid off the seasonality and this will be shown on the correlogram as well.\n",
|
||||
@@ -237,13 +236,11 @@
|
||||
"</ol>\n",
|
||||
"\n",
|
||||
"To answer the first question, we run a series of tests (we call them unit root tests)."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# unit root tests\n",
|
||||
"test = unit_root_test_wrapper(df[TARGET_COLNAME])\n",
|
||||
@@ -251,11 +248,13 @@
|
||||
"print(\"Summary table\", \"\\n\", test[\"summary\"], \"\\n\")\n",
|
||||
"print(\"Is the {} series stationary?: {}\".format(TARGET_COLNAME, test[\"stationary\"]))\n",
|
||||
"print(\"---------------\", \"\\n\")"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In the previous cell, we ran a series of unit root tests. The summary table contains the following columns:\n",
|
||||
"<ul> \n",
|
||||
@@ -277,13 +276,11 @@
|
||||
"Each of the tests shows that the original time series is non-stationary. The final decision is based on the majority rule. If, there is a split decision, the algorithm will claim it is stationary. We run a series of tests because each test by itself may not be accurate. In many cases when there are conflicting test results, the user needs to make determination if the series is stationary or not.\n",
|
||||
"\n",
|
||||
"Since we found the series to be non-stationary, we will difference it and then test if the differenced series is stationary."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# unit root tests\n",
|
||||
"test = unit_root_test_wrapper(df[TARGET_COLNAME].diff().dropna())\n",
|
||||
@@ -291,20 +288,20 @@
|
||||
"print(\"Summary table\", \"\\n\", test[\"summary\"], \"\\n\")\n",
|
||||
"print(\"Is the {} series stationary?: {}\".format(TARGET_COLNAME, test[\"stationary\"]))\n",
|
||||
"print(\"---------------\", \"\\n\")"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Four out of five tests show that the series in first differences is stationary. Notice that this decision is not unanimous. Next, let's plot the original series in first-differences to illustrate the difference between non-stationary (unit root) process vs the stationary one."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# plot original and stationary data\n",
|
||||
"fig = plt.figure(figsize=(10, 10))\n",
|
||||
@@ -314,29 +311,31 @@
|
||||
"ax2.plot(df[TARGET_COLNAME].diff().dropna(), \"-b\")\n",
|
||||
"ax1.title.set_text(\"Original data\")\n",
|
||||
"ax2.title.set_text(\"Data in first differences\")"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If you were asked a question \"What is the mean of the series before and after 2008?\", for the series titled \"Original data\" the mean values will be significantly different. This implies that the first moment of the series (in this case, it is the mean) is time dependent, i.e., mean changes depending on the interval one is looking at. Thus, the series is deemed to be non-stationary. On the other hand, for the series titled \"Data in first differences\" the means for both periods are roughly the same. Hence, the first moment is time invariant; meaning it does not depend on the interval of time one is looking at. In this example it is easy to visually distinguish between stationary and non-stationary data. Often this distinction is not easy to make, therefore we rely on the statistical tests described above to help us make an informed decision. "
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<p style=\"font-size:150%; color:blue\"> Conclusion </p>\n",
|
||||
"Since we found the original process to be non-stationary (contains unit root), we will have to model the data in first differences. As a result, we will set the DIFFERENCE_SERIES parameter to True."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 3 Check if there is a clear autoregressive pattern\n",
|
||||
"We need to determine if we should include lags of the target variable as features in order to improve forecast accuracy. To do this, we will examine the ACF and partial ACF (PACF) plots of the stationary series. In our case, it is a series in first diffrences.\n",
|
||||
"# 3 Check if there is a clear auto-regressive pattern\n",
|
||||
"We need to determine if we should include lags of the target variable as features in order to improve forecast accuracy. To do this, we will examine the ACF and partial ACF (PACF) plots of the stationary series. In our case, it is a series in first differences.\n",
|
||||
"\n",
|
||||
"<ul>\n",
|
||||
" <li> Question: What is an Auto-regressive pattern? What are we looking for? </li>\n",
|
||||
@@ -347,11 +346,11 @@
|
||||
" The lag order is on the x-axis while the auto- and partial-correlation coefficients are on the y-axis. Vertical lines that are outside the shaded area represent statistically significant lags. Notice, the ACF function decays to zero and the PACF shows 2 significant spikes (we ignore the first spike for lag 0 in both plots since the linear relationship of any series with itself is always 1). <li/>\n",
|
||||
" </ul>\n",
|
||||
"<ul/>"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<ul>\n",
|
||||
" <li> Question: What do I do if I observe an auto-regressive behavior? </li>\n",
|
||||
@@ -365,32 +364,32 @@
|
||||
" <br/>\n",
|
||||
" <li> Next, let's examine the ACF and PACF plots of the stationary target variable (depicted below). Here, we do not see a decay in the ACF, instead we see a decay in PACF. It is hard to make an argument the the target variable exhibits auto-regressive behavior. </li>\n",
|
||||
" </ul>"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Plot the ACF/PACF for the series in differences\n",
|
||||
"fig, ax = plt.subplots(1, 2, figsize=(10, 5))\n",
|
||||
"plot_acf(df[TARGET_COLNAME].diff().dropna().values.squeeze(), ax=ax[0])\n",
|
||||
"plot_pacf(df[TARGET_COLNAME].diff().dropna().values.squeeze(), ax=ax[1])\n",
|
||||
"plt.show()"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<p style=\"font-size:150%; color:blue\"> Conclusion </p>\n",
|
||||
"Since we do not see a clear indication of an AR(p) process, we will not be using target lags and will set the TARGET_LAGS parameter to None."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<p style=\"font-size:150%; color:blue; font-weight: bold\"> AutoML Experiment Settings </p>\n",
|
||||
"Based on the analysis performed, we should try the following settings for the AutoML experiment and use them in the \"2_run_experiment\" notebook.\n",
|
||||
@@ -399,11 +398,11 @@
|
||||
" <li> DIFFERENCE_SERIES=True </li>\n",
|
||||
" <li> TARGET_LAGS=None </li>\n",
|
||||
"</ul>"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Appendix: ACF, PACF and Lag Selection\n",
|
||||
"To do this, we will examine the ACF and partial ACF (PACF) plots of the differenced series. \n",
|
||||
@@ -418,23 +417,23 @@
|
||||
" </li>\n",
|
||||
" where $\\sigma_{xzy}$ is the covariance between two random variables $X$ and $Z$; $\\sigma_x$ and $\\sigma_z$ is the variance for $X$ and $Z$, respectively. The correlation coefficient measures the strength of linear relationship between two random variables. This metric can take any value from -1 to 1. <li/>\n",
|
||||
" <br/>\n",
|
||||
" <li> The auto-correlation coefficient $\\rho_{Y_{t} Y_{t-k}}$ is the time series equivalent of the correlation coefficient, except instead of measuring linear association between two random variables $X$ and $Z$, it measures the strength of a linear relationship between a random variable $Y_t$ and its lag $Y_{t-k}$ for any positive interger value of $k$. </li> \n",
|
||||
" <li> The auto-correlation coefficient $\\rho_{Y_{t} Y_{t-k}}$ is the time series equivalent of the correlation coefficient, except instead of measuring linear association between two random variables $X$ and $Z$, it measures the strength of a linear relationship between a random variable $Y_t$ and its lag $Y_{t-k}$ for any positive integer value of $k$. </li> \n",
|
||||
" <br />\n",
|
||||
" <li> To visualize the ACF for a particular lag, say lag 2, plot the second lag of a series $y_{t-2}$ on the x-axis, and plot the series itself $y_t$ on the y-axis. The autocorrelation coefficient is the slope of the best fitted regression line and can be interpreted as follows. A one unit increase in the lag of a variable one period ago leads to a $\\rho_{Y_{t} Y_{t-2}}$ units change in the variable in the current period. This interpreation can be applied to any lag. </li> \n",
|
||||
" <li> To visualize the ACF for a particular lag, say lag 2, plot the second lag of a series $y_{t-2}$ on the x-axis, and plot the series itself $y_t$ on the y-axis. The autocorrelation coefficient is the slope of the best fitted regression line and can be interpreted as follows. A one unit increase in the lag of a variable one period ago leads to a $\\rho_{Y_{t} Y_{t-2}}$ units change in the variable in the current period. This interpretation can be applied to any lag. </li> \n",
|
||||
" <br />\n",
|
||||
" <li> In the interpretation posted above we need to be careful not to confuse the word \"leads\" with \"causes\" since these are not the same thing. We do not know the lagged value of the varaible causes it to change. Afterall, there are probably many other features that may explain the movement in $Y_t$. All we are trying to do in this section is to identify situations when the variable contains the strong auto-regressive components that needs to be included in the model to improve forecast accuracy. </li>\n",
|
||||
" <li> In the interpretation posted above we need to be careful not to confuse the word \"leads\" with \"causes\" since these are not the same thing. We do not know the lagged value of the variable causes it to change. After all, there are probably many other features that may explain the movement in $Y_t$. All we are trying to do in this section is to identify situations when the variable contains the strong auto-regressive components that needs to be included in the model to improve forecast accuracy. </li>\n",
|
||||
" </ul>\n",
|
||||
"</ul>"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<ul>\n",
|
||||
" <li> Question: What is the PACF? </li>\n",
|
||||
" <ul style=\"list-style-type:none;\">\n",
|
||||
" <li> When describing the ACF we essentially running a regression between a partigular lag of a series, say, lag 4, and the series itself. What this implies is the regression coefficient for lag 4 captures the impact of everything that happens in lags 1, 2 and 3. In other words, if lag 1 is the most important lag and we exclude it from the regression, naturally, the regression model will assign the importance of the 1st lag to the 4th one. Partial auto-correlation function fixes this problem since it measures the contribution of each lag accounting for the information added by the intermediary lags. If we were to illustrate ACF and PACF for the fourth lag using the regression analogy, the difference is a follows: \n",
|
||||
" <li> When describing the ACF we essentially running a regression between a particular lag of a series, say, lag 4, and the series itself. What this implies is the regression coefficient for lag 4 captures the impact of everything that happens in lags 1, 2 and 3. In other words, if lag 1 is the most important lag and we exclude it from the regression, naturally, the regression model will assign the importance of the 1st lag to the 4th one. Partial auto-correlation function fixes this problem since it measures the contribution of each lag accounting for the information added by the intermediary lags. If we were to illustrate ACF and PACF for the fourth lag using the regression analogy, the difference is a follows: \n",
|
||||
" \\begin{align}\n",
|
||||
" Y_{t} &= a_{0} + a_{4} Y_{t-4} + e_{t} \\\\\n",
|
||||
" Y_{t} &= b_{0} + b_{1} Y_{t-1} + b_{2} Y_{t-2} + b_{3} Y_{t-3} + b_{4} Y_{t-4} + \\varepsilon_{t} \\\\\n",
|
||||
@@ -442,27 +441,28 @@
|
||||
" </li>\n",
|
||||
" <br/>\n",
|
||||
" <li>\n",
|
||||
" Here, you can think of $a_4$ and $b_{4}$ as the auto- and partial auto-correlation coefficients for lag 4. Notice, in the second equation we explicitely accounting for the intermediate lags by adding them as regrerssors.\n",
|
||||
" Here, you can think of $a_4$ and $b_{4}$ as the auto- and partial auto-correlation coefficients for lag 4. Notice, in the second equation we explicitly accounting for the intermediate lags by adding them as regressors.\n",
|
||||
" </li>\n",
|
||||
" </ul>\n",
|
||||
"</ul>"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<ul>\n",
|
||||
" <li> Question: Auto-regressive pattern? What are we looking for? </li>\n",
|
||||
" <ul style=\"list-style-type:none;\">\n",
|
||||
" <li> We are looking for a classical profiles for an AR(p) process such as an exponential decay of an ACF and a the first $p$ significant lags of the PACF. Let's examine the ACF/PACF profiles of the same simulated AR(2) shown in Section 3, and check if the ACF/PACF explanation are refelcted in these plots. <li/>\n",
|
||||
" <li> We are looking for a classical profiles for an AR(p) process such as an exponential decay of an ACF and a the first $p$ significant lags of the PACF. Let's examine the ACF/PACF profiles of the same simulated AR(2) shown in Section 3, and check if the ACF/PACF explanation are reflected in these plots. <li/>\n",
|
||||
" <li><img src=\"figures/ACF_PACF_for_AR2.png\" class=\"img_class\">\n",
|
||||
" <li> The autocorrelation coefficient for the 3rd lag is 0.6, which can be interpreted that a one unit increase in the value of the target varaible three periods ago leads to 0.6 units increase in the current period. However, the PACF plot shows that the partial autocorrealtion coefficient is zero (from a statistical point of view since it lies within the shaded region). This is happening because the 1st and 2nd lags are good predictors of the target variable. Ommiting these two lags from the regression results in the misleading conclusion that the third lag is a good prediciton. <li/>\n",
|
||||
" <li> The autocorrelation coefficient for the 3rd lag is 0.6, which can be interpreted that a one unit increase in the value of the target variable three periods ago leads to 0.6 units increase in the current period. However, the PACF plot shows that the partial autocorrelation coefficient is zero (from a statistical point of view since it lies within the shaded region). This is happening because the 1st and 2nd lags are good predictors of the target variable. Omitting these two lags from the regression results in the misleading conclusion that the third lag is a good prediction. <li/>\n",
|
||||
" <br/>\n",
|
||||
" <li> This is why it is important to examine both the ACF and the PACF plots when tring to determine the auto regressive order for the variable in question. <li/>\n",
|
||||
" <li> This is why it is important to examine both the ACF and the PACF plots when trying to determine the auto regressive order for the variable in question. <li/>\n",
|
||||
" </ul>\n",
|
||||
"</ul> "
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
@@ -472,21 +472,32 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"name": "python38-azureml",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"display_name": "Python 3.8 - AzureML"
|
||||
},
|
||||
"language_info": {
|
||||
"name": "python",
|
||||
"version": "3.8.10",
|
||||
"mimetype": "text/x-python",
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.9"
|
||||
"nbconvert_exporter": "python",
|
||||
"file_extension": ".py"
|
||||
},
|
||||
"microsoft": {
|
||||
"ms_spell_check": {
|
||||
"ms_spell_check_language": "en"
|
||||
}
|
||||
},
|
||||
"kernel_info": {
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"nteract": {
|
||||
"version": "nteract-front-end@1.0.0"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@@ -2,23 +2,22 @@
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
|
||||
"\n",
|
||||
"Licensed under the MIT License."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Running AutoML experiments\n",
|
||||
"\n",
|
||||
@@ -27,20 +26,18 @@
|
||||
"<br/>\n",
|
||||
"\n",
|
||||
"The output generated by this notebook is saved in the `experiment_output`folder."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Setup"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import logging\n",
|
||||
@@ -63,21 +60,21 @@
|
||||
"np.set_printoptions(precision=4, suppress=True, linewidth=100)\n",
|
||||
"pd.set_option(\"display.max_columns\", 500)\n",
|
||||
"pd.set_option(\"display.width\", 1000)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As part of the setup you have already created a **Workspace**. You will also need to create a [compute target](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.\n",
|
||||
"> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ws = Workspace.from_config()\n",
|
||||
"amlcompute_cluster_name = \"recipe-cluster\"\n",
|
||||
@@ -107,22 +104,22 @@
|
||||
"compute_target.wait_for_completion(\n",
|
||||
" show_output=True, min_node_count=None, timeout_in_minutes=20\n",
|
||||
")"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Data\n",
|
||||
"\n",
|
||||
"Here, we will load the data from the csv file and drop the Covid period."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"main_data_loc = \"data\"\n",
|
||||
"train_file_name = \"S4248SM144SCEN.csv\"\n",
|
||||
@@ -140,32 +137,34 @@
|
||||
"\n",
|
||||
"# remove the Covid period\n",
|
||||
"df = df.query('{} <= \"{}\"'.format(TIME_COLNAME, COVID_PERIOD_START))"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Set parameters\n",
|
||||
"\n",
|
||||
"The first set of parameters is based on the analysis performed in the `auto-ml-forecasting-univariate-recipe-experiment-settings` notebook. "
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# set parameters based on the settings notebook analysis\n",
|
||||
"DIFFERENCE_SERIES = True\n",
|
||||
"TARGET_LAGS = None\n",
|
||||
"STL_TYPE = None"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next, define additional parameters to be used in the <a href=\"https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig?view=azure-ml-py\"> AutoML config </a> class.\n",
|
||||
"\n",
|
||||
@@ -180,32 +179,30 @@
|
||||
" </ul>\n",
|
||||
" </li>\n",
|
||||
"</ul>\n"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# set other parameters\n",
|
||||
"FORECAST_HORIZON = 12\n",
|
||||
"TIME_SERIES_ID_COLNAMES = []\n",
|
||||
"BLOCKED_MODELS = []"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To run AutoML, you also need to create an **Experiment**. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# choose a name for the run history container in the workspace\n",
|
||||
"if isinstance(TARGET_LAGS, list):\n",
|
||||
@@ -232,38 +229,38 @@
|
||||
"pd.set_option(\"display.max_colwidth\", None)\n",
|
||||
"outputDf = pd.DataFrame(data=output, index=[\"\"])\n",
|
||||
"print(outputDf.T)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# create output directory\n",
|
||||
"output_dir = \"experiment_output/{}\".format(experiment_desc)\n",
|
||||
"if not os.path.exists(output_dir):\n",
|
||||
" os.makedirs(output_dir)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# difference data and test for unit root\n",
|
||||
"if DIFFERENCE_SERIES:\n",
|
||||
" df_delta = df.copy()\n",
|
||||
" df_delta[TARGET_COLNAME] = df[TARGET_COLNAME].diff()\n",
|
||||
" df_delta.dropna(axis=0, inplace=True)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# split the data into train and test set\n",
|
||||
"if DIFFERENCE_SERIES:\n",
|
||||
@@ -281,64 +278,51 @@
|
||||
" time_colname=TIME_COLNAME,\n",
|
||||
" ts_id_colnames=TIME_SERIES_ID_COLNAMES,\n",
|
||||
" )"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Upload files to the Datastore\n",
|
||||
"The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace) is paired with the storage account, which contains the default data store. We will use it to upload the bike share data and create [tabular dataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into tabular representation."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df_train.to_csv(\"train.csv\", index=False)\n",
|
||||
"df_test.to_csv(\"test.csv\", index=False)\n",
|
||||
"\n",
|
||||
"from azureml.data.dataset_factory import TabularDatasetFactory\n",
|
||||
"\n",
|
||||
"datastore = ws.get_default_datastore()\n",
|
||||
"datastore.upload_files(\n",
|
||||
" files=[\"./train.csv\"],\n",
|
||||
" target_path=\"uni-recipe-dataset/tabular/\",\n",
|
||||
" overwrite=True,\n",
|
||||
" show_progress=True,\n",
|
||||
"train_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
|
||||
" df_train, target=(datastore, \"dataset/\"), name=\"train\"\n",
|
||||
")\n",
|
||||
"datastore.upload_files(\n",
|
||||
" files=[\"./test.csv\"],\n",
|
||||
" target_path=\"uni-recipe-dataset/tabular/\",\n",
|
||||
" overwrite=True,\n",
|
||||
" show_progress=True,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"from azureml.core import Dataset\n",
|
||||
"\n",
|
||||
"train_dataset = Dataset.Tabular.from_delimited_files(\n",
|
||||
" path=[(datastore, \"uni-recipe-dataset/tabular/train.csv\")]\n",
|
||||
")\n",
|
||||
"test_dataset = Dataset.Tabular.from_delimited_files(\n",
|
||||
" path=[(datastore, \"uni-recipe-dataset/tabular/test.csv\")]\n",
|
||||
"test_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
|
||||
" df_test, target=(datastore, \"dataset/\"), name=\"test\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# print the first 5 rows of the Dataset\n",
|
||||
"train_dataset.to_pandas_dataframe().reset_index(drop=True).head(5)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Config AutoML"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"time_series_settings = {\n",
|
||||
" \"time_column_name\": TIME_COLNAME,\n",
|
||||
@@ -358,82 +342,83 @@
|
||||
" enable_early_stopping=True,\n",
|
||||
" training_data=train_dataset,\n",
|
||||
" label_column_name=TARGET_COLNAME,\n",
|
||||
" n_cross_validations=5,\n",
|
||||
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
|
||||
" cv_step_size=\"auto\",\n",
|
||||
" verbosity=logging.INFO,\n",
|
||||
" max_cores_per_iteration=-1,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" **time_series_settings,\n",
|
||||
")"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We will now run the experiment, you can go to Azure ML portal to view the run details."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"remote_run = experiment.submit(automl_config, show_output=False)\n",
|
||||
"remote_run.wait_for_completion()"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Retrieve the Best Run details\n",
|
||||
"Below we retrieve the best Run object from among all the runs in the experiment."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"best_run = remote_run.get_best_child()\n",
|
||||
"best_run"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Inference\n",
|
||||
"\n",
|
||||
"We now use the best fitted model from the AutoML Run to make forecasts for the test set. We will do batch scoring on the test dataset which should have the same schema as training dataset.\n",
|
||||
"\n",
|
||||
"The inference will run on a remote compute. In this example, it will re-use the training compute."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"test_experiment = Experiment(ws, experiment_name + \"_inference\")"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Retreiving forecasts from the model\n",
|
||||
"We have created a function called `run_forecast` that submits the test data to the best model determined during the training run and retrieves forecasts. This function uses a helper script `forecasting_script` which is uploaded and expecuted on the remote compute."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from run_forecast import run_remote_inference\n",
|
||||
"\n",
|
||||
@@ -447,31 +432,31 @@
|
||||
"remote_run.wait_for_completion(show_output=False)\n",
|
||||
"\n",
|
||||
"remote_run.download_file(\"outputs/predictions.csv\", f\"{output_dir}/predictions.csv\")"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Download the prediction result for metrics calcuation\n",
|
||||
"The test data with predictions are saved in artifact `outputs/predictions.csv`. We will use it to calculate accuracy metrics and vizualize predictions versus actuals."
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"X_trans = pd.read_csv(f\"{output_dir}/predictions.csv\", parse_dates=[TIME_COLNAME])\n",
|
||||
"X_trans.head()"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# convert forecast in differences to levels\n",
|
||||
"def convert_fcst_diff_to_levels(fcst, yt, df_orig):\n",
|
||||
@@ -485,13 +470,13 @@
|
||||
" )\n",
|
||||
" out.rename(columns={TARGET_COLNAME: \"actual_level\"}, inplace=True)\n",
|
||||
" return out"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"if DIFFERENCE_SERIES:\n",
|
||||
" # convert forecast in differences to the levels\n",
|
||||
@@ -505,20 +490,20 @@
|
||||
" fcst_df[\"predicted_level\"] = y_predictions\n",
|
||||
"\n",
|
||||
"del X_trans"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Calculate metrics and save output"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# compute metrics\n",
|
||||
"metrics_df = compute_metrics(fcst_df=fcst_df, metric_name=None, ts_id_colnames=None)\n",
|
||||
@@ -529,20 +514,20 @@
|
||||
"\n",
|
||||
"metrics_df.to_csv(os.path.join(output_dir, metrics_file_name), index=True)\n",
|
||||
"fcst_df.to_csv(os.path.join(output_dir, fcst_file_name), index=True)"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Generate and save visuals"
|
||||
]
|
||||
],
|
||||
"metadata": {}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"plot_df = df.query('{} > \"2010-01-01\"'.format(TIME_COLNAME))\n",
|
||||
"plot_df.set_index(TIME_COLNAME, inplace=True)\n",
|
||||
@@ -561,7 +546,10 @@
|
||||
"\n",
|
||||
"plt.setp(labels, rotation=45)\n",
|
||||
"plt.savefig(os.path.join(output_dir, plot_file_name))"
|
||||
]
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": null,
|
||||
"metadata": {}
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
@@ -571,21 +559,37 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"name": "python38-azureml",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"display_name": "Python 3.8 - AzureML"
|
||||
},
|
||||
"language_info": {
|
||||
"name": "python",
|
||||
"version": "3.8.5",
|
||||
"mimetype": "text/x-python",
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.9"
|
||||
"nbconvert_exporter": "python",
|
||||
"file_extension": ".py"
|
||||
},
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
|
||||
}
|
||||
},
|
||||
"microsoft": {
|
||||
"ms_spell_check": {
|
||||
"ms_spell_check_language": "en"
|
||||
}
|
||||
},
|
||||
"kernel_info": {
|
||||
"name": "python3"
|
||||
},
|
||||
"nteract": {
|
||||
"version": "nteract-front-end@1.0.0"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@@ -870,9 +870,9 @@
|
||||
"friendly_name": "Classification of credit card fraudulent transactions using Automated ML",
|
||||
"index_order": 5,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -859,8 +859,8 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%matplotlib inline\n",
|
||||
"test_pred = plt.scatter(y_test, y_pred_test, color=\"\")\n",
|
||||
"test_test = plt.scatter(y_test, y_test, color=\"g\")\n",
|
||||
"test_pred = plt.scatter(y_test, y_pred_test, c=[\"b\"])\n",
|
||||
"test_test = plt.scatter(y_test, y_test, c=[\"g\"])\n",
|
||||
"plt.legend(\n",
|
||||
" (test_pred, test_test), (\"prediction\", \"truth\"), loc=\"upper left\", fontsize=8\n",
|
||||
")\n",
|
||||
@@ -895,9 +895,9 @@
|
||||
"friendly_name": "Automated ML run with featurization and model explainability.",
|
||||
"index_order": 5,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -422,8 +422,8 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%matplotlib inline\n",
|
||||
"test_pred = plt.scatter(y_test, y_pred_test, color=\"\")\n",
|
||||
"test_test = plt.scatter(y_test, y_test, color=\"g\")\n",
|
||||
"test_pred = plt.scatter(y_test, y_pred_test, c=[\"b\"])\n",
|
||||
"test_test = plt.scatter(y_test, y_test, c=[\"g\"])\n",
|
||||
"plt.legend(\n",
|
||||
" (test_pred, test_test), (\"prediction\", \"truth\"), loc=\"upper left\", fontsize=8\n",
|
||||
")\n",
|
||||
@@ -449,9 +449,9 @@
|
||||
"automated-machine-learning"
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -429,9 +429,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -557,9 +557,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -161,9 +161,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -215,9 +215,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -482,9 +482,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -302,9 +302,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -86,7 +86,7 @@
|
||||
"source": [
|
||||
"In this example, we will be using and registering two models. \n",
|
||||
"\n",
|
||||
"First we will train two simple models on the [diabetes dataset](https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset) included with scikit-learn, serializing them to files in the current directory."
|
||||
"First we will train two simple models on the [diabetes dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset) included with scikit-learn, serializing them to files in the current directory."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -239,7 +239,7 @@
|
||||
"\n",
|
||||
"env = Environment(\"deploytocloudenv\")\n",
|
||||
"env.python.conda_dependencies.add_pip_package(\"joblib\")\n",
|
||||
"env.python.conda_dependencies.add_pip_package(\"numpy\")\n",
|
||||
"env.python.conda_dependencies.add_pip_package(\"numpy==1.23\")\n",
|
||||
"env.python.conda_dependencies.add_pip_package(\"scikit-learn=={}\".format(sklearn.__version__))"
|
||||
]
|
||||
},
|
||||
@@ -373,9 +373,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -91,7 +91,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import joblib\n",
|
||||
"import dill\n",
|
||||
"\n",
|
||||
"from sklearn.datasets import load_diabetes\n",
|
||||
"from sklearn.linear_model import Ridge\n",
|
||||
@@ -101,7 +101,7 @@
|
||||
"\n",
|
||||
"model = Ridge().fit(dataset_x, dataset_y)\n",
|
||||
"\n",
|
||||
"joblib.dump(model, 'sklearn_regression_model.pkl')"
|
||||
"dill.dump(model, open('sklearn_regression_model.pkl', 'wb'))"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -285,7 +285,8 @@
|
||||
" 'azureml-defaults',\n",
|
||||
" 'inference-schema[numpy-support]',\n",
|
||||
" 'joblib',\n",
|
||||
" 'numpy',\n",
|
||||
" 'dill==0.3.6',\n",
|
||||
" 'numpy==1.23',\n",
|
||||
" 'scikit-learn=={}'.format(sklearn.__version__)\n",
|
||||
"])"
|
||||
]
|
||||
@@ -486,7 +487,8 @@
|
||||
" 'azureml-defaults',\n",
|
||||
" 'inference-schema[numpy-support]',\n",
|
||||
" 'joblib',\n",
|
||||
" 'numpy',\n",
|
||||
" 'dill==0.3.6',\n",
|
||||
" 'numpy==1.23',\n",
|
||||
" 'scikit-learn=={}'.format(sklearn.__version__)\n",
|
||||
"])\n",
|
||||
"inference_config = InferenceConfig(entry_script='score.py', environment=environment)\n",
|
||||
@@ -541,7 +543,7 @@
|
||||
" - To run a local web service, see the [notebook on deployment to a local Docker container](../deploy-to-local/register-model-deploy-local.ipynb).\n",
|
||||
" - For more information on datasets, see the [notebook on training with datasets](../../work-with-data/datasets-tutorial/train-with-datasets/train-with-datasets.ipynb).\n",
|
||||
" - For more information on environments, see the [notebook on using environments](../../training/using-environments/using-environments.ipynb).\n",
|
||||
" - For information on all the available deployment targets, see [“How and where to deploy models”](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-and-where#choose-a-compute-target)."
|
||||
" - For information on all the available deployment targets, see [“How and where to deploy models”](https://docs.microsoft.com/azure/machine-learning/v1/how-to-deploy-and-where#choose-a-compute-target)."
|
||||
]
|
||||
}
|
||||
],
|
||||
@@ -568,9 +570,9 @@
|
||||
"friendly_name": "Register model and deploy as webservice",
|
||||
"index_order": 3,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -473,9 +473,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -529,9 +529,9 @@
|
||||
"friendly_name": "Register a model and deploy locally",
|
||||
"index_order": 1,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -1,373 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
|
||||
"\n",
|
||||
"Licensed under the MIT License."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Deploy models to Azure Kubernetes Service (AKS) using controlled roll out\n",
|
||||
"This notebook will show you how to deploy mulitple AKS webservices with the same scoring endpoint and how to roll out your models in a controlled manner by configuring % of scoring traffic going to each webservice. If you are using a Notebook VM, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) to install the Azure Machine Learning Python SDK and create an Azure ML Workspace."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Check for latest version\n",
|
||||
"import azureml.core\n",
|
||||
"print(azureml.core.VERSION)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Initialize workspace\n",
|
||||
"Create a [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace%28class%29?view=azure-ml-py) object from your persisted configuration."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.workspace import Workspace\n",
|
||||
"\n",
|
||||
"ws = Workspace.from_config()\n",
|
||||
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Register the model\n",
|
||||
"Register a file or folder as a model by calling [Model.register()](https://docs.microsoft.com/python/api/azureml-core/azureml.core.model.model?view=azure-ml-py#register-workspace--model-path--model-name--tags-none--properties-none--description-none--datasets-none--model-framework-none--model-framework-version-none--child-paths-none-).\n",
|
||||
"In addition to the content of the model file itself, your registered model will also store model metadata -- model description, tags, and framework information -- that will be useful when managing and deploying models in your workspace. Using tags, for instance, you can categorize your models and apply filters when listing models in your workspace."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core import Model\n",
|
||||
"\n",
|
||||
"model = Model.register(workspace=ws,\n",
|
||||
" model_name='sklearn_regression_model.pkl', # Name of the registered model in your workspace.\n",
|
||||
" model_path='./sklearn_regression_model.pkl', # Local file to upload and register as a model.\n",
|
||||
" model_framework=Model.Framework.SCIKITLEARN, # Framework used to create the model.\n",
|
||||
" model_framework_version='0.19.1', # Version of scikit-learn used to create the model.\n",
|
||||
" description='Ridge regression model to predict diabetes progression.',\n",
|
||||
" tags={'area': 'diabetes', 'type': 'regression'})\n",
|
||||
"\n",
|
||||
"print('Name:', model.name)\n",
|
||||
"print('Version:', model.version)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Register an environment (for all models)\n",
|
||||
"\n",
|
||||
"If you control over how your model is run, or if it has special runtime requirements, you can specify your own environment and scoring method.\n",
|
||||
"\n",
|
||||
"Specify the model's runtime environment by creating an [Environment](https://docs.microsoft.com/python/api/azureml-core/azureml.core.environment%28class%29?view=azure-ml-py) object and providing the [CondaDependencies](https://docs.microsoft.com/python/api/azureml-core/azureml.core.conda_dependencies.condadependencies?view=azure-ml-py) needed by your model."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core import Environment\n",
|
||||
"from azureml.core.conda_dependencies import CondaDependencies\n",
|
||||
"\n",
|
||||
"environment=Environment('my-sklearn-environment')\n",
|
||||
"environment.python.conda_dependencies = CondaDependencies.create(conda_packages=[\n",
|
||||
" 'pip==20.2.4'],\n",
|
||||
" pip_packages=[\n",
|
||||
" 'azureml-defaults',\n",
|
||||
" 'inference-schema[numpy-support]',\n",
|
||||
" 'numpy',\n",
|
||||
" 'scikit-learn==0.22.1',\n",
|
||||
" 'scipy'\n",
|
||||
"])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"When using a custom environment, you must also provide Python code for initializing and running your model. An example script is included with this notebook."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"with open('score.py') as f:\n",
|
||||
" print(f.read())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Create the InferenceConfig\n",
|
||||
"Create the inference configuration to reference your environment and entry script during deployment"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.model import InferenceConfig\n",
|
||||
"\n",
|
||||
"inference_config = InferenceConfig(entry_script='score.py', \n",
|
||||
" source_directory='.',\n",
|
||||
" environment=environment)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Provision the AKS Cluster\n",
|
||||
"If you already have an AKS cluster attached to this workspace, skip the step below and provide the name of the cluster.\n",
|
||||
"\n",
|
||||
"> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.compute import AksCompute\n",
|
||||
"from azureml.core.compute import ComputeTarget\n",
|
||||
"# Use the default configuration (can also provide parameters to customize)\n",
|
||||
"prov_config = AksCompute.provisioning_configuration()\n",
|
||||
"\n",
|
||||
"aks_name = 'my-aks' \n",
|
||||
"# Create the cluster\n",
|
||||
"aks_target = ComputeTarget.create(workspace = ws, \n",
|
||||
" name = aks_name, \n",
|
||||
" provisioning_configuration = prov_config) \n",
|
||||
"aks_target.wait_for_completion(show_output=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Create an Endpoint and add a version (AKS service)\n",
|
||||
"This creates a new endpoint and adds a version behind it. By default the first version added is the default version. You can specify the traffic percentile a version takes behind an endpoint. \n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# deploying the model and create a new endpoint\n",
|
||||
"from azureml.core.webservice import AksEndpoint\n",
|
||||
"# from azureml.core.compute import ComputeTarget\n",
|
||||
"\n",
|
||||
"#select a created compute\n",
|
||||
"compute = ComputeTarget(ws, 'my-aks')\n",
|
||||
"namespace_name=\"endpointnamespace\"\n",
|
||||
"# define the endpoint name\n",
|
||||
"endpoint_name = \"myendpoint1\"\n",
|
||||
"# define the service name\n",
|
||||
"version_name= \"versiona\"\n",
|
||||
"\n",
|
||||
"endpoint_deployment_config = AksEndpoint.deploy_configuration(tags = {'modelVersion':'firstversion', 'department':'finance'}, \n",
|
||||
" description = \"my first version\", namespace = namespace_name, \n",
|
||||
" version_name = version_name, traffic_percentile = 40)\n",
|
||||
"\n",
|
||||
"endpoint = Model.deploy(ws, endpoint_name, [model], inference_config, endpoint_deployment_config, compute)\n",
|
||||
"endpoint.wait_for_deployment(True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"endpoint.get_logs()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Add another version of the service to an existing endpoint\n",
|
||||
"This adds another version behind an existing endpoint. You can specify the traffic percentile the new version takes. If no traffic_percentile is specified then it defaults to 0. All the unspecified traffic percentile (in this example 50) across all versions goes to default version."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Adding a new version to an existing Endpoint.\n",
|
||||
"version_name_add=\"versionb\" \n",
|
||||
"\n",
|
||||
"endpoint.create_version(version_name = version_name_add, inference_config=inference_config, models=[model], tags = {'modelVersion':'secondversion', 'department':'finance'}, \n",
|
||||
" description = \"my second version\", traffic_percentile = 10)\n",
|
||||
"endpoint.wait_for_deployment(True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Update an existing version in an endpoint\n",
|
||||
"There are two types of versions: control and treatment. An endpoint contains one or more treatment versions but only one control version. This categorization helps compare the different versions against the defined control version."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"endpoint.update_version(version_name=endpoint.versions[version_name_add].name, description=\"my second version update\", traffic_percentile=40, is_default=True, is_control_version_type=True)\n",
|
||||
"endpoint.wait_for_deployment(True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Test the web service using run method\n",
|
||||
"Test the web sevice by passing in data. Run() method retrieves API keys behind the scenes to make sure that call is authenticated."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Scoring on endpoint\n",
|
||||
"import json\n",
|
||||
"test_sample = json.dumps({'data': [\n",
|
||||
" [1,2,3,4,5,6,7,8,9,10], \n",
|
||||
" [10,9,8,7,6,5,4,3,2,1]\n",
|
||||
"]})\n",
|
||||
"\n",
|
||||
"test_sample_encoded = bytes(test_sample, encoding='utf8')\n",
|
||||
"prediction = endpoint.run(input_data=test_sample_encoded)\n",
|
||||
"print(prediction)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Delete Resources"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# deleting a version in an endpoint\n",
|
||||
"endpoint.delete_version(version_name=version_name)\n",
|
||||
"endpoint.wait_for_deployment(True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# deleting an endpoint, this will delete all versions in the endpoint and the endpoint itself\n",
|
||||
"endpoint.delete()"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "shipatel"
|
||||
}
|
||||
],
|
||||
"category": "deployment",
|
||||
"compute": [
|
||||
"None"
|
||||
],
|
||||
"datasets": [
|
||||
"Diabetes"
|
||||
],
|
||||
"deployment": [
|
||||
"Azure Kubernetes Service"
|
||||
],
|
||||
"exclude_from_index": false,
|
||||
"framework": [
|
||||
"Scikit-learn"
|
||||
],
|
||||
"friendly_name": "Deploy models to AKS using controlled roll out",
|
||||
"index_order": 3,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.0"
|
||||
},
|
||||
"star_tag": [
|
||||
"featured"
|
||||
],
|
||||
"tags": [
|
||||
"None"
|
||||
],
|
||||
"task": "Deploy a model with Azure Machine Learning"
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -1,4 +0,0 @@
|
||||
name: deploy-aks-with-controlled-rollout
|
||||
dependencies:
|
||||
- pip:
|
||||
- azureml-sdk
|
||||
@@ -1,28 +0,0 @@
|
||||
import pickle
|
||||
import json
|
||||
import numpy
|
||||
from sklearn.externals import joblib
|
||||
from sklearn.linear_model import Ridge
|
||||
from azureml.core.model import Model
|
||||
|
||||
|
||||
def init():
|
||||
global model
|
||||
# note here "sklearn_regression_model.pkl" is the name of the model registered under
|
||||
# this is a different behavior than before when the code is run locally, even though the code is the same.
|
||||
model_path = Model.get_model_path('sklearn_regression_model.pkl')
|
||||
# deserialize the model file back into a sklearn model
|
||||
model = joblib.load(model_path)
|
||||
|
||||
|
||||
# note you can pass in multiple rows for scoring
|
||||
def run(raw_data):
|
||||
try:
|
||||
data = json.loads(raw_data)['data']
|
||||
data = numpy.array(data)
|
||||
result = model.predict(data)
|
||||
# you can return any data type as long as it is JSON-serializable
|
||||
return result.tolist()
|
||||
except Exception as e:
|
||||
error = str(e)
|
||||
return error
|
||||
Binary file not shown.
@@ -476,9 +476,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -405,9 +405,9 @@
|
||||
"friendly_name": "Convert and deploy TinyYolo with ONNX Runtime",
|
||||
"index_order": 5,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -773,9 +773,9 @@
|
||||
"friendly_name": "Deploy Facial Expression Recognition (FER+) with ONNX Runtime",
|
||||
"index_order": 2,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -750,9 +750,9 @@
|
||||
"friendly_name": "Deploy MNIST digit recognition with ONNX Runtime",
|
||||
"index_order": 1,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -206,9 +206,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -389,9 +389,9 @@
|
||||
"friendly_name": "Deploy ResNet50 with ONNX Runtime",
|
||||
"index_order": 4,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -564,9 +564,9 @@
|
||||
"friendly_name": "Train MNIST in PyTorch, convert, and deploy with ONNX Runtime",
|
||||
"index_order": 3,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -240,9 +240,9 @@
|
||||
"# Please see [Azure ML Containers repository](https://github.com/Azure/AzureML-Containers#featured-tags)\n",
|
||||
"# for open-sourced GPU base images.\n",
|
||||
"env.docker.base_image = DEFAULT_GPU_IMAGE\n",
|
||||
"env.python.conda_dependencies = CondaDependencies.create(python_version=\"3.6.2\", \n",
|
||||
"env.python.conda_dependencies = CondaDependencies.create(python_version=\"3.6.2\", pin_sdk_version=False,\n",
|
||||
" conda_packages=['tensorflow-gpu==1.12.0','numpy'],\n",
|
||||
" pip_packages=['azureml-contrib-services', 'azureml-defaults'])\n",
|
||||
" pip_packages=['azureml-contrib-services==1.47.0', 'azureml-defaults==1.47.0'])\n",
|
||||
"\n",
|
||||
"inference_config = InferenceConfig(entry_script=\"score.py\", environment=env)\n",
|
||||
"aks_config = AksWebservice.deploy_configuration()\n",
|
||||
@@ -329,9 +329,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
@@ -343,7 +343,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.6"
|
||||
"version": "3.7.0"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@@ -213,7 +213,7 @@
|
||||
"\n",
|
||||
"> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.\n",
|
||||
"\n",
|
||||
"See code snippet below. Check the documentation [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-secure-web-service) for more details"
|
||||
"See code snippet below. Check the documentation [here](https://docs.microsoft.com/azure/machine-learning/v1/how-to-secure-web-service) for more details"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -334,9 +334,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -5,4 +5,4 @@ dependencies:
|
||||
- matplotlib
|
||||
- tqdm
|
||||
- scipy
|
||||
- sklearn
|
||||
- scikit-learn
|
||||
|
||||
@@ -366,7 +366,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Create AKS Cluster in an existing virtual network (optional)\n",
|
||||
"See code snippet below. Check the documentation [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-enable-virtual-network#use-azure-kubernetes-service) for more details."
|
||||
"See code snippet below. Check the documentation [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-network-security-overview) for more details."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -397,7 +397,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Enable SSL on the AKS Cluster (optional)\n",
|
||||
"See code snippet below. Check the documentation [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-secure-web-service) for more details"
|
||||
"See code snippet below. Check the documentation [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-network-security-overview#secure-the-inferencing-environment-v1) for more details"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -603,9 +603,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -5,4 +5,4 @@ dependencies:
|
||||
- matplotlib
|
||||
- tqdm
|
||||
- scipy
|
||||
- sklearn
|
||||
- scikit-learn
|
||||
|
||||
@@ -137,7 +137,7 @@
|
||||
"myenv = Environment('my-pyspark-environment')\r\n",
|
||||
"myenv.docker.base_image = \"mcr.microsoft.com/mmlspark/release:0.15\"\r\n",
|
||||
"myenv.inferencing_stack_version = \"latest\"\r\n",
|
||||
"myenv.python.conda_dependencies = CondaDependencies.create(pip_packages=[\"azureml-core\",\"azureml-defaults\",\"azureml-telemetry\",\"azureml-train-restclients-hyperdrive\",\"azureml-train-core\"], python_version=\"3.6.2\")\r\n",
|
||||
"myenv.python.conda_dependencies = CondaDependencies.create(pip_packages=[\"azureml-core\",\"azureml-defaults\",\"azureml-telemetry\",\"azureml-train-restclients-hyperdrive\",\"azureml-train-core\"], python_version=\"3.7.0\")\r\n",
|
||||
"myenv.python.conda_dependencies.add_channel(\"conda-forge\")\r\n",
|
||||
"myenv.spark.packages = [SparkPackage(\"com.microsoft.ml.spark\", \"mmlspark_2.11\", \"0.15\"), SparkPackage(\"com.microsoft.azure\", \"azure-storage\", \"2.0.0\"), SparkPackage(\"org.apache.hadoop\", \"hadoop-azure\", \"2.7.0\")]\r\n",
|
||||
"myenv.spark.repositories = [\"https://mmlspark.azureedge.net/maven\"]\r\n"
|
||||
@@ -327,9 +327,9 @@
|
||||
],
|
||||
"friendly_name": "Register Spark model and deploy as webservice",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
@@ -341,7 +341,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.2"
|
||||
"version": "3.7.0"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
@@ -106,7 +106,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"This notebook was created using version 1.44.0 of the Azure ML SDK\")\n",
|
||||
"print(\"This notebook was created using version 1.51.0 of the Azure ML SDK\")\n",
|
||||
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
|
||||
]
|
||||
},
|
||||
@@ -235,14 +235,30 @@
|
||||
"# Note: this is to pin the pandas and xgboost versions to be same as notebook.\n",
|
||||
"# In production scenario user would choose their dependencies\n",
|
||||
"import pkg_resources\n",
|
||||
"from distutils.version import LooseVersion\n",
|
||||
"available_packages = pkg_resources.working_set\n",
|
||||
"pandas_ver = None\n",
|
||||
"numpy_ver = None\n",
|
||||
"sklearn_ver = None\n",
|
||||
"for dist in list(available_packages):\n",
|
||||
" if dist.key == 'pandas':\n",
|
||||
" pandas_ver = dist.version\n",
|
||||
" if dist.key == 'numpy':\n",
|
||||
" if LooseVersion(dist.version) >= LooseVersion('1.20.0'):\n",
|
||||
" numpy_ver = dist.version\n",
|
||||
" else:\n",
|
||||
" numpy_ver = '1.21.6'\n",
|
||||
" if dist.key == 'scikit-learn':\n",
|
||||
" sklearn_ver = dist.version\n",
|
||||
"pandas_dep = 'pandas'\n",
|
||||
"numpy_dep = 'numpy'\n",
|
||||
"sklearn_dep = 'scikit-learn'\n",
|
||||
"if pandas_ver:\n",
|
||||
" pandas_dep = 'pandas=={}'.format(pandas_ver)\n",
|
||||
"if numpy_ver:\n",
|
||||
" numpy_dep = 'numpy=={}'.format(numpy_ver)\n",
|
||||
"if sklearn_ver:\n",
|
||||
" sklearn_dep = 'scikit-learn=={}'.format(sklearn_ver)\n",
|
||||
"\n",
|
||||
"# Note: we build shap at commit 690245 for Tesla K80 GPUs\n",
|
||||
"env.docker.base_dockerfile = f\"\"\"\n",
|
||||
@@ -282,7 +298,9 @@
|
||||
"pip uninstall -y xgboost && \\\n",
|
||||
"conda install py-xgboost==1.3.3 && \\\n",
|
||||
"pip uninstall -y numpy && \\\n",
|
||||
"conda install numpy==1.20.3 \\\n",
|
||||
"pip install {numpy_dep} && \\\n",
|
||||
"pip install {sklearn_dep} && \\\n",
|
||||
"pip install chardet \\\n",
|
||||
"\"\"\"\n",
|
||||
"\n",
|
||||
"env.python.user_managed_dependencies = True\n",
|
||||
@@ -477,9 +495,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -7,12 +7,12 @@ dependencies:
|
||||
- flask
|
||||
- flask-cors
|
||||
- gevent>=1.3.6
|
||||
- jinja2
|
||||
- ipython
|
||||
- matplotlib
|
||||
- ipywidgets
|
||||
- raiwidgets~=0.19.0
|
||||
- raiwidgets~=0.26.0
|
||||
- itsdangerous==2.0.1
|
||||
- markupsafe<2.1.0
|
||||
- scipy>=1.5.3
|
||||
- protobuf==3.20.0
|
||||
- jinja2==3.0.3
|
||||
|
||||
@@ -269,23 +269,29 @@
|
||||
"available_packages = pkg_resources.working_set\n",
|
||||
"sklearn_ver = None\n",
|
||||
"pandas_ver = None\n",
|
||||
"joblib_ver = None\n",
|
||||
"for dist in list(available_packages):\n",
|
||||
" if dist.key == 'scikit-learn':\n",
|
||||
" sklearn_ver = dist.version\n",
|
||||
" elif dist.key == 'pandas':\n",
|
||||
" pandas_ver = dist.version\n",
|
||||
" elif dist.key == 'joblib':\n",
|
||||
" joblib_ver = dist.version\n",
|
||||
"sklearn_dep = 'scikit-learn'\n",
|
||||
"pandas_dep = 'pandas'\n",
|
||||
"joblib_dep = 'joblib'\n",
|
||||
"if sklearn_ver:\n",
|
||||
" sklearn_dep = 'scikit-learn=={}'.format(sklearn_ver)\n",
|
||||
"if pandas_ver:\n",
|
||||
" pandas_dep = 'pandas=={}'.format(pandas_ver)\n",
|
||||
"if joblib_ver:\n",
|
||||
" joblib_dep = 'joblib=={}'.format(joblib_ver)\n",
|
||||
"# Specify CondaDependencies obj\n",
|
||||
"# The CondaDependencies specifies the conda and pip packages that are installed in the environment\n",
|
||||
"# the submitted job is run in. Note the remote environment(s) needs to be similar to the local\n",
|
||||
"# environment, otherwise if a model is trained or deployed in a different environment this can\n",
|
||||
"# cause errors. Please take extra care when specifying your dependencies in a production environment.\n",
|
||||
"azureml_pip_packages.extend([sklearn_dep, pandas_dep])\n",
|
||||
"azureml_pip_packages.extend([sklearn_dep, pandas_dep, joblib_dep])\n",
|
||||
"run_config.environment.python.conda_dependencies = CondaDependencies.create(pip_packages=azureml_pip_packages, python_version=python_version)\n",
|
||||
"\n",
|
||||
"from azureml.core import ScriptRunConfig\n",
|
||||
@@ -490,9 +496,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -6,13 +6,13 @@ dependencies:
|
||||
- flask
|
||||
- flask-cors
|
||||
- gevent>=1.3.6
|
||||
- jinja2
|
||||
- ipython
|
||||
- matplotlib
|
||||
- azureml-dataset-runtime
|
||||
- ipywidgets
|
||||
- raiwidgets~=0.19.0
|
||||
- raiwidgets~=0.26.0
|
||||
- itsdangerous==2.0.1
|
||||
- markupsafe<2.1.0
|
||||
- scipy>=1.5.3
|
||||
- protobuf==3.20.0
|
||||
- jinja2==3.0.3
|
||||
|
||||
@@ -595,9 +595,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -6,13 +6,13 @@ dependencies:
|
||||
- flask
|
||||
- flask-cors
|
||||
- gevent>=1.3.6
|
||||
- jinja2
|
||||
- ipython
|
||||
- matplotlib
|
||||
- ipywidgets
|
||||
- raiwidgets~=0.19.0
|
||||
- raiwidgets~=0.26.0
|
||||
- packaging>=20.9
|
||||
- itsdangerous==2.0.1
|
||||
- markupsafe<2.1.0
|
||||
- scipy>=1.5.3
|
||||
- protobuf==3.20.0
|
||||
- jinja2==3.0.3
|
||||
|
||||
@@ -516,9 +516,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -6,13 +6,13 @@ dependencies:
|
||||
- flask
|
||||
- flask-cors
|
||||
- gevent>=1.3.6
|
||||
- jinja2
|
||||
- ipython
|
||||
- matplotlib
|
||||
- ipywidgets
|
||||
- raiwidgets~=0.19.0
|
||||
- raiwidgets~=0.26.0
|
||||
- packaging>=20.9
|
||||
- itsdangerous==2.0.1
|
||||
- markupsafe<2.1.0
|
||||
- scipy>=1.5.3
|
||||
- protobuf==3.20.0
|
||||
- jinja2==3.0.3
|
||||
|
||||
@@ -277,23 +277,29 @@
|
||||
"available_packages = pkg_resources.working_set\n",
|
||||
"sklearn_ver = None\n",
|
||||
"pandas_ver = None\n",
|
||||
"joblib_ver = None\n",
|
||||
"for dist in available_packages:\n",
|
||||
" if dist.key == 'scikit-learn':\n",
|
||||
" sklearn_ver = dist.version\n",
|
||||
" elif dist.key == 'pandas':\n",
|
||||
" pandas_ver = dist.version\n",
|
||||
" elif dist.key == 'joblib':\n",
|
||||
" joblib_ver = dist.version\n",
|
||||
"sklearn_dep = 'scikit-learn'\n",
|
||||
"pandas_dep = 'pandas'\n",
|
||||
"joblib_dep = 'joblib'\n",
|
||||
"if sklearn_ver:\n",
|
||||
" sklearn_dep = 'scikit-learn=={}'.format(sklearn_ver)\n",
|
||||
"if pandas_ver:\n",
|
||||
" pandas_dep = 'pandas=={}'.format(pandas_ver)\n",
|
||||
"if joblib_ver:\n",
|
||||
" joblib_dep = 'joblib=={}'.format(joblib_ver)\n",
|
||||
"# Specify CondaDependencies obj\n",
|
||||
"# The CondaDependencies specifies the conda and pip packages that are installed in the environment\n",
|
||||
"# the submitted job is run in. Note the remote environment(s) needs to be similar to the local\n",
|
||||
"# environment, otherwise if a model is trained or deployed in a different environment this can\n",
|
||||
"# cause errors. Please take extra care when specifying your dependencies in a production environment.\n",
|
||||
"azureml_pip_packages.extend(['pyyaml', sklearn_dep, pandas_dep])\n",
|
||||
"azureml_pip_packages.extend(['pyyaml', sklearn_dep, pandas_dep, joblib_dep])\n",
|
||||
"run_config.environment.python.conda_dependencies = CondaDependencies.create(\n",
|
||||
" python_version=python_version,\n",
|
||||
" pip_packages=azureml_pip_packages)\n",
|
||||
@@ -440,23 +446,29 @@
|
||||
"available_packages = pkg_resources.working_set\n",
|
||||
"sklearn_ver = None\n",
|
||||
"pandas_ver = None\n",
|
||||
"joblib_ver = None\n",
|
||||
"for dist in available_packages:\n",
|
||||
" if dist.key == 'scikit-learn':\n",
|
||||
" sklearn_ver = dist.version\n",
|
||||
" elif dist.key == 'pandas':\n",
|
||||
" pandas_ver = dist.version\n",
|
||||
" elif dist.key == 'joblib':\n",
|
||||
" joblib_ver = dist.version\n",
|
||||
"sklearn_dep = 'scikit-learn'\n",
|
||||
"pandas_dep = 'pandas'\n",
|
||||
"joblib_dep = 'joblib'\n",
|
||||
"if sklearn_ver:\n",
|
||||
" sklearn_dep = 'scikit-learn=={}'.format(sklearn_ver)\n",
|
||||
"if pandas_ver:\n",
|
||||
" pandas_dep = 'pandas=={}'.format(pandas_ver)\n",
|
||||
"if joblib_ver:\n",
|
||||
" joblib_dep = 'joblib=={}'.format(joblib_ver)\n",
|
||||
"# Specify CondaDependencies obj\n",
|
||||
"# The CondaDependencies specifies the conda and pip packages that are installed in the environment\n",
|
||||
"# the submitted job is run in. Note the remote environment(s) needs to be similar to the local\n",
|
||||
"# environment, otherwise if a model is trained or deployed in a different environment this can\n",
|
||||
"# cause errors. Please take extra care when specifying your dependencies in a production environment.\n",
|
||||
"azureml_pip_packages.extend(['pyyaml', sklearn_dep, pandas_dep])\n",
|
||||
"azureml_pip_packages.extend(['pyyaml', sklearn_dep, pandas_dep, joblib_dep])\n",
|
||||
"myenv = CondaDependencies.create(python_version=python_version, pip_packages=azureml_pip_packages)\n",
|
||||
"\n",
|
||||
"with open(\"myenv.yml\",\"w\") as f:\n",
|
||||
@@ -564,9 +576,9 @@
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -6,14 +6,14 @@ dependencies:
|
||||
- flask
|
||||
- flask-cors
|
||||
- gevent>=1.3.6
|
||||
- jinja2
|
||||
- ipython
|
||||
- matplotlib
|
||||
- azureml-dataset-runtime
|
||||
- azureml-core
|
||||
- ipywidgets
|
||||
- raiwidgets~=0.19.0
|
||||
- raiwidgets~=0.26.0
|
||||
- itsdangerous==2.0.1
|
||||
- markupsafe<2.1.0
|
||||
- scipy>=1.5.3
|
||||
- protobuf==3.20.0
|
||||
- jinja2==3.0.3
|
||||
|
||||
@@ -175,7 +175,7 @@
|
||||
"store_name=os.getenv(\"ADL_STORENAME_62\", \"<my-datastore-name>\") # ADLS account name\n",
|
||||
"tenant_id=os.getenv(\"ADL_TENANT_62\", \"<my-tenant-id>\") # tenant id of service principal\n",
|
||||
"client_id=os.getenv(\"ADL_CLIENTID_62\", \"<my-client-id>\") # client id of service principal\n",
|
||||
"client_secret=os.getenv(\"ADL_CLIENT_SECRET_62\", \"<my-client-secret>\") # the secret of service principal\n",
|
||||
"client_st=os.getenv(\"ADL_CLIENT_SECRET_62\", \"<my-client-secret>\") # the secret of service principal\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" adls_datastore = Datastore.get(ws, datastore_name)\n",
|
||||
@@ -189,7 +189,7 @@
|
||||
" store_name=store_name, # ADLS account name\n",
|
||||
" tenant_id=tenant_id, # tenant id of service principal\n",
|
||||
" client_id=client_id, # client id of service principal\n",
|
||||
" client_secret=client_secret) # the secret of service principal\n",
|
||||
" client_secret=client_st) # the secret of service principal\n",
|
||||
" print(\"Registered datastore with name: %s\" % datastore_name)\n",
|
||||
"\n",
|
||||
"adls_data_ref = DataReference(\n",
|
||||
@@ -579,9 +579,9 @@
|
||||
],
|
||||
"friendly_name": "Azure Machine Learning Pipeline with DataTranferStep",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -632,9 +632,9 @@
|
||||
],
|
||||
"friendly_name": "Getting Started with Azure Machine Learning Pipelines",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -384,9 +384,9 @@
|
||||
],
|
||||
"friendly_name": "Azure Machine Learning Pipeline with AzureBatchStep",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -470,9 +470,9 @@
|
||||
],
|
||||
"friendly_name": "How to use ModuleStep with AML Pipelines",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -261,9 +261,9 @@
|
||||
],
|
||||
"friendly_name": "How to use Pipeline Drafts to create a Published Pipeline",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -292,7 +292,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tf_env = Environment.get(ws, name='AzureML-TensorFlow-2.0-GPU')"
|
||||
"tf_env = Environment.get(ws, name='AzureML-tensorflow-2.6-ubuntu20.04-py38-cuda11-gpu')"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -595,9 +595,9 @@
|
||||
],
|
||||
"friendly_name": "Azure Machine Learning Pipeline with HyperDriveStep",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -443,9 +443,9 @@
|
||||
],
|
||||
"friendly_name": "How to Publish a Pipeline and Invoke the REST endpoint",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -432,7 +432,7 @@
|
||||
"This schedule will run when additions or modifications are made to Blobs in the Datastore.\n",
|
||||
"By default, the Datastore container is monitored for changes. Use the path_on_datastore parameter to instead specify a path on the Datastore to monitor for changes. Note: the path_on_datastore will be under the container for the datastore, so the actual path monitored will be container/path_on_datastore. Changes made to subfolders in the container/path will not trigger the schedule.\n",
|
||||
"Note: Only Blob Datastores are supported.\n",
|
||||
"Note: Not supported for CMK workspaces. Please review these [instructions](https://docs.microsoft.com/azure/machine-learning/how-to-trigger-published-pipeline) in order to setup a blob trigger submission schedule with CMK enabled. Also see those instructions to bring your own LogicApp to avoid the schedule triggers per month limit."
|
||||
"Note: Not supported for CMK workspaces. Please review these [instructions](https://docs.microsoft.com/azure/machine-learning/v1/how-to-trigger-published-pipeline) in order to setup a blob trigger submission schedule with CMK enabled. Also see those instructions to bring your own LogicApp to avoid the schedule triggers per month limit."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -637,9 +637,9 @@
|
||||
],
|
||||
"friendly_name": "How to Setup a Schedule for a Published Pipeline or Pipeline Endpoint",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -581,9 +581,9 @@
|
||||
],
|
||||
"friendly_name": "How to setup a versioned Pipeline Endpoint",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -500,9 +500,9 @@
|
||||
],
|
||||
"friendly_name": "How to use DataPath as a PipelineParameter",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -496,9 +496,9 @@
|
||||
],
|
||||
"friendly_name": "How to use Dataset as a PipelineParameter",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -147,7 +147,7 @@
|
||||
"store_name = os.getenv(\"ADL_STORENAME_62\", \"<my-datastore-name>\") # ADLS account name\n",
|
||||
"tenant_id = os.getenv(\"ADL_TENANT_62\", \"<my-tenant-id>\") # tenant id of service principal\n",
|
||||
"client_id = os.getenv(\"ADL_CLIENTID_62\", \"<my-client-id>\") # client id of service principal\n",
|
||||
"client_secret = os.getenv(\"ADL_CLIENT_62_SECRET\", \"<my-client-secret>\") # the secret of service principal\n",
|
||||
"client_st = os.getenv(\"ADL_CLIENT_62_SECRET\", \"<my-client-secret>\") # the secret of service principal\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" adls_datastore = Datastore.get(ws, datastore_name)\n",
|
||||
@@ -161,7 +161,7 @@
|
||||
" store_name=store_name, # ADLS account name\n",
|
||||
" tenant_id=tenant_id, # tenant id of service principal\n",
|
||||
" client_id=client_id, # client id of service principal\n",
|
||||
" client_secret=client_secret) # the secret of service principal\n",
|
||||
" client_secret=client_st) # the secret of service principal\n",
|
||||
" print(\"registered datastore with name: %s\" % datastore_name)"
|
||||
]
|
||||
},
|
||||
@@ -377,9 +377,9 @@
|
||||
],
|
||||
"friendly_name": "How to use AdlaStep with AML Pipelines",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -20,7 +20,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Using Databricks as a Compute Target from Azure Machine Learning Pipeline\n",
|
||||
"To use Databricks as a compute target from [Azure Machine Learning Pipeline](https://aka.ms/pl-concept), a [DatabricksStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.databricks_step.databricksstep?view=azure-ml-py) is used. This notebook demonstrates the use of DatabricksStep in Azure Machine Learning Pipeline.\n",
|
||||
"To use Databricks as a compute target from [Azure Machine Learning Pipeline](https://aka.ms/pl-concept), a [DatabricksStep](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.databricks_step.databricksstep?view=azure-ml-py) is used. This notebook demonstrates the use of DatabricksStep in Azure Machine Learning Pipeline.\n",
|
||||
"\n",
|
||||
"The notebook will show:\n",
|
||||
"1. Running an arbitrary Databricks notebook that the customer has in Databricks workspace\n",
|
||||
@@ -180,10 +180,9 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Data Connections with Inputs and Outputs\n",
|
||||
"The DatabricksStep supports DBFS, Azure Blob and ADLS for inputs and outputs. You also will need to define a [Secrets](https://docs.azuredatabricks.net/user-guide/secrets/index.html) scope to enable authentication to external data sources such as Blob and ADLS from Databricks.\n",
|
||||
"The DatabricksStep supports DBFS, Azure Blob and ADLS for inputs and outputs. You also will need to define a [Secrets](https://docs.microsoft.com/azure/databricks/security/access-control/secret-acl) scope to enable authentication to external data sources such as Blob and ADLS from Databricks.\n",
|
||||
"\n",
|
||||
"- Databricks documentation on [Azure Blob](https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-storage.html)\n",
|
||||
"- Databricks documentation on [ADLS](https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake.html)\n",
|
||||
"- Databricks documentation on [Azure Storage](https://docs.microsoft.com/azure/databricks/data/data-sources/azure/azure-storage)\n",
|
||||
"\n",
|
||||
"### Type of Data Access\n",
|
||||
"Databricks allows to interact with Azure Blob and ADLS in two ways.\n",
|
||||
@@ -331,7 +330,7 @@
|
||||
"- **inputs:** List of input connections for data consumed by this step. Fetch this inside the notebook using dbutils.widgets.get(\"input\")\n",
|
||||
"- **outputs:** List of output port definitions for outputs produced by this step. Fetch this inside the notebook using dbutils.widgets.get(\"output\")\n",
|
||||
"- **existing_cluster_id:** Cluster ID of an existing Interactive cluster on the Databricks workspace. If you are providing this, do not provide any of the parameters below that are used to create a new cluster such as spark_version, node_type, etc.\n",
|
||||
"- **spark_version:** Version of spark for the databricks run cluster. default value: 4.0.x-scala2.11\n",
|
||||
"- **spark_version:** Version of spark for the databricks run cluster. You can refer to [DataBricks runtime version](https://learn.microsoft.com/azure/databricks/dev-tools/api/#--runtime-version-strings) to specify the spark version. default value: 10.4.x-scala2.12\n",
|
||||
"- **node_type:** Azure vm node types for the databricks run cluster. default value: Standard_D3_v2\n",
|
||||
"- **num_workers:** Specifies a static number of workers for the databricks run cluster\n",
|
||||
"- **min_workers:** Specifies a min number of workers to use for auto-scaling the databricks run cluster\n",
|
||||
@@ -415,7 +414,7 @@
|
||||
"### 1. Running the demo notebook already added to the Databricks workspace\n",
|
||||
"Create a notebook in the Azure Databricks workspace, and provide the path to that notebook as the value associated with the environment variable \"DATABRICKS_NOTEBOOK_PATH\". This will then set the variable\u00c2\u00a0notebook_path\u00c2\u00a0when you run the code cell below:\n",
|
||||
"\n",
|
||||
"your notebook's path in Azure Databricks UI by hovering over to notebook's title. A typical path of notebook looks like this `/Users/example@databricks.com/example`. See [Databricks Workspace](https://docs.azuredatabricks.net/user-guide/workspace.html) to learn about the folder structure.\n",
|
||||
"your notebook's path in Azure Databricks UI by hovering over to notebook's title. A typical path of notebook looks like this `/Users/example@databricks.com/example`. See [Databricks Workspace](https://docs.microsoft.com/azure/databricks/workspace) to learn about the folder structure.\n",
|
||||
"\n",
|
||||
"Note: DataPath `PipelineParameter` should be provided in list of inputs. Such parameters can be accessed by the datapath `name`."
|
||||
]
|
||||
@@ -487,7 +486,7 @@
|
||||
"### 2. Running a Python script from DBFS\n",
|
||||
"This shows how to run a Python script in DBFS. \n",
|
||||
"\n",
|
||||
"To complete this, you will need to first upload the Python script in your local machine to DBFS using the [CLI](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html). The CLI command is given below:\n",
|
||||
"To complete this, you will need to first upload the Python script in your local machine to DBFS using the [CLI](https://docs.microsoft.com/azure/databricks/dbfs). The CLI command is given below:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"dbfs cp ./train-db-dbfs.py dbfs:/train-db-dbfs.py\n",
|
||||
@@ -630,7 +629,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 4. Running a JAR job that is alreay added in DBFS\n",
|
||||
"To run a JAR job that is already uploaded to DBFS, follow the instructions below. You will first upload the JAR file to DBFS using the [CLI](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html).\n",
|
||||
"To run a JAR job that is already uploaded to DBFS, follow the instructions below. You will first upload the JAR file to DBFS using the [CLI](https://docs.microsoft.com/azure/databricks/dbfs).\n",
|
||||
"\n",
|
||||
"The commented out code in the below cell assumes that you have uploaded `train-db-dbfs.jar` to the root folder in DBFS. You can upload `train-db-dbfs.jar` to the root folder in DBFS using this commandline so you can use `jar_library_dbfs_path = \"dbfs:/train-db-dbfs.jar\"`:\n",
|
||||
"\n",
|
||||
@@ -704,7 +703,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 5. Running demo notebook already added to the Databricks workspace using existing cluster\n",
|
||||
"First you need register DBFS datastore and make sure path_on_datastore does exist in databricks file system, you can browser the files by refering [this](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html).\n",
|
||||
"First you need register DBFS datastore and make sure path_on_datastore does exist in databricks file system, you can browser the files by refering [this](https://docs.microsoft.com/azure/databricks/dbfs).\n",
|
||||
"\n",
|
||||
"Find existing_cluster_id by opeing Azure Databricks UI with Clusters page and in url you will find a string connected with '-' right after \"clusters/\"."
|
||||
]
|
||||
@@ -941,9 +940,9 @@
|
||||
],
|
||||
"friendly_name": "How to use DatabricksStep with AML Pipelines",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -244,9 +244,9 @@
|
||||
],
|
||||
"friendly_name": "How to use KustoStep with AML Pipelines",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -498,9 +498,9 @@
|
||||
],
|
||||
"friendly_name": "How to use AutoMLStep with AML Pipelines",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -315,9 +315,9 @@
|
||||
],
|
||||
"friendly_name": "Azure Machine Learning Pipeline with CommandStep for R",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -278,9 +278,9 @@
|
||||
],
|
||||
"friendly_name": "Azure Machine Learning Pipeline with CommandStep",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
@@ -252,7 +252,7 @@
|
||||
"# is_directory=None)\n",
|
||||
"\n",
|
||||
"# Naming the intermediate data as processed_data1 and assigning it to the variable processed_data1.\n",
|
||||
"processed_data1 = PipelineData(\"processed_data1\",datastore=def_blob_store)\n",
|
||||
"processed_data1 = PipelineData(\"processed_data1\",datastore=def_blob_store, is_directory=True)\n",
|
||||
"print(\"PipelineData object created\")"
|
||||
]
|
||||
},
|
||||
@@ -347,7 +347,7 @@
|
||||
"source": [
|
||||
"# step5 to use the intermediate data produced by step4\n",
|
||||
"# This step also produces an output processed_data2\n",
|
||||
"processed_data2 = PipelineData(\"processed_data2\", datastore=def_blob_store)\n",
|
||||
"processed_data2 = PipelineData(\"processed_data2\", datastore=def_blob_store, is_directory=True)\n",
|
||||
"source_directory = \"data_dependency_run_extract\"\n",
|
||||
"\n",
|
||||
"extractStep = PythonScriptStep(\n",
|
||||
@@ -394,7 +394,7 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Now define the compare step which takes two inputs and produces an output\n",
|
||||
"processed_data3 = PipelineData(\"processed_data3\", datastore=def_blob_store)\n",
|
||||
"processed_data3 = PipelineData(\"processed_data3\", datastore=def_blob_store, is_directory=True)\n",
|
||||
"source_directory = \"data_dependency_run_compare\"\n",
|
||||
"\n",
|
||||
"compareStep = PythonScriptStep(\n",
|
||||
@@ -545,9 +545,9 @@
|
||||
],
|
||||
"friendly_name": "Azure Machine Learning Pipelines with Data Dependency",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"display_name": "Python 3.8 - AzureML",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
"name": "python38-azureml"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
||||
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user