Compare commits

...

80 Commits

Author SHA1 Message Date
jeff-shepherd
f1aff553c4 Merge pull request #1980 from Man-MSFT/mafong/fairness-dep
Remove fairness notebooks
2025-03-14 09:42:02 -07:00
Man Fong
d195a673e2 Remove fairness notebooks 2025-03-13 14:25:59 -07:00
jeff-shepherd
8dce0fa6fe Merge pull request #1977 from Azure/jeffshep/windowsonnx
Pin onnx on Windows
2024-12-16 08:44:42 -08:00
Jeff Shepherd
4e8a240a71 Pin onnx on Windows 2024-12-13 15:51:10 -08:00
jeff-shepherd
5b019e28de Merge pull request #1976 from Azure/release_update_stablev2/Release-247
update samples from Release-247 as a part of 1.59.0 SDK stable release
2024-12-13 08:50:52 -08:00
amlrelsa-ms
bf4cb1e86c update samples from Release-247 as a part of 1.59.0 SDK stable release 2024-12-10 17:34:41 +00:00
jeff-shepherd
eaa7c56590 Merge pull request #1974 from Azure/jeffshep/post158sync
Remove deprecated sample notebooks
2024-11-04 09:20:56 -08:00
Jeff Shepherd
8fc0fa040d Remove deprecated sample notebooks 2024-11-01 11:49:20 -07:00
jeff-shepherd
56e13b0b9a Merge pull request #1972 from Azure/release_update_stablev2/Release-243
update samples from Release-243 as a part of 1.58.0 SDK stable release
2024-10-21 09:03:36 -07:00
amlrelsa-ms
785fe3c962 update samples from Release-243 as a part of 1.58.0 SDK stable release 2024-10-16 17:50:12 +00:00
jeff-shepherd
3c341f6e9a Merge pull request #1968 from Azure/release_update_stablev2/Release-240
update samples from Release-240 as a part of 1.57.0 SDK stable release
2024-08-08 08:36:05 -07:00
amlrelsa-ms
aae88e87ea update samples from Release-240 as a part of 1.57.0 SDK stable release 2024-08-05 21:57:46 +00:00
jeff-shepherd
2352e458c7 Merge pull request #1963 from Azure/release_update_stablev2/Release-209
update samples from Release-209 as a part of 1.56.0 SDK stable release
2024-05-16 09:15:57 -07:00
amlrelsa-ms
8373b93887 update samples from Release-209 as a part of 1.56.0 SDK stable release 2024-04-29 18:42:13 +00:00
jeff-shepherd
f0442166cd Updated curated environments in sample notebooks (#1958)
* Updated curated environments in sample notebooks

* Fixed continuous retraining notebook
2024-02-15 13:01:44 -05:00
jeff-shepherd
33ca8c7933 Merge pull request #1957 from Azure/release_update_stablev2/Release-207
update samples from Release-207 as a part of 1.55.0 SDK stable release
2024-02-07 08:48:02 -08:00
amlrelsa-ms
3fd1ce8993 update samples from Release-207 as a part of 1.55.0 SDK stable release 2024-02-06 19:58:35 +00:00
jeff-shepherd
aa93588190 Merge pull request #1954 from Azure/jeffshep/pinpy38
Temporarily pin back to Python 3.8
2023-12-07 11:03:20 -08:00
Jeff Shepherd
12520400e5 Temporarily pin back to Python 3.8 2023-12-06 13:24:28 -08:00
jeff-shepherd
35614e83fa Merge pull request #1951 from Azure/release_update_stablev2/Release-200
update samples from Release-200 as a part of 1.54.0 SDK stable release
2023-11-22 18:24:05 -08:00
amlrelsa-ms
ff22ac01cc update samples from Release-200 as a part of 1.54.0 SDK stable release 2023-11-21 17:51:12 +00:00
jeff-shepherd
e7dd826f34 Merge pull request #1946 from Azure/jeffshep/pinscikit-learn
Pin scikit-learn to avoid conflict with azureml-responsibleai
2023-10-23 14:57:13 -07:00
Jeff Shepherd
fcc882174b Pin scikit-learn to avoid conflict with azureml-responsibleai 2023-10-23 09:53:39 -07:00
jeff-shepherd
6872d8a3bb Merge pull request #1941 from Azure/jeffshep/updatefor1.53.2
Updated automl_env.yml for Azure ML SDK 1.53.2
2023-10-10 08:49:04 -07:00
Jeff Shepherd
a2cb4c3589 Updated fbprophet to prophet 2023-10-10 08:47:09 -07:00
Jeff Shepherd
15008962b2 Updated automl_env.yml for Azure ML SDK 1.53.2 2023-10-05 19:29:26 -07:00
jeff-shepherd
9414b51fac Merge pull request #1937 from Azure/jeffshep/fixwindows153
Fixed Windows automl_setup for 1.53.0
2023-08-31 21:56:12 -07:00
Jeff Shepherd
80ac414582 Fixed Windows automl_setup for 1.53.0 2023-08-31 16:54:20 -07:00
jeff-shepherd
cbc151660b Merge pull request #1936 from Azure/jeffshep/fixtabulardataset
Fixed tabular-dataset-partition-per-column.ipynb
2023-08-25 15:34:08 -07:00
Jeff Shepherd
0024abc6e3 Fixed tabular-dataset-partition-per-column.ipynb and removed deploy-to-cloud/model-register-and-deploy.ipynb 2023-08-25 13:52:29 -07:00
jeff-shepherd
fa13385860 Merge pull request #1935 from Azure/release_update_stablev2/Release-193
update samples from Release-193 as a part of 1.53.0 SDK stable release
2023-08-23 11:41:24 -07:00
Jeff Shepherd
0c5f6daf52 Fixed readme syntax 2023-08-23 11:37:30 -07:00
Jeff Shepherd
c11e9fc1da Fixed readme syntax 2023-08-23 11:36:17 -07:00
Jeff Shepherd
280150713e Restored V2 message 2023-08-23 10:20:25 -07:00
amlrelsa-ms
bb11c80b1b update samples from Release-193 as a part of 1.53.0 SDK stable release 2023-08-23 03:24:03 +00:00
Diondra Peck
d0961b98bf Add disclaimer to README 2023-06-28 15:47:49 -07:00
Paul Shealy
302589b7f9 Merge pull request #1915 from Azure/release_update_stablev2/Release-171
Release update stablev2/release 171 for SDK 1.51.0
2023-06-07 19:19:33 -07:00
amlrelsa-ms
cc85949d6d update samples from Release-171 as a part of 1.51 SDK stable release 2023-06-06 21:58:24 +05:30
amlrelsa-sa
3a1824e3ad update samples from Release-170 as a part of 1.51 SDK stable release 2023-06-06 10:50:33 +05:30
Paul Shealy
579643326d Merge pull request #1911 from diondrapeck/add-deprecation-disclaimer
Add repository deprecation disclaimer and pointer to v2 repo
2023-05-25 08:04:29 -07:00
Diondra Peck
14f76f227e Add deprecation disclaimer 2023-05-23 12:48:14 -07:00
Paul Shealy
25baf5203a Merge pull request #1899 from Azure/release_update/Release-177
update samples from Release-177 as a part of  SDK release
2023-04-17 13:01:27 -07:00
amlrelsa-ms
1178fcb0ba update samples from Release-177 as a part of SDK release 2023-04-17 10:22:59 +00:00
Sasidhar Kasturi
e4d84c8e45 update samples from Release-169 as a part of 1.50.0 SDK stable release (#1898)
Co-authored-by: amlrelsa-ms <amlrelsa@microsoft.com>
2023-04-14 10:39:38 -04:00
Harneet Virk
7a3ab1e44c Merge pull request #1895 from Azure/release_update/Release-175
update samples from Release-175 as a part of  SDK release
2023-03-28 10:17:27 -07:00
amlrelsa-ms
598a293dfa update samples from Release-175 as a part of SDK release 2023-03-28 01:02:26 +00:00
Harneet Virk
40b3068462 Merge pull request #1884 from Azure/release_update_stablev2/Release-166
update samples from Release-166 as a part of 1.49.0 SDK stable release
2023-02-13 21:22:05 -08:00
amlrelsa-ms
0ecbbbce75 update samples from Release-166 as a part of 1.49.0 SDK stable release 2023-02-14 02:46:24 +00:00
Harneet Virk
9b1e130d18 Merge pull request #1867 from Azure/release_update/Release-173
update samples from Release-173 as a part of  SDK release
2022-12-19 19:37:41 -08:00
amlrelsa-ms
0e17b33d2a update samples from Release-173 as a part of SDK release 2022-12-20 03:35:58 +00:00
Harneet Virk
34d80abd26 Merge pull request #1864 from Azure/release_update/Release-172
update samples from Release-172 as a part of  SDK release
2022-12-16 09:28:16 -08:00
amlrelsa-ms
249278ab77 update samples from Release-172 as a part of SDK release 2022-12-15 17:32:05 +00:00
Harneet Virk
25fdb17f80 Merge pull request #1862 from Azure/release_update/Release-170
update samples from Release-170 as a part of  SDK release
2022-12-06 10:06:06 -08:00
amlrelsa-ms
3a02a27f1e update samples from Release-170 as a part of SDK release 2022-12-06 03:22:18 +00:00
Harneet Virk
4eed9d529f Merge pull request #1861 from Azure/release_update/Release-169
update samples from Release-169 as a part of  SDK release
2022-12-05 12:33:52 -08:00
amlrelsa-ms
f344d410a2 update samples from Release-169 as a part of SDK release 2022-12-05 20:12:47 +00:00
Harneet Virk
9dc1228063 Merge pull request #1860 from Azure/release_update/Release-168
update samples from Release-168 as a part of  SDK release
2022-12-05 09:54:01 -08:00
amlrelsa-ms
4404e62f58 update samples from Release-168 as a part of SDK release 2022-12-05 17:52:07 +00:00
Harneet Virk
38d5743bbb Merge pull request #1852 from Azure/release_update/Release-167
update samples from Release-167 as a part of  SDK release
2022-11-08 11:01:10 -08:00
amlrelsa-ms
0814eee151 update samples from Release-167 as a part of SDK release 2022-11-08 01:17:48 +00:00
Harneet Virk
f45b815221 Merge pull request #1848 from Azure/release_update/Release-166
update samples from Release-166 as a part of  SDK release
2022-10-26 12:04:10 -07:00
amlrelsa-ms
bd629ae454 update samples from Release-166 as a part of SDK release 2022-10-26 18:46:34 +00:00
Harneet Virk
41de75a584 Merge pull request #1846 from Azure/release_update_stablev2/Release-156
update samples from Release-156 as a part of 1.47.0 SDK stable release
2022-10-25 21:01:03 -07:00
amlrelsa-ms
96a426dc36 update samples from Release-156 as a part of 1.47.0 SDK stable release 2022-10-25 21:28:24 +00:00
Harneet Virk
824dd40f7e Merge pull request #1836 from Azure/release_update/Release-165
update samples from Release-165 as a part of  SDK release
2022-10-11 13:07:26 -07:00
amlrelsa-ms
fa2e649fe8 update samples from Release-165 as a part of SDK release 2022-10-11 19:33:50 +00:00
Harneet Virk
e25e8e3a41 Merge pull request #1832 from Azure/release_update/Release-164
update samples from Release-164 as a part of  SDK release
2022-10-05 11:29:47 -07:00
amlrelsa-ms
aa3670a902 update samples from Release-164 as a part of SDK release 2022-10-05 17:31:10 +00:00
Harneet Virk
ef1f9205ac Merge pull request #1831 from Azure/release_update_stablev2/Release-153
update samples from Release-153 as a part of 1.46.0 SDK stable release
2022-10-04 15:04:25 -07:00
amlrelsa-ms
3228bbfc63 update samples from Release-153 as a part of 1.46.0 SDK stable release 2022-09-30 17:30:23 +00:00
Harneet Virk
f18a0dfc4d Merge pull request #1825 from Azure/release_update/Release-163
update samples from Release-163 as a part of  SDK release
2022-09-20 14:12:22 -07:00
amlrelsa-ms
badb620261 update samples from Release-163 as a part of SDK release 2022-09-20 21:11:25 +00:00
Harneet Virk
acf46100ae Merge pull request #1817 from Azure/release_update/Release-161
update samples from Release-161 as a part of  SDK release
2022-09-16 15:54:11 -07:00
amlrelsa-ms
cf2e3804d5 update samples from Release-161 as a part of SDK release 2022-09-16 20:16:37 +00:00
Harneet Virk
b7be42357f Merge pull request #1814 from Azure/release_update/Release-160
update samples from Release-160 as a part of  SDK release
2022-09-12 18:57:44 -07:00
amlrelsa-ms
3ac82c07ae update samples from Release-160 as a part of SDK release 2022-09-13 01:24:40 +00:00
Harneet Virk
9743c0a1fa Merge pull request #1755 from Azure/users/GitHubPolicyService/11f57c70-4141-4c68-9224-aceb8eab1c48
Adding Microsoft SECURITY.MD
2022-09-06 16:52:36 -07:00
Harneet Virk
ba4dac530e Merge pull request #1808 from Azure/release_update/Release-157
update samples from Release-157 as a part of  SDK release
2022-09-06 16:33:03 -07:00
amlrelsa-ms
7f7f0040fd update samples from Release-157 as a part of SDK release 2022-09-06 23:16:24 +00:00
microsoft-github-policy-service[bot]
e0c9376aab Microsoft mandatory file 2022-05-25 17:12:16 +00:00
459 changed files with 2142 additions and 25311 deletions


@@ -1,6 +1,6 @@
# Azure Machine Learning Python SDK notebooks
> a community-driven repository of examples using mlflow for tracking can be found at https://github.com/Azure/azureml-examples
### **With the introduction of AzureML SDK v2, this samples repository for the v1 SDK is now deprecated and will not be monitored or updated. Users are encouraged to visit the [v2 SDK samples repository](https://github.com/Azure/azureml-examples) instead for up-to-date and enhanced examples of how to build, train, and deploy machine learning models with AzureML's newest features.**
Welcome to the Azure Machine Learning Python SDK notebooks repository!

41
SECURITY.md Normal file

@@ -0,0 +1,41 @@
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.7 BLOCK -->
## Security
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/opensource/security/definition), please report it to us as described below.
## Reporting Security Issues
**Please do not report security vulnerabilities through public GitHub issues.**
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/opensource/security/create-report).
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/opensource/security/pgpkey).
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://aka.ms/opensource/security/msrc).
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/opensource/security/bounty) page for more details about our active programs.
## Preferred Languages
We prefer all communications to be in English.
## Policy
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/opensource/security/cvd).
<!-- END MICROSOFT SECURITY.MD BLOCK -->


@@ -103,7 +103,7 @@
"source": [
"import azureml.core\n",
"\n",
"print(\"This notebook was created using version 1.44.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.59.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
@@ -329,7 +329,7 @@
" print(\"Creating new gpu-cluster\")\n",
" \n",
" # Specify the configuration for the new cluster\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_NC6\",\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"Standard_NC6s_v3\",\n",
" min_nodes=0,\n",
" max_nodes=4)\n",
" # Create the cluster with the specified name and configuration\n",
@@ -367,9 +367,9 @@
}
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {


@@ -174,7 +174,7 @@
"else:\n",
" print(\"creating new cluster\")\n",
" # vm_size parameter below could be modified to one of the RAPIDS-supported VM types\n",
" provisioning_config = AmlCompute.provisioning_configuration(vm_size = \"Standard_NC6s_v2\", min_nodes=1, max_nodes = 1)\n",
" provisioning_config = AmlCompute.provisioning_configuration(vm_size = \"Standard_NC6s_v3\", min_nodes=1, max_nodes = 1)\n",
"\n",
" # create the cluster\n",
" gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, provisioning_config)\n",
@@ -398,7 +398,7 @@
"# run_config.target = gpu_cluster_name\n",
"# run_config.environment.docker.enabled = True\n",
"# run_config.environment.docker.gpu_support = True\n",
"# run_config.environment.docker.base_image = \"rapidsai/rapidsai:cuda9.2-runtime-ubuntu18.04\"\n",
"# run_config.environment.docker.base_image = \"rapidsai/rapidsai:cuda9.2-runtime-ubuntu20.04\"\n",
"# # run_config.environment.docker.base_image_registry.address = '<registry_url>' # not required if the base_image is in Docker hub\n",
"# # run_config.environment.docker.base_image_registry.username = '<user_name>' # needed only for private images\n",
"# # run_config.environment.docker.base_image_registry.password = '<password>' # needed only for private images\n",
@@ -525,9 +525,9 @@
}
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {


@@ -1,621 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/contrib/fairness/fairlearn-azureml-mitigation.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Unfairness Mitigation with Fairlearn and Azure Machine Learning\n",
"**This notebook shows how to upload results from Fairlearn's GridSearch mitigation algorithm into a dashboard in Azure Machine Learning Studio**\n",
"\n",
"## Table of Contents\n",
"\n",
"1. [Introduction](#Introduction)\n",
"1. [Loading the Data](#LoadingData)\n",
"1. [Training an Unmitigated Model](#UnmitigatedModel)\n",
"1. [Mitigation with GridSearch](#Mitigation)\n",
"1. [Uploading a Fairness Dashboard to Azure](#AzureUpload)\n",
" 1. Registering models\n",
" 1. Computing Fairness Metrics\n",
" 1. Uploading to Azure\n",
"1. [Conclusion](#Conclusion)\n",
"\n",
"<a id=\"Introduction\"></a>\n",
"## Introduction\n",
"This notebook shows how to use [Fairlearn (an open source fairness assessment and unfairness mitigation package)](http://fairlearn.org) and Azure Machine Learning Studio for a binary classification problem. This example uses the well-known adult census dataset. For the purposes of this notebook, we shall treat this as a loan decision problem. We will pretend that the label indicates whether or not each individual repaid a loan in the past. We will use the data to train a predictor to predict whether previously unseen individuals will repay a loan or not. The assumption is that the model predictions are used to decide whether an individual should be offered a loan. Its purpose is purely illustrative of a workflow including a fairness dashboard - in particular, we do **not** include a full discussion of the detailed issues which arise when considering fairness in machine learning. For such discussions, please [refer to the Fairlearn website](http://fairlearn.org/).\n",
"\n",
"We will apply the [grid search algorithm](https://fairlearn.org/v0.4.6/api_reference/fairlearn.reductions.html#fairlearn.reductions.GridSearch) from the Fairlearn package using a specific notion of fairness called Demographic Parity. This produces a set of models, and we will view these in a dashboard both locally and in the Azure Machine Learning Studio.\n",
"\n",
"### Setup\n",
"\n",
"To use this notebook, an Azure Machine Learning workspace is required.\n",
"Please see the [configuration notebook](../../configuration.ipynb) for information about creating one, if required.\n",
"This notebook also requires the following packages:\n",
"* `azureml-contrib-fairness`\n",
"* `fairlearn>=0.6.2` (pre-v0.5.0 will work with minor modifications)\n",
"* `joblib`\n",
"* `liac-arff`\n",
"* `raiwidgets`\n",
"\n",
"Fairlearn relies on features introduced in v0.22.1 of `scikit-learn`. If you have an older version already installed, please uncomment and run the following cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# !pip install --upgrade \"scikit-learn>=0.22.1\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, please ensure that when you downloaded this notebook, you also downloaded the `fairness_nb_utils.py` file from the same location, and placed it in the same directory as this notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"LoadingData\"></a>\n",
"## Loading the Data\n",
"We use the well-known `adult` census dataset, which we will fetch from the OpenML website. We start with a fairly unremarkable set of imports:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fairlearn.reductions import GridSearch, DemographicParity, ErrorRate\n",
"from raiwidgets import FairnessDashboard\n",
"\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
"from sklearn.compose import make_column_selector as selector\n",
"from sklearn.pipeline import Pipeline\n",
"\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now load and inspect the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fairness_nb_utils import fetch_census_dataset\n",
"\n",
"data = fetch_census_dataset()\n",
" \n",
"# Extract the items we want\n",
"X_raw = data.data\n",
"y = (data.target == '>50K') * 1\n",
"\n",
"X_raw[\"race\"].value_counts().to_dict()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to treat the sex and race of each individual as protected attributes, and in this particular case we are going to remove these attributes from the main data (this is not always the best option - see the [Fairlearn website](http://fairlearn.github.io/) for further discussion). Protected attributes are often denoted by 'A' in the literature, and we follow that convention here:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"A = X_raw[['sex','race']]\n",
"X_raw = X_raw.drop(labels=['sex', 'race'], axis = 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now preprocess our data. To avoid the problem of data leakage, we split our data into training and test sets before performing any other transformations. Subsequent transformations (such as scalings) will be fit to the training data set, and then applied to the test dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"(X_train, X_test, y_train, y_test, A_train, A_test) = train_test_split(\n",
" X_raw, y, A, test_size=0.3, random_state=12345, stratify=y\n",
")\n",
"\n",
"# Ensure indices are aligned between X, y and A,\n",
"# after all the slicing and splitting of DataFrames\n",
"# and Series\n",
"\n",
"X_train = X_train.reset_index(drop=True)\n",
"X_test = X_test.reset_index(drop=True)\n",
"y_train = y_train.reset_index(drop=True)\n",
"y_test = y_test.reset_index(drop=True)\n",
"A_train = A_train.reset_index(drop=True)\n",
"A_test = A_test.reset_index(drop=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have two types of column in the dataset - categorical columns which will need to be one-hot encoded, and numeric ones which will need to be rescaled. We also need to take care of missing values. We use a simple approach here, but please bear in mind that this is another way that bias could be introduced (especially if one subgroup tends to have more missing values).\n",
"\n",
"For this preprocessing, we make use of `Pipeline` objects from `sklearn`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"numeric_transformer = Pipeline(\n",
" steps=[\n",
" (\"impute\", SimpleImputer()),\n",
" (\"scaler\", StandardScaler()),\n",
" ]\n",
")\n",
"\n",
"categorical_transformer = Pipeline(\n",
" [\n",
" (\"impute\", SimpleImputer(strategy=\"most_frequent\")),\n",
" (\"ohe\", OneHotEncoder(handle_unknown=\"ignore\", sparse=False)),\n",
" ]\n",
")\n",
"\n",
"preprocessor = ColumnTransformer(\n",
" transformers=[\n",
" (\"num\", numeric_transformer, selector(dtype_exclude=\"category\")),\n",
" (\"cat\", categorical_transformer, selector(dtype_include=\"category\")),\n",
" ]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the preprocessing pipeline is defined, we can fit it on our training data and apply the fitted transform to our test data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train = preprocessor.fit_transform(X_train)\n",
"X_test = preprocessor.transform(X_test)"
]
},
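The fit-on-train, apply-to-test pattern in the cell above can be illustrated without scikit-learn. The following is a minimal sketch, assuming a single numeric column with made-up values; `fit_scaler` and `apply_scaler` are hypothetical helpers, not part of the notebook or of any library. Like `StandardScaler`, the sketch uses the population standard deviation, and the statistics are learned from the training values only.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Learn the scaling parameters from the training split only."""
    return mean(train_values), pstdev(train_values)


def apply_scaler(values, params):
    """Apply previously learned parameters to any split."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Made-up values for illustration
train = [1.0, 2.0, 3.0]
test = [2.0, 4.0]

params = fit_scaler(train)                  # statistics come from train only
train_scaled = apply_scaler(train, params)
test_scaled = apply_scaler(test, params)    # test never influences mu or sigma
```

Because the test values never enter `fit_scaler`, nothing about the test split leaks into the transformation, which is the point of splitting before preprocessing.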
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"UnmitigatedModel\"></a>\n",
"## Training an Unmitigated Model\n",
"\n",
"So that we have a point of comparison, we first train a model (specifically, logistic regression from scikit-learn) on the preprocessed data, without applying any mitigation algorithm:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"unmitigated_predictor = LogisticRegression(solver='liblinear', fit_intercept=True)\n",
"\n",
"unmitigated_predictor.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can view this model in the fairness dashboard, and see the disparities which appear:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"FairnessDashboard(sensitive_features=A_test,\n",
" y_true=y_test,\n",
" y_pred={\"unmitigated\": unmitigated_predictor.predict(X_test)})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the disparity in accuracy when we select 'Sex' as the sensitive feature, we see that males have an error rate about three times greater than females. More interesting is the disparity in opportunity - males are offered loans at three times the rate of females.\n",
"\n",
"Despite the fact that we removed the feature from the training data, our predictor still discriminates based on sex. This demonstrates that simply ignoring a protected attribute when fitting a predictor rarely eliminates unfairness. There will generally be enough other features correlated with the removed attribute to lead to disparate impact."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"Mitigation\"></a>\n",
"## Mitigation with GridSearch\n",
"\n",
"The `GridSearch` class in `Fairlearn` implements a simplified version of the exponentiated gradient reduction of [Agarwal et al. 2018](https://arxiv.org/abs/1803.02453). The user supplies a standard ML estimator, which is treated as a blackbox - for this simple example, we shall use the logistic regression estimator from scikit-learn. `GridSearch` works by generating a sequence of relabellings and reweightings, and trains a predictor for each.\n",
"\n",
"For this example, we specify demographic parity (on the protected attribute of sex) as the fairness metric. Demographic parity requires that individuals are offered the opportunity (a loan in this example) independent of membership in the protected class (i.e., females and males should be offered loans at the same rate). *We are using this metric for the sake of simplicity* in this example; the appropriate fairness metric can only be selected after *careful examination of the broader context* in which the model is to be used."
]
},
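Demographic parity, as described in the cell above, compares the rate at which each group receives the positive outcome. The following is a minimal sketch of that comparison using plain Python and made-up predictions; `selection_rates` and `selection_rate_gap` are hypothetical helper names chosen for illustration, not Fairlearn functions.

```python
from collections import defaultdict


def selection_rates(y_pred, groups):
    """Fraction of positive (1) predictions within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def selection_rate_gap(y_pred, groups):
    """Largest difference in selection rate between any two groups."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Made-up predictions: 1 means "offer a loan"
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
groups = ["Male"] * 4 + ["Female"] * 4

rates = selection_rates(y_pred, groups)
gap = selection_rate_gap(y_pred, groups)
```

A gap of zero would mean both groups are offered loans at the same rate, which is what the demographic parity constraint asks for.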
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sweep = GridSearch(LogisticRegression(solver='liblinear', fit_intercept=True),\n",
" constraints=DemographicParity(),\n",
" grid_size=71)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With our estimator created, we can fit it to the data. After `fit()` completes, we extract the full set of predictors from the `GridSearch` object.\n",
"\n",
"The following cell trains many copies of the underlying estimator, and may take a minute or two to run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sweep.fit(X_train, y_train,\n",
" sensitive_features=A_train.sex)\n",
"\n",
"# For Fairlearn pre-v0.5.0, need sweep._predictors\n",
"predictors = sweep.predictors_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could load these predictors into the Fairness dashboard now. However, the plot would be somewhat confusing due to their number. In this case, we are going to remove the predictors which are dominated in the error-disparity space by others from the sweep (note that the disparity will only be calculated for the protected attribute; other potentially protected attributes will *not* be mitigated). In general, one might not want to do this, since there may be other considerations beyond the strict optimisation of error and disparity (of the given protected attribute)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"errors, disparities = [], []\n",
"for predictor in predictors:\n",
" error = ErrorRate()\n",
" error.load_data(X_train, pd.Series(y_train), sensitive_features=A_train.sex)\n",
" disparity = DemographicParity()\n",
" disparity.load_data(X_train, pd.Series(y_train), sensitive_features=A_train.sex)\n",
" \n",
" errors.append(error.gamma(predictor.predict)[0])\n",
" disparities.append(disparity.gamma(predictor.predict).max())\n",
" \n",
"all_results = pd.DataFrame( {\"predictor\": predictors, \"error\": errors, \"disparity\": disparities})\n",
"\n",
"dominant_models_dict = dict()\n",
"base_name_format = \"census_gs_model_{0}\"\n",
"row_id = 0\n",
"for row in all_results.itertuples():\n",
" model_name = base_name_format.format(row_id)\n",
" errors_for_lower_or_eq_disparity = all_results[\"error\"][all_results[\"disparity\"]<=row.disparity]\n",
" if row.error <= errors_for_lower_or_eq_disparity.min():\n",
" dominant_models_dict[model_name] = row.predictor\n",
" row_id = row_id + 1"
]
},
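The filtering rule in the cell above keeps a model only when no model with equal or lower disparity has a strictly lower error. The following is a minimal sketch of the same rule on made-up numbers; `non_dominated` and the model names are hypothetical and used only for illustration.

```python
def non_dominated(points):
    """Keep entries not beaten on error by any entry with <= disparity.

    points maps a model name to an (error, disparity) pair.
    """
    kept = {}
    for name, (error, disparity) in points.items():
        # Lowest error among all models that are at least as fair
        best_error = min(e for e, d in points.values() if d <= disparity)
        if error <= best_error:
            kept[name] = (error, disparity)
    return kept


# Made-up (error, disparity) pairs
candidates = {
    "a": (0.10, 0.30),  # lowest error overall
    "b": (0.12, 0.10),
    "c": (0.15, 0.20),  # worse than "b" on both error and disparity
    "d": (0.20, 0.05),  # lowest disparity overall
}

kept = non_dominated(candidates)
```

Here "c" is dropped because "b" has both lower error and lower disparity, while "a", "b" and "d" each represent a different trade-off and are kept.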
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can construct predictions for the dominant models (we include the unmitigated predictor as well, for comparison):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predictions_dominant = {\"census_unmitigated\": unmitigated_predictor.predict(X_test)}\n",
"models_dominant = {\"census_unmitigated\": unmitigated_predictor}\n",
"for name, predictor in dominant_models_dict.items():\n",
" value = predictor.predict(X_test)\n",
" predictions_dominant[name] = value\n",
" models_dominant[name] = predictor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These predictions may then be viewed in the fairness dashboard. We include the race column from the dataset, as an alternative basis for assessing the models. However, since we have not based our mitigation on it, the variation in the models with respect to race can be large."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"FairnessDashboard(sensitive_features=A_test, \n",
" y_true=y_test.tolist(),\n",
" y_pred=predictions_dominant)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When using sex as the sensitive feature and accuracy as the metric, we see a Pareto front forming - the set of predictors which represent optimal tradeoffs between accuracy and disparity in predictions. In the ideal case, we would have a predictor at (1,0) - perfectly accurate and without any unfairness under demographic parity (with respect to the protected attribute \"sex\"). The Pareto front represents the closest we can come to this ideal based on our data and choice of estimator. Note the range of the axes - the disparity axis covers more values than the accuracy axis, so we can reduce disparity substantially for a small loss in accuracy. Finally, we also see that the unmitigated model is towards the top right of the plot, with high accuracy but the worst disparity.\n",
"\n",
"By clicking on individual models on the plot, we can inspect their metrics for disparity and accuracy in greater detail. In a real example, we would then pick the model which represented the best trade-off between accuracy and disparity given the relevant business constraints."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"AzureUpload\"></a>\n",
"## Uploading a Fairness Dashboard to Azure\n",
"\n",
"Uploading a fairness dashboard to Azure is a three-stage process. The `FairnessDashboard` invoked in the previous section relies on the underlying Python kernel to compute metrics on demand. No kernel is available when the fairness dashboard is rendered in AzureML Studio, so the metrics must be computed beforehand. By default, the dashboard in Azure Machine Learning Studio also requires the models to be registered. The required stages are therefore:\n",
"1. Register the dominant models\n",
"1. Precompute all the required metrics\n",
"1. Upload to Azure\n",
"\n",
"Before that, we need to connect to Azure Machine Learning Studio:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Workspace, Experiment, Model\n",
"\n",
"ws = Workspace.from_config()\n",
"ws.get_details()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"RegisterModels\"></a>\n",
"### Registering Models\n",
"\n",
"The fairness dashboard is designed to integrate with registered models, so we need to do this for the models we want in the Studio portal. The assumption is that the names of the models specified in the dashboard dictionary correspond to the `id`s (i.e. `<name>:<version>` pairs) of registered models in the workspace. We register each of the models in the `models_dominant` dictionary into the workspace. For this, we have to save each model to a file, and then register that file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import joblib\n",
"import os\n",
"\n",
"os.makedirs('models', exist_ok=True)\n",
"def register_model(name, model):\n",
"    print(\"Registering \", name)\n",
"    model_path = \"models/{0}.pkl\".format(name)\n",
"    joblib.dump(value=model, filename=model_path)\n",
"    registered_model = Model.register(model_path=model_path,\n",
"                                      model_name=name,\n",
"                                      workspace=ws)\n",
"    print(\"Registered \", registered_model.id)\n",
"    return registered_model.id\n",
"\n",
"model_name_id_mapping = dict()\n",
"for name, model in models_dominant.items():\n",
"    m_id = register_model(name, model)\n",
"    model_name_id_mapping[name] = m_id"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we produce a new predictions dictionary, keyed by the `id`s of the registered models rather than by the original names:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predictions_dominant_ids = dict()\n",
"for name, y_pred in predictions_dominant.items():\n",
"    predictions_dominant_ids[model_name_id_mapping[name]] = y_pred"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"PrecomputeMetrics\"></a>\n",
"### Precomputing Metrics\n",
"\n",
"We create a _dashboard dictionary_ using Fairlearn's `metrics` package. The `_create_group_metric_set` method has arguments similar to the Dashboard constructor, except that the sensitive features are passed as a dictionary (to ensure that names are available), and we must specify the type of prediction. Note that we use the `predictions_dominant_ids` dictionary we just created:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sf = { 'sex': A_test.sex, 'race': A_test.race }\n",
"\n",
"from fairlearn.metrics._group_metric_set import _create_group_metric_set\n",
"\n",
"\n",
"dash_dict = _create_group_metric_set(y_true=y_test,\n",
" predictions=predictions_dominant_ids,\n",
" sensitive_features=sf,\n",
" prediction_type='binary_classification')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"DashboardUpload\"></a>\n",
"### Uploading the Dashboard\n",
"\n",
"Now, we import our `contrib` package which contains the routine to perform the upload:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.contrib.fairness import upload_dashboard_dictionary, download_dashboard_by_upload_id"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can create an Experiment, then a Run, and upload our dashboard to it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"exp = Experiment(ws, \"Test_Fairlearn_GridSearch_Census_Demo\")\n",
"print(exp)\n",
"\n",
"run = exp.start_logging()\n",
"try:\n",
"    dashboard_title = \"Dominant Models from GridSearch\"\n",
"    upload_id = upload_dashboard_dictionary(run,\n",
"                                            dash_dict,\n",
"                                            dashboard_name=dashboard_title)\n",
"    print(\"\\nUploaded to id: {0}\\n\".format(upload_id))\n",
"\n",
"    downloaded_dict = download_dashboard_by_upload_id(run, upload_id)\n",
"finally:\n",
"    run.complete()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dashboard can be viewed in the Run Details page.\n",
"\n",
"Finally, we can verify that the dashboard dictionary which we downloaded matches our upload:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(dash_dict == downloaded_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"Conclusion\"></a>\n",
"## Conclusion\n",
"\n",
"In this notebook we have demonstrated how to use the `GridSearch` algorithm from Fairlearn to generate a collection of models, and then present them in the fairness dashboard in Azure Machine Learning Studio. Please remember that this notebook has not attempted to discuss the many considerations which should be part of any approach to unfairness mitigation. The [Fairlearn website](http://fairlearn.org/) provides that discussion."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"authors": [
{
"name": "riedgar"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
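
The Pareto front described in the notebook above can be sketched in plain Python. This is a minimal illustration, not the method used by Fairlearn or by the dashboard: the function name `pareto_front` and the sample (accuracy, disparity) values are hypothetical. It assumes a model is dominated when some other model is at least as accurate and has at most the same disparity, and is strictly better on one of the two.

```python
def pareto_front(models):
    """Return the names of models that no other model dominates.

    `models` maps a name to an (accuracy, disparity) pair. Higher accuracy
    and lower disparity are both treated as better.
    """
    front = []
    for name, (acc, disp) in models.items():
        dominated = any(
            other_acc >= acc and other_disp <= disp
            and (other_acc > acc or other_disp < disp)
            for other, (other_acc, other_disp) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical values, for illustration only.
example_models = {
    "unmitigated": (0.85, 0.20),
    "grid_3": (0.84, 0.08),
    "grid_7": (0.82, 0.02),
    "grid_9": (0.80, 0.10),  # worse than grid_3 on both axes
}
front_names = pareto_front(example_models)
print(front_names)
```

In this made-up example the unmitigated model stays on the front because nothing beats its accuracy, which matches the notebook's observation that it sits at the high-accuracy, high-disparity end of the plot.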

View File

@@ -1,12 +0,0 @@
name: fairlearn-azureml-mitigation
dependencies:
- pip:
  - azureml-sdk
  - azureml-contrib-fairness
  - fairlearn>=0.6.2
  - joblib
  - liac-arff
  - raiwidgets~=0.19.0
  - itsdangerous==2.0.1
  - markupsafe<2.1.0
  - protobuf==3.20.0

View File

@@ -1,111 +0,0 @@
# ---------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# ---------------------------------------------------------
"""Utilities for azureml-contrib-fairness notebooks."""
import arff
from collections import OrderedDict
from contextlib import closing
import gzip
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.utils import Bunch
import time


def fetch_openml_with_retries(data_id, max_retries=4, retry_delay=60):
    """Fetch a given dataset from OpenML with retries as specified."""
    for i in range(max_retries):
        try:
            print("Download attempt {0} of {1}".format(i + 1, max_retries))
            data = fetch_openml(data_id=data_id, as_frame=True)
            break
        except Exception as e:  # noqa: B902
            print("Download attempt failed with exception:")
            print(e)
            if i + 1 != max_retries:
                print("Will retry after {0} seconds".format(retry_delay))
                time.sleep(retry_delay)
                retry_delay = retry_delay * 2
            else:
                raise RuntimeError("Unable to download dataset from OpenML")

    return data


_categorical_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
]


def fetch_census_dataset():
    """Fetch the Adult Census Dataset.

    This uses a particular URL for the Adult Census dataset. The code
    is a simplified version of fetch_openml() in sklearn.

    The data are copied from:
    https://openml.org/data/v1/download/1595261.gz
    (as of 2021-03-31)
    """
    try:
        from urllib import urlretrieve
    except ImportError:
        from urllib.request import urlretrieve

    filename = "1595261.gz"
    data_url = "https://rainotebookscdn.blob.core.windows.net/datasets/"

    remaining_attempts = 5
    sleep_duration = 10
    while remaining_attempts > 0:
        try:
            urlretrieve(data_url + filename, filename)

            http_stream = gzip.GzipFile(filename=filename, mode='rb')

            with closing(http_stream):
                def _stream_generator(response):
                    for line in response:
                        yield line.decode('utf-8')

                stream = _stream_generator(http_stream)
                data = arff.load(stream)
        except Exception as exc:  # noqa: B902
            remaining_attempts -= 1
            print("Error downloading dataset from {} ({} attempt(s) remaining)"
                  .format(data_url, remaining_attempts))
            print(exc)
            time.sleep(sleep_duration)
            sleep_duration *= 2
            continue
        else:
            # dataset successfully downloaded
            break
    else:
        raise Exception("Could not retrieve dataset from {}.".format(data_url))

    attributes = OrderedDict(data['attributes'])
    arff_columns = list(attributes)

    raw_df = pd.DataFrame(data=data['data'], columns=arff_columns)

    target_column_name = 'class'
    target = raw_df.pop(target_column_name)
    for col_name in _categorical_columns:
        dtype = pd.api.types.CategoricalDtype(attributes[col_name])
        raw_df[col_name] = raw_df[col_name].astype(dtype, copy=False)

    result = Bunch()
    result.data = raw_df
    result.target = target

    return result
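
Both helpers above retry a failing download and double the delay after each failure. That pattern can be isolated as a small generic helper. The sketch below is not part of the removed module: `retry_with_backoff` and its `sleep` parameter are hypothetical names, introduced so the logic can be exercised without network access or real waiting.

```python
import time


def retry_with_backoff(action, max_retries=4, retry_delay=60, sleep=time.sleep):
    """Call `action` until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_retries):
        try:
            return action()
        except Exception as exc:  # broad on purpose, mirroring the helpers above
            print("Attempt {0} of {1} failed: {2}".format(
                attempt + 1, max_retries, exc))
            if attempt + 1 == max_retries:
                raise RuntimeError("All retries failed") from exc
            sleep(retry_delay)
            retry_delay *= 2


# Usage with a fake action that fails twice. The delays are recorded in a
# list instead of being slept, so the example finishes immediately.
calls = {"count": 0}
delays = []


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("simulated failure")
    return "ok"


result = retry_with_backoff(flaky, max_retries=4, retry_delay=1, sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter is a design choice made only for this sketch. The original helpers call `time.sleep` directly.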

View File

@@ -1,545 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/contrib/fairness/upload-fairness-dashboard.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Upload a Fairness Dashboard to Azure Machine Learning Studio\n",
"**This notebook shows how to generate and upload a fairness assessment dashboard from Fairlearn to AzureML Studio**\n",
"\n",
"## Table of Contents\n",
"\n",
"1. [Introduction](#Introduction)\n",
"1. [Loading the Data](#LoadingData)\n",
"1. [Processing the Data](#ProcessingData)\n",
"1. [Training Models](#TrainingModels)\n",
"1. [Logging in to AzureML](#LoginAzureML)\n",
"1. [Registering the Models](#RegisterModels)\n",
"1. [Using the Fairness Dashboard](#LocalDashboard)\n",
"1. [Uploading a Fairness Dashboard to Azure](#AzureUpload)\n",
"    1. Computing Fairness Metrics\n",
"    1. Uploading to Azure\n",
"1. [Conclusion](#Conclusion)\n",
" \n",
"\n",
"<a id=\"Introduction\"></a>\n",
"## Introduction\n",
"\n",
"In this notebook, we walk through a simple example of using the `azureml-contrib-fairness` package to upload a collection of fairness statistics for a fairness dashboard. It is an example of integrating the [open source Fairlearn package](https://www.github.com/fairlearn/fairlearn) with Azure Machine Learning. This is not an example of fairness analysis or mitigation - this notebook simply shows how to get a fairness dashboard into the Azure Machine Learning portal. We will load the data and train a couple of simple models. We will then use Fairlearn to generate data for a Fairness dashboard, which we can upload to Azure Machine Learning portal and view there.\n",
"\n",
"### Setup\n",
"\n",
"To use this notebook, an Azure Machine Learning workspace is required.\n",
"Please see the [configuration notebook](../../configuration.ipynb) for information about creating one, if required.\n",
"This notebook also requires the following packages:\n",
"* `azureml-contrib-fairness`\n",
"* `fairlearn>=0.6.2` (also works for pre-v0.5.0 with slight modifications)\n",
"* `joblib`\n",
"* `liac-arff`\n",
"* `raiwidgets`\n",
"\n",
"Fairlearn relies on features introduced in v0.22.1 of `scikit-learn`. If you have an older version already installed, please uncomment and run the following cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# !pip install --upgrade \"scikit-learn>=0.22.1\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, please ensure that when you downloaded this notebook, you also downloaded the `fairness_nb_utils.py` file from the same location, and placed it in the same directory as this notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"LoadingData\"></a>\n",
"## Loading the Data\n",
"We use the well-known `adult` census dataset, which we fetch from the OpenML website. We start with a fairly unremarkable set of imports:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import svm\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
"from sklearn.compose import make_column_selector as selector\n",
"from sklearn.pipeline import Pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can load the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fairness_nb_utils import fetch_census_dataset\n",
"\n",
"data = fetch_census_dataset()\n",
" \n",
"# Extract the items we want\n",
"X_raw = data.data\n",
"y = (data.target == '>50K') * 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can take a look at some of the data. For example, the next cell shows the counts of the different races identified in the dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(X_raw[\"race\"].value_counts().to_dict())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"ProcessingData\"></a>\n",
"## Processing the Data\n",
"\n",
"With the data loaded, we process it for our needs. First, we extract the sensitive features of interest into `A` (conventionally used in the literature) and leave the rest of the feature data in `X_raw`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"A = X_raw[['sex','race']]\n",
"X_raw = X_raw.drop(labels=['sex', 'race'],axis = 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now preprocess our data. To avoid the problem of data leakage, we split our data into training and test sets before performing any other transformations. Subsequent transformations (such as scalings) will be fit to the training data set, and then applied to the test dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"(X_train, X_test, y_train, y_test, A_train, A_test) = train_test_split(\n",
" X_raw, y, A, test_size=0.3, random_state=12345, stratify=y\n",
")\n",
"\n",
"# Ensure indices are aligned between X, y and A,\n",
"# after all the slicing and splitting of DataFrames\n",
"# and Series\n",
"\n",
"X_train = X_train.reset_index(drop=True)\n",
"X_test = X_test.reset_index(drop=True)\n",
"y_train = y_train.reset_index(drop=True)\n",
"y_test = y_test.reset_index(drop=True)\n",
"A_train = A_train.reset_index(drop=True)\n",
"A_test = A_test.reset_index(drop=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have two types of column in the dataset - categorical columns which will need to be one-hot encoded, and numeric ones which will need to be rescaled. We also need to take care of missing values. We use a simple approach here, but please bear in mind that this is another way that bias could be introduced (especially if one subgroup tends to have more missing values).\n",
"\n",
"For this preprocessing, we make use of `Pipeline` objects from `sklearn`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"numeric_transformer = Pipeline(\n",
"    steps=[\n",
"        (\"impute\", SimpleImputer()),\n",
"        (\"scaler\", StandardScaler()),\n",
"    ]\n",
")\n",
"\n",
"categorical_transformer = Pipeline(\n",
"    [\n",
"        (\"impute\", SimpleImputer(strategy=\"most_frequent\")),\n",
"        (\"ohe\", OneHotEncoder(handle_unknown=\"ignore\", sparse=False)),\n",
"    ]\n",
")\n",
"\n",
"preprocessor = ColumnTransformer(\n",
"    transformers=[\n",
"        (\"num\", numeric_transformer, selector(dtype_exclude=\"category\")),\n",
"        (\"cat\", categorical_transformer, selector(dtype_include=\"category\")),\n",
"    ]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the preprocessing pipeline is defined, we can fit it on our training data and apply the fitted transform to our test data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train = preprocessor.fit_transform(X_train)\n",
"X_test = preprocessor.transform(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"TrainingModels\"></a>\n",
"## Training Models\n",
"\n",
"We now train a couple of different models on our data. The `adult` census dataset is a classification problem - the goal is to predict whether a particular individual exceeds an income threshold. For the purpose of generating a dashboard to upload, it is sufficient to train two basic classifiers. First, a logistic regression classifier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lr_predictor = LogisticRegression(solver='liblinear', fit_intercept=True)\n",
"\n",
"lr_predictor.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And for comparison, a support vector classifier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"svm_predictor = svm.SVC()\n",
"\n",
"svm_predictor.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"LoginAzureML\"></a>\n",
"## Logging in to AzureML\n",
"\n",
"With our two classifiers trained, we can log into our AzureML workspace:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Workspace, Experiment, Model\n",
"\n",
"ws = Workspace.from_config()\n",
"ws.get_details()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"RegisterModels\"></a>\n",
"## Registering the Models\n",
"\n",
"Next, we register our models. By default, the subroutine which uploads the dashboard checks that the model names provided correspond to registered models in the workspace. We define a utility routine to do the registering:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import joblib\n",
"import os\n",
"\n",
"os.makedirs('models', exist_ok=True)\n",
"def register_model(name, model):\n",
"    print(\"Registering \", name)\n",
"    model_path = \"models/{0}.pkl\".format(name)\n",
"    joblib.dump(value=model, filename=model_path)\n",
"    registered_model = Model.register(model_path=model_path,\n",
"                                      model_name=name,\n",
"                                      workspace=ws)\n",
"    print(\"Registered \", registered_model.id)\n",
"    return registered_model.id"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we register the models. For convenience in subsequent method calls, we store the results in a dictionary, which maps the `id` of the registered model (a string in `name:version` format) to the predictor itself:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model_dict = {}\n",
"\n",
"lr_reg_id = register_model(\"fairness_linear_regression\", lr_predictor)\n",
"model_dict[lr_reg_id] = lr_predictor\n",
"svm_reg_id = register_model(\"fairness_svm\", svm_predictor)\n",
"model_dict[svm_reg_id] = svm_predictor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"LocalDashboard\"></a>\n",
"## Using the Fairlearn Dashboard\n",
"\n",
"We can now examine the fairness of the two models we have trained, both as a function of race and (binary) sex. Before uploading the dashboard to the AzureML portal, we will first instantiate a local instance of the Fairlearn dashboard.\n",
"\n",
"Regardless of the viewing location, the dashboard is based on three things - the true values, the model predictions and the sensitive feature values. The dashboard can use predictions from multiple models and multiple sensitive features if desired (as we are doing here).\n",
"\n",
"Our first step is to generate a dictionary mapping the `id` of the registered model to the corresponding array of predictions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ys_pred = {}\n",
"for n, p in model_dict.items():\n",
"    ys_pred[n] = p.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can examine these predictions in a locally invoked Fairlearn dashboard. This can be compared to the dashboard uploaded to the portal (in the next section):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from raiwidgets import FairnessDashboard\n",
"\n",
"FairnessDashboard(sensitive_features=A_test, \n",
" y_true=y_test.tolist(),\n",
" y_pred=ys_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"AzureUpload\"></a>\n",
"## Uploading a Fairness Dashboard to Azure\n",
"\n",
"Uploading a fairness dashboard to Azure is a two-stage process. The `FairnessDashboard` invoked in the previous section relies on the underlying Python kernel to compute metrics on demand. No kernel is available when the fairness dashboard is rendered in AzureML Studio, so the metrics must be computed beforehand. The required stages are therefore:\n",
"1. Precompute all the required metrics\n",
"1. Upload to Azure\n",
"\n",
"\n",
"### Computing Fairness Metrics\n",
"We use Fairlearn to create a dictionary which contains all the data required to display a dashboard. This includes both the raw data (true values, predicted values and sensitive features), and also the fairness metrics. The API is similar to that used to invoke the Dashboard locally. However, there are a few minor changes to the API, and the type of problem being examined (binary classification, regression etc.) needs to be specified explicitly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sf = { 'Race': A_test.race, 'Sex': A_test.sex }\n",
"\n",
"from fairlearn.metrics._group_metric_set import _create_group_metric_set\n",
"\n",
"dash_dict = _create_group_metric_set(y_true=y_test,\n",
" predictions=ys_pred,\n",
" sensitive_features=sf,\n",
" prediction_type='binary_classification')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `_create_group_metric_set()` method is currently underscored since its exact design is not yet final in Fairlearn."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Uploading to Azure\n",
"\n",
"We can now import the `azureml.contrib.fairness` package itself. We will round-trip the data, so there are two required subroutines:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.contrib.fairness import upload_dashboard_dictionary, download_dashboard_by_upload_id"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can upload the generated dictionary to AzureML. The upload method requires a run, so we first create an experiment and a run. The uploaded dashboard can be seen on the corresponding Run Details page in AzureML Studio. For completeness, we also download the dashboard dictionary which we uploaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"exp = Experiment(ws, \"notebook-01\")\n",
"print(exp)\n",
"\n",
"run = exp.start_logging()\n",
"try:\n",
"    dashboard_title = \"Sample notebook upload\"\n",
"    upload_id = upload_dashboard_dictionary(run,\n",
"                                            dash_dict,\n",
"                                            dashboard_name=dashboard_title)\n",
"    print(\"\\nUploaded to id: {0}\\n\".format(upload_id))\n",
"\n",
"    downloaded_dict = download_dashboard_by_upload_id(run, upload_id)\n",
"finally:\n",
"    run.complete()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can verify that the dashboard dictionary which we downloaded matches our upload:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(dash_dict == downloaded_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"Conclusion\"></a>\n",
"## Conclusion\n",
"\n",
"In this notebook we have demonstrated how to generate and upload a fairness dashboard to AzureML Studio. We have not discussed how to analyse the results and apply mitigations. Those topics will be covered elsewhere."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"authors": [
{
"name": "riedgar"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
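
What "precomputing metrics" amounts to can be sketched without Fairlearn. The snippet below is not `_create_group_metric_set` and does not reproduce its output format: `group_accuracy` and `selection_rate_difference` are hypothetical helpers, and the data are made up. It only illustrates the kind of per-group statistics that must be calculated ahead of time when no Python kernel is available to the dashboard.

```python
from collections import defaultdict


def group_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately for each sensitive-feature value."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for yt, yp, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(yt == yp)
    return {g: correct[g] / total[g] for g in total}


def selection_rate_difference(y_pred, groups):
    """Largest gap between groups in the rate of positive predictions."""
    positives = defaultdict(int)
    total = defaultdict(int)
    for yp, g in zip(y_pred, groups):
        total[g] += 1
        positives[g] += int(yp == 1)
    rates = [positives[g] / total[g] for g in total]
    return max(rates) - min(rates)


# Tiny made-up data set, for illustration only.
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
sex = ["F", "F", "F", "M", "M", "M"]

acc_by_group = group_accuracy(y_true, y_pred, sex)
rate_gap = selection_rate_difference(y_pred, sex)
print(acc_by_group, rate_gap)
```

Here both groups have the same accuracy, yet the positive-prediction rates differ by one third. That gap is what a demographic-parity style disparity metric measures.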

View File

@@ -1,12 +0,0 @@
name: upload-fairness-dashboard
dependencies:
- pip:
  - azureml-sdk
  - azureml-contrib-fairness
  - fairlearn>=0.6.2
  - joblib
  - liac-arff
  - raiwidgets~=0.19.0
  - itsdangerous==2.0.1
  - markupsafe<2.1.0
  - protobuf==3.20.0

View File

@@ -9,7 +9,6 @@ As a pre-requisite, run the [configuration Notebook](../configuration.ipynb) not
* [train-on-amlcompute](./training/train-on-amlcompute): Use a 1-n node Azure ML managed compute cluster for remote runs on Azure CPU or GPU infrastructure.
* [train-on-remote-vm](./training/train-on-remote-vm): Use Data Science Virtual Machine as a target for remote runs.
* [logging-api](./track-and-monitor-experiments/logging-api): Learn about the details of logging metrics to run history.
* [production-deploy-to-aks](./deployment/production-deploy-to-aks) Deploy a model to production at scale on Azure Kubernetes Service.
* [enable-app-insights-in-production-service](./deployment/enable-app-insights-in-production-service) Learn how to use App Insights with production web service.
Find quickstarts, end-to-end tutorials, and how-tos on the [official documentation site for Azure Machine Learning service](https://docs.microsoft.com/en-us/azure/machine-learning/service/).

View File

@@ -5,29 +5,22 @@ channels:
- main
dependencies:
# The python interpreter version.
# Currently Azure ML only supports 3.6.0 and later.
- pip==20.2.4
- python>=3.6,<3.9
- matplotlib==3.2.1
- py-xgboost==1.3.3
- pytorch::pytorch=1.4.0
- conda-forge::fbprophet==0.7.1
- cudatoolkit=10.1.243
- scipy==1.5.3
- notebook
- pywin32==227
- PySocks==1.7.1
- conda-forge::pyqt==5.12.3
- jsonschema==4.9.1
- Pygments==2.12.0
# Azure ML only supports 3.8 and later.
- pip==22.3.1
- python>=3.10,<3.11
- holidays==0.29
- scipy==1.10.1
- tqdm==4.66.1
- pip:
# Required packages for AzureML execution, history, and data preparation.
- azureml-widgets~=1.44.0
- pytorch-transformers==1.0.0
- spacy==2.2.4
- pystan==2.19.1.1
- https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.44.0/validated_win32_requirements.txt [--no-deps]
- arch==4.14
- wasabi==0.9.1
- azureml-widgets~=1.59.0
- azureml-defaults~=1.59.0
- -r https://automlcesdkdataresources.blob.core.windows.net/validated-requirements/1.59.0/validated_win32_requirements.txt [--no-deps]
- matplotlib==3.7.1
- xgboost==1.5.2
- prophet==1.1.4
- onnx==1.16.1
- setuptools-git==1.2
- spacy==3.7.4
- https://aka.ms/automl-resources/packages/en_core_web_sm-3.7.1.tar.gz

View File

@@ -5,29 +5,26 @@ channels:
- main
dependencies:
# The python interpreter version.
# Currently Azure ML only supports 3.6.0 and later.
- pip==20.2.4
- python>=3.6,<3.9
- boto3==1.20.19
- botocore<=1.23.19
- matplotlib==3.2.1
- numpy>=1.21.6,<=1.22.3
- cython==0.29.14
# Azure ML only supports 3.7 and later.
- pip==22.3.1
- python>=3.10,<3.11
- matplotlib==3.7.1
- numpy>=1.21.6,<=1.23.5
- urllib3==1.26.7
- scipy>=1.4.1,<=1.5.3
- scikit-learn==0.22.1
- py-xgboost<=1.3.3
- holidays==0.10.3
- conda-forge::fbprophet==0.7.1
- pytorch::pytorch=1.4.0
- scipy==1.10.1
- scikit-learn==1.5.1
- holidays==0.29
- pytorch::pytorch=1.11.0
- cudatoolkit=10.1.243
- notebook
- pip:
# Required packages for AzureML execution, history, and data preparation.
- azureml-widgets~=1.44.0
- azureml-widgets~=1.59.0
- azureml-defaults~=1.59.0
- pytorch-transformers==1.0.0
- spacy==2.2.4
- pystan==2.19.1.1
- https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.44.0/validated_linux_requirements.txt [--no-deps]
- arch==4.14
- spacy==3.7.4
- xgboost==1.5.2
- prophet==1.1.4
- https://aka.ms/automl-resources/packages/en_core_web_sm-3.7.1.tar.gz
- -r https://automlcesdkdataresources.blob.core.windows.net/validated-requirements/1.59.0/validated_linux_requirements.txt [--no-deps]

View File

@@ -5,30 +5,22 @@ channels:
- main
dependencies:
# The python interpreter version.
# Currently Azure ML only supports 3.6.0 and later.
- pip==20.2.4
- nomkl
- python>=3.6,<3.9
- boto3==1.20.19
- botocore<=1.23.19
- matplotlib==3.2.1
- numpy>=1.21.6,<=1.22.3
- cython==0.29.14
- urllib3==1.26.7
- scipy>=1.4.1,<=1.5.3
- scikit-learn==0.22.1
- py-xgboost<=1.3.3
- holidays==0.10.3
- conda-forge::fbprophet==0.7.1
- pytorch::pytorch=1.4.0
- cudatoolkit=9.0
# Currently Azure ML only supports 3.7 and later.
- pip==22.3.1
- python>=3.10,<3.11
- numpy>=1.21.6,<=1.23.5
- scipy==1.10.1
- scikit-learn==1.5.1
- holidays==0.29
- pip:
# Required packages for AzureML execution, history, and data preparation.
- azureml-widgets~=1.44.0
- azureml-widgets~=1.59.0
- azureml-defaults~=1.59.0
- pytorch-transformers==1.0.0
- spacy==2.2.4
- pystan==2.19.1.1
- https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.44.0/validated_darwin_requirements.txt [--no-deps]
- arch==4.14
- prophet==1.1.4
- xgboost==1.5.2
- spacy==3.7.4
- matplotlib==3.7.1
- https://aka.ms/automl-resources/packages/en_core_web_sm-3.7.1.tar.gz
- -r https://automlcesdkdataresources.blob.core.windows.net/validated-requirements/1.59.0/validated_darwin_requirements.txt [--no-deps]

View File

@@ -33,6 +33,8 @@ if not errorlevel 1 (
call conda env create -f %automl_env_file% -n %conda_env_name%
)
python "%conda_prefix%\scripts\pywin32_postinstall.py" -install
call conda activate %conda_env_name% 2>nul:
if errorlevel 1 goto ErrorExit

View File

@@ -1,4 +1,4 @@
from distutils.version import LooseVersion
from setuptools._vendor.packaging import version
import platform
try:
@@ -17,7 +17,7 @@ if architecture != "64bit":
minimumVersion = "4.7.8"
versionInvalid = (LooseVersion(conda.__version__) < LooseVersion(minimumVersion))
versionInvalid = (version.parse(conda.__version__) < version.parse(minimumVersion))
if versionInvalid:
print('Setup requires conda version ' + minimumVersion + ' or higher.')

View File

@@ -1,5 +1,21 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/classification-bank-marketing-all-features/auto-ml-classification-bank-marketing-all-features.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -77,7 +93,8 @@
"from azureml.core.workspace import Workspace\n",
"from azureml.core.dataset import Dataset\n",
"from azureml.train.automl import AutoMLConfig\n",
"from azureml.interpret import ExplanationClient"
"from azureml.interpret import ExplanationClient\n",
"from azureml.data.datapath import DataPath"
]
},
{
@@ -250,10 +267,12 @@
"pd.DataFrame(data).to_csv(\"data/train_data.csv\", index=False)\n",
"\n",
"ds = ws.get_default_datastore()\n",
"ds.upload(\n",
" src_dir=\"./data\", target_path=\"bankmarketing\", overwrite=True, show_progress=True\n",
"target = DataPath(\n",
" datastore=ds, path_on_datastore=\"bankmarketing/train_data.csv\", name=\"bankmarketing\"\n",
")\n",
"Dataset.File.upload_directory(\n",
" src_dir=\"./data\", target=target, overwrite=True, show_progress=True\n",
")\n",
"\n",
"\n",
"# Upload the training data as a tabular dataset for access during training on remote compute\n",
"train_data = Dataset.Tabular.from_delimited_files(\n",
@@ -712,7 +731,9 @@
"from azureml.core.model import Model\n",
"from azureml.core.environment import Environment\n",
"\n",
"inference_config = InferenceConfig(entry_script=script_file_name)\n",
"inference_config = InferenceConfig(\n",
" environment=best_run.get_environment(), entry_script=script_file_name\n",
")\n",
"\n",
"aciconfig = AciWebservice.deploy_configuration(\n",
" cpu_cores=2,\n",
@@ -828,9 +849,7 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"metadata": {},
"outputs": [],
"source": [
"%matplotlib notebook\n",
@@ -1060,9 +1079,9 @@
"name": "python3-azureml"
},
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
@@ -1074,7 +1093,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
"version": "3.10.14"
},
"nteract": {
"version": "nteract-front-end@1.0.0"
@@ -1088,5 +1107,5 @@
"task": "Classification"
},
"nbformat": 4,
"nbformat_minor": 1
"nbformat_minor": 4
}

View File

@@ -1,4 +0,0 @@
name: auto-ml-classification-bank-marketing-all-features
dependencies:
- pip:
- azureml-sdk

View File

@@ -1,5 +1,21 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/classification-credit-card-fraud/auto-ml-classification-credit-card-fraud.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -456,9 +472,9 @@
"friendly_name": "Classification of credit card fraudulent transactions using Automated ML",
"index_order": 5,
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {

View File

@@ -1,4 +0,0 @@
name: auto-ml-classification-credit-card-fraud
dependencies:
- pip:
- azureml-sdk

View File

@@ -1,593 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automated Machine Learning\n",
"_**Text Classification Using Deep Learning**_\n",
"\n",
"## Contents\n",
"1. [Introduction](#Introduction)\n",
"1. [Setup](#Setup)\n",
"1. [Data](#Data)\n",
"1. [Train](#Train)\n",
"1. [Evaluate](#Evaluate)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"This notebook demonstrates classification with text data using deep learning in AutoML.\n",
"\n",
"AutoML highlights here include using deep neural networks (DNNs) to create embedded features from text data. Depending on the compute cluster the user provides, AutoML tried out Bidirectional Encoder Representations from Transformers (BERT) when a GPU compute is used, and Bidirectional Long-Short Term neural network (BiLSTM) when a CPU compute is used, thereby optimizing the choice of DNN for the uesr's setup.\n",
"\n",
"Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n",
"\n",
"Notebook synopsis:\n",
"\n",
"1. Creating an Experiment in an existing Workspace\n",
"2. Configuration and remote run of AutoML for a text dataset (20 Newsgroups dataset from scikit-learn) for classification\n",
"3. Registering the best model for future use\n",
"4. Evaluating the final model on a test set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import logging\n",
"import os\n",
"import shutil\n",
"\n",
"import pandas as pd\n",
"\n",
"import azureml.core\n",
"from azureml.core.experiment import Experiment\n",
"from azureml.core.workspace import Workspace\n",
"from azureml.core.dataset import Dataset\n",
"from azureml.core.compute import AmlCompute\n",
"from azureml.core.compute import ComputeTarget\n",
"from azureml.core.run import Run\n",
"from azureml.widgets import RunDetails\n",
"from azureml.core.model import Model\n",
"from helper import run_inference, get_result_df\n",
"from azureml.train.automl import AutoMLConfig\n",
"from sklearn.datasets import fetch_20newsgroups"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This sample notebook may use features that are not available in previous versions of the Azure ML SDK."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As part of the setup you have already created a <b>Workspace</b>. To run AutoML, you also need to create an <b>Experiment</b>. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# Choose an experiment name.\n",
"experiment_name = \"automl-classification-text-dnn\"\n",
"\n",
"experiment = Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output[\"Subscription ID\"] = ws.subscription_id\n",
"output[\"Workspace Name\"] = ws.name\n",
"output[\"Resource Group\"] = ws.resource_group\n",
"output[\"Location\"] = ws.location\n",
"output[\"Experiment Name\"] = experiment.name\n",
"output[\"SDK Version\"] = azureml.core.VERSION\n",
"pd.set_option(\"display.max_colwidth\", None)\n",
"outputDf = pd.DataFrame(data=output, index=[\"\"])\n",
"outputDf.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up a compute cluster\n",
"This section uses a user-provided compute cluster (named \"dnntext-cluster\" in this example). If a cluster with this name does not exist in the user's workspace, the below code will create a new cluster. You can choose the parameters of the cluster as mentioned in the comments.\n",
"\n",
"> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.\n",
"\n",
"Whether you provide/select a CPU or GPU cluster, AutoML will choose the appropriate DNN for that setup - BiLSTM or BERT text featurizer will be included in the candidate featurizers on CPU and GPU respectively. If your goal is to obtain the most accurate model, we recommend you use GPU clusters since BERT featurizers usually outperform BiLSTM featurizers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"num_nodes = 2\n",
"\n",
"# Choose a name for your cluster.\n",
"amlcompute_cluster_name = \"dnntext-cluster\"\n",
"\n",
"# Verify that cluster does not exist already\n",
"try:\n",
" compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)\n",
" print(\"Found existing cluster, use it.\")\n",
"except ComputeTargetException:\n",
" compute_config = AmlCompute.provisioning_configuration(\n",
" vm_size=\"STANDARD_NC6\", # CPU for BiLSTM, such as \"STANDARD_D2_V2\"\n",
" # To use BERT (this is recommended for best performance), select a GPU such as \"STANDARD_NC6\"\n",
" # or similar GPU option\n",
" # available in your workspace\n",
" idle_seconds_before_scaledown=60,\n",
" max_nodes=num_nodes,\n",
" )\n",
" compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)\n",
"\n",
"compute_target.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get data\n",
"For this notebook we will use 20 Newsgroups data from scikit-learn. We filter the data to contain four classes and take a sample as training data. Please note that for accuracy improvement, more data is needed. For this notebook we provide a small-data example so that you can use this template to use with your larger sized data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_dir = \"text-dnn-data\" # Local directory to store data\n",
"blobstore_datadir = data_dir # Blob store directory to store data in\n",
"target_column_name = \"y\"\n",
"feature_column_name = \"X\"\n",
"\n",
"\n",
"def get_20newsgroups_data():\n",
" \"\"\"Fetches 20 Newsgroups data from scikit-learn\n",
" Returns them in form of pandas dataframes\n",
" \"\"\"\n",
" remove = (\"headers\", \"footers\", \"quotes\")\n",
" categories = [\n",
" \"rec.sport.baseball\",\n",
" \"rec.sport.hockey\",\n",
" \"comp.graphics\",\n",
" \"sci.space\",\n",
" ]\n",
"\n",
" data = fetch_20newsgroups(\n",
" subset=\"train\",\n",
" categories=categories,\n",
" shuffle=True,\n",
" random_state=42,\n",
" remove=remove,\n",
" )\n",
" data = pd.DataFrame(\n",
" {feature_column_name: data.data, target_column_name: data.target}\n",
" )\n",
"\n",
" data_train = data[:200]\n",
" data_test = data[200:300]\n",
"\n",
" data_train = remove_blanks_20news(\n",
" data_train, feature_column_name, target_column_name\n",
" )\n",
" data_test = remove_blanks_20news(data_test, feature_column_name, target_column_name)\n",
"\n",
" return data_train, data_test\n",
"\n",
"\n",
"def remove_blanks_20news(data, feature_column_name, target_column_name):\n",
"\n",
" for index, row in data.iterrows():\n",
" data.at[index, feature_column_name] = (\n",
" row[feature_column_name].replace(\"\\n\", \" \").strip()\n",
" )\n",
"\n",
" data = data[data[feature_column_name] != \"\"]\n",
"\n",
" return data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Fetch data and upload to datastore for use in training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_train, data_test = get_20newsgroups_data()\n",
"\n",
"if not os.path.isdir(data_dir):\n",
" os.mkdir(data_dir)\n",
"\n",
"train_data_fname = data_dir + \"/train_data.csv\"\n",
"test_data_fname = data_dir + \"/test_data.csv\"\n",
"\n",
"data_train.to_csv(train_data_fname, index=False)\n",
"data_test.to_csv(test_data_fname, index=False)\n",
"\n",
"datastore = ws.get_default_datastore()\n",
"datastore.upload(src_dir=data_dir, target_path=blobstore_datadir, overwrite=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_dataset = Dataset.Tabular.from_delimited_files(\n",
" path=[(datastore, blobstore_datadir + \"/train_data.csv\")]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare AutoML run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook uses the blocked_models parameter to exclude some models that can take a longer time to train on some text datasets. You can choose to remove models from the blocked_models list but you may need to increase the experiment_timeout_hours parameter value to get results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_settings = {\n",
" \"experiment_timeout_minutes\": 30,\n",
" \"primary_metric\": \"accuracy\",\n",
" \"max_concurrent_iterations\": num_nodes,\n",
" \"max_cores_per_iteration\": -1,\n",
" \"enable_dnn\": True,\n",
" \"enable_early_stopping\": True,\n",
" \"validation_size\": 0.3,\n",
" \"verbosity\": logging.INFO,\n",
" \"enable_voting_ensemble\": False,\n",
" \"enable_stack_ensemble\": False,\n",
"}\n",
"\n",
"automl_config = AutoMLConfig(\n",
" task=\"classification\",\n",
" debug_log=\"automl_errors.log\",\n",
" compute_target=compute_target,\n",
" training_data=train_dataset,\n",
" label_column_name=target_column_name,\n",
" blocked_models=[\"LightGBM\", \"XGBoostClassifier\"],\n",
" **automl_settings,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit AutoML Run"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_run = experiment.submit(automl_config, show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Displaying the run objects gives you links to the visual tools in the Azure Portal. Go try them!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve the Best Model\n",
"Below we select the best model pipeline from our iterations, use it to test on test data on the same compute cluster."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For local inferencing, you can load the model locally via. the method `remote_run.get_output()`. For more information on the arguments expected by this method, you can run `remote_run.get_output??`.\n",
"Note that when the model contains BERT, this step will require pytorch and pytorch-transformers installed in your local environment. The exact versions of these packages can be found in the **automl_env.yml** file located in the local copy of your azureml-examples folder here: \"azureml-examples/python-sdk/tutorials/automl-with-azureml\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve the best Run object\n",
"best_run = automl_run.get_best_child()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can now see what text transformations are used to convert text data to features for this dataset, including deep learning transformations based on BiLSTM or Transformer (BERT is one implementation of a Transformer) models."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Download the featurization summary JSON file locally\n",
"best_run.download_file(\n",
" \"outputs/featurization_summary.json\", \"featurization_summary.json\"\n",
")\n",
"\n",
"# Render the JSON as a pandas DataFrame\n",
"with open(\"featurization_summary.json\", \"r\") as f:\n",
" records = json.load(f)\n",
"\n",
"featurization_summary = pd.DataFrame.from_records(records)\n",
"featurization_summary[\"Transformations\"].tolist()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Registering the best model\n",
"We now register the best fitted model from the AutoML Run for use in future deployments. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get results stats, extract the best model from AutoML run, download and register the resultant best model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"summary_df = get_result_df(automl_run)\n",
"best_dnn_run_id = summary_df[\"run_id\"].iloc[0]\n",
"best_dnn_run = Run(experiment, best_dnn_run_id)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model_dir = \"Model\" # Local folder where the model will be stored temporarily\n",
"if not os.path.isdir(model_dir):\n",
" os.mkdir(model_dir)\n",
"\n",
"best_dnn_run.download_file(\"outputs/model.pkl\", model_dir + \"/model.pkl\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Register the model in your Azure Machine Learning Workspace. If you previously registered a model, please make sure to delete it so as to replace it with this new model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Register the model\n",
"model_name = \"textDNN-20News\"\n",
"model = Model.register(\n",
" model_path=model_dir + \"/model.pkl\", model_name=model_name, tags=None, workspace=ws\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate on Test Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now use the best fitted model from the AutoML Run to make predictions on the test set. \n",
"\n",
"Test set schema should match that of the training set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_dataset = Dataset.Tabular.from_delimited_files(\n",
" path=[(datastore, blobstore_datadir + \"/test_data.csv\")]\n",
")\n",
"\n",
"# preview the first 3 rows of the dataset\n",
"test_dataset.take(3).to_pandas_dataframe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_experiment = Experiment(ws, experiment_name + \"_test\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"script_folder = os.path.join(os.getcwd(), \"inference\")\n",
"os.makedirs(script_folder, exist_ok=True)\n",
"shutil.copy(\"infer.py\", script_folder)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_run = run_inference(\n",
" test_experiment,\n",
" compute_target,\n",
" script_folder,\n",
" best_dnn_run,\n",
" test_dataset,\n",
" target_column_name,\n",
" model_name,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display computed metrics"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_run"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RunDetails(test_run).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_run.wait_for_completion()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.Series(test_run.get_metrics())"
]
}
],
"metadata": {
"authors": [
{
"name": "anshirga"
}
],
"compute": [
"AML Compute"
],
"datasets": [
"None"
],
"deployment": [
"None"
],
"exclude_from_index": false,
"framework": [
"None"
],
"friendly_name": "DNN Text Featurization",
"index_order": 2,
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"tags": [
"None"
],
"task": "Text featurization using DNNs for classification"
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,4 +0,0 @@
name: auto-ml-classification-text-dnn
dependencies:
- pip:
- azureml-sdk

View File

@@ -1,70 +0,0 @@
import pandas as pd
from azureml.core import Environment, ScriptRunConfig
from azureml.core.run import Run
def run_inference(
test_experiment,
compute_target,
script_folder,
train_run,
test_dataset,
target_column_name,
model_name,
):
inference_env = train_run.get_environment()
est = ScriptRunConfig(
source_directory=script_folder,
script="infer.py",
arguments=[
"--target_column_name",
target_column_name,
"--model_name",
model_name,
"--input-data",
test_dataset.as_named_input("data"),
],
compute_target=compute_target,
environment=inference_env,
)
run = test_experiment.submit(
est,
tags={
"training_run_id": train_run.id,
"run_algorithm": train_run.properties["run_algorithm"],
"valid_score": train_run.properties["score"],
"primary_metric": train_run.properties["primary_metric"],
},
)
run.log("run_algorithm", run.tags["run_algorithm"])
return run
def get_result_df(remote_run):
children = list(remote_run.get_children(recursive=True))
summary_df = pd.DataFrame(
index=["run_id", "run_algorithm", "primary_metric", "Score"]
)
goal_minimize = False
for run in children:
if "run_algorithm" in run.properties and "score" in run.properties:
summary_df[run.id] = [
run.id,
run.properties["run_algorithm"],
run.properties["primary_metric"],
float(run.properties["score"]),
]
if "goal" in run.properties:
goal_minimize = run.properties["goal"].split("_")[-1] == "min"
summary_df = summary_df.T.sort_values(
"Score", ascending=goal_minimize
).drop_duplicates(["run_algorithm"])
summary_df = summary_df.set_index("run_algorithm")
return summary_df
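
The `get_result_df` helper above sorts child runs by score (ascending when the goal is to minimize, descending otherwise) and keeps one row per algorithm. The same selection can be sketched without pandas or the AzureML SDK. In this sketch, `best_run_per_algorithm` and the `(run_id, properties)` pairs are hypothetical stand-ins for the real run objects; only the property names `run_algorithm`, `score`, and `goal` come from the helper.

```python
def best_run_per_algorithm(runs):
    """Keep the best-scoring run for each algorithm.

    `runs` is a list of (run_id, properties) pairs, a hypothetical stand-in
    for the child runs that the helper reads from the AzureML SDK.
    """
    goal_minimize = False
    rows = []
    for run_id, props in runs:
        if "run_algorithm" in props and "score" in props:
            rows.append((run_id, props["run_algorithm"], float(props["score"])))
            if "goal" in props:
                # A goal ending in "_min" means lower scores are better.
                goal_minimize = props["goal"].split("_")[-1] == "min"

    # Best score first: ascending when minimizing, descending otherwise.
    rows.sort(key=lambda row: row[2], reverse=not goal_minimize)

    best = {}
    for run_id, algorithm, score in rows:
        best.setdefault(algorithm, (run_id, score))  # first seen is the best
    return best


# Made-up sample data, for illustration only.
sample_runs = [
    ("run_1", {"run_algorithm": "LogisticRegression", "score": "0.81", "goal": "accuracy_max"}),
    ("run_2", {"run_algorithm": "LogisticRegression", "score": "0.86", "goal": "accuracy_max"}),
    ("run_3", {"run_algorithm": "RandomForest", "score": "0.84", "goal": "accuracy_max"}),
    ("run_4", {"score": "0.99"}),  # skipped: no run_algorithm property
]
best = best_run_per_algorithm(sample_runs)
```

Because the rows are sorted best-first, the notebook's `summary_df["run_id"].iloc[0]` corresponds to the first entry of this dictionary.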

View File

@@ -1,70 +0,0 @@
import argparse
import pandas as pd
import numpy as np
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from azureml.automl.runtime.shared.score import scoring, constants
from azureml.core import Run, Dataset
from azureml.core.model import Model
parser = argparse.ArgumentParser()
parser.add_argument(
"--target_column_name",
type=str,
dest="target_column_name",
help="Target Column Name",
)
parser.add_argument(
"--model_name", type=str, dest="model_name", help="Name of registered model"
)
parser.add_argument("--input-data", type=str, dest="input_data", help="Dataset")
args = parser.parse_args()
target_column_name = args.target_column_name
model_name = args.model_name
print("args passed are: ")
print("Target column name: ", target_column_name)
print("Name of registered model: ", model_name)
model_path = Model.get_model_path(model_name)
# deserialize the model file back into a sklearn model
model = joblib.load(model_path)
run = Run.get_context()
test_dataset = Dataset.get_by_id(run.experiment.workspace, id=args.input_data)
X_test_df = test_dataset.drop_columns(
columns=[target_column_name]
).to_pandas_dataframe()
y_test_df = (
test_dataset.with_timestamp_columns(None)
.keep_columns(columns=[target_column_name])
.to_pandas_dataframe()
)
predicted = model.predict_proba(X_test_df)
if isinstance(predicted, pd.DataFrame):
predicted = predicted.values
# Use the AutoML scoring module
train_labels = model.classes_
class_labels = np.unique(
np.concatenate((y_test_df.values, np.reshape(train_labels, (-1, 1))))
)
classification_metrics = list(constants.CLASSIFICATION_SCALAR_SET)
scores = scoring.score_classification(
y_test_df.values, predicted, classification_metrics, class_labels, train_labels
)
print("scores:")
print(scores)
for key, value in scores.items():
run.log(key, value)

View File

@@ -1,5 +1,21 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/continuous-retraining/auto-ml-continuous-retraining.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -564,9 +580,9 @@
}
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {

View File

@@ -1,4 +0,0 @@
name: auto-ml-continuous-retraining
dependencies:
- pip:
- azureml-sdk

View File

@@ -31,12 +31,15 @@ try:
model = Model(ws, args.model_name)
last_train_time = model.created_time
print("Model was last trained on {0}.".format(last_train_time))
except Exception as e:
except Exception:
print("Could not get last model train time.")
last_train_time = datetime.min.replace(tzinfo=pytz.UTC)
train_ds = Dataset.get_by_name(ws, args.ds_name)
dataset_changed_time = train_ds.data_changed_time
dataset_changed_time = train_ds.data_changed_time.replace(tzinfo=pytz.UTC)
print("dataset_changed_time=" + str(dataset_changed_time))
print("last_train_time=" + str(last_train_time))
if not dataset_changed_time > last_train_time:
print("Cancelling run since there is no new data.")
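
The hunk above attaches UTC to `data_changed_time` before comparing it with `last_train_time`, presumably because one value is timezone-naive and the other timezone-aware. The sketch below reproduces that situation with the standard library only. The timestamp values are made up, and `datetime.timezone.utc` is used here in place of `pytz.UTC` so the example has no third-party dependency.

```python
from datetime import datetime, timezone

# Stand-in values (assumptions for illustration): a naive timestamp, as a
# dataset's data_changed_time might be, and a timezone-aware last-train time.
dataset_changed_time = datetime(2024, 12, 10, 17, 34, 41)
last_train_time = datetime.min.replace(tzinfo=timezone.utc)

# Ordering a naive datetime against an aware one raises TypeError.
try:
    dataset_changed_time > last_train_time
    comparison_failed = False
except TypeError:
    comparison_failed = True

# Attaching UTC to the naive value makes the two comparable.
dataset_changed_time = dataset_changed_time.replace(tzinfo=timezone.utc)
has_new_data = dataset_changed_time > last_train_time
```

Note that `replace(tzinfo=...)` only labels the value as UTC without shifting it, so it is appropriate only when the naive timestamp is already expressed in UTC.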

View File

@@ -9,7 +9,7 @@ To run these notebook on your own notebook server, use these installation instru
The instructions below will install everything you need and then start a Jupyter notebook.
If you would like to use a lighter-weight version of the client that does not install all of the machine learning libraries locally, you can leverage the [experimental notebooks.](experimental/README.md)
### 1. Install mini-conda from [here](https://conda.io/miniconda.html), choose 64-bit Python 3.7 or higher.
### 1. Install mini-conda from [here](https://conda.io/miniconda.html), choose 64-bit Python 3.8 or higher.
- **Note**: if you already have conda installed, you can keep using it but it should be version 4.4.10 or later (as shown by: conda -V). If you have a previous version installed, you can update it using the command: conda update conda.
There's no need to install mini-conda specifically.

View File

@@ -97,7 +97,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.44.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.59.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
@@ -148,7 +148,7 @@
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# Choose a name for your CPU cluster\n",
"cpu_cluster_name = \"cpu-cluster\"\n",
"cpu_cluster_name = \"cpu-codegen\"\n",
"\n",
"# Verify that cluster does not exist already\n",
"try:\n",
@@ -324,9 +324,9 @@
"hash": "adb464b67752e4577e3dc163235ced27038d19b7d88def00d75d1975bde5d9ab"
},
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {

View File

@@ -1,4 +0,0 @@
name: codegen-for-autofeaturization
dependencies:
- pip:
- azureml-sdk

View File

@@ -97,7 +97,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.44.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.59.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
@@ -454,10 +454,13 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** Not all datasets produce a y_transformer. The dataset used in the current notebook requires a transformer as the y column data is categorical."
"**Note:** Not all datasets produce a y_transformer. The dataset used in the current notebook requires a transformer as the y column data is categorical. \n",
"\n",
"We will go ahead and download the mlflow transformer model and use it to transform test data that can be used for further experimentation below. To run the commented code, make sure the environment requirement is satisfied. You can go ahead and create the environment from the `conda.yaml` file under `/outputs/featurization/pipeline/` and run the given code in it."
]
},
{
@@ -466,7 +469,7 @@
"metadata": {},
"outputs": [],
"source": [
"from azureml.automl.core.shared.constants import Transformers\n",
"''' from azureml.automl.core.shared.constants import Transformers\n",
"\n",
"transformers = mlflow.sklearn.load_model(uri) # Using method 1\n",
"data_transformers = transformers.get_transformers()\n",
@@ -474,14 +477,15 @@
"y_transformer = data_transformers[Transformers.Y_TRANSFORMER]\n",
"\n",
"X_test = x_transformer.transform(X_test_data)\n",
"y_test = y_transformer.transform(y_test_data)"
"y_test = y_transformer.transform(y_test_data) '''"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following cell to see the featurization summary of X and y transformers. "
"Run the following cell to see the featurization summary of X and y transformers. Uncomment to use. "
]
},
{
@@ -490,10 +494,10 @@
"metadata": {},
"outputs": [],
"source": [
"X_data_summary = x_transformer.get_featurization_summary(is_user_friendly=False)\n",
"''' X_data_summary = x_transformer.get_featurization_summary(is_user_friendly=False)\n",
"\n",
"summary_df = pd.DataFrame.from_records(X_data_summary)\n",
"summary_df"
"summary_df '''"
]
},
{
@@ -544,10 +548,11 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Another way to load the data is to go to the above autofeaturization experiment and check for the featurized dataset ids under `Output datasets`. Uncomment and replace them accordingly below to use."
"Another way to load the data is to go to the above autofeaturization experiment and check for the featurized dataset ids under `Output datasets`. Uncomment and replace them accordingly below, to use."
]
},
{
@@ -597,10 +602,20 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we are passing our training data to the lightgbm classifier, any custom model can be used with your data."
"Here we are passing our training data to the lightgbm classifier, any custom model can be used with your data. Let us first install lightgbm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install lightgbm"
]
},
{
@@ -612,11 +627,27 @@
"import lightgbm as lgb\n",
"\n",
"model = lgb.LGBMClassifier(learning_rate=0.08,max_depth=-5,random_state=42)\n",
"model.fit(X_train, y_train, sample_weight=sample_weight, eval_set=[(X_test, y_test),(X_train, y_train)],\n",
" verbose=20,eval_metric='logloss')\n",
"\n",
"model.fit(X_train, y_train, sample_weight=sample_weight)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Once training is done, the test data obtained after transforming from the above downloaded transformer can be used to calculate the accuracy "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Training accuracy {:.4f}'.format(model.score(X_train, y_train)))\n",
"print('Testing accuracy {:.4f}'.format(model.score(X_test, y_test)))"
"\n",
"# Uncomment below to test the model on test data \n",
"# print('Testing accuracy {:.4f}'.format(model.score(X_test, y_test)))"
]
},
{
@@ -654,45 +685,8 @@
"metadata": {},
"outputs": [],
"source": [
"y_pred = model.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculate metrics for the prediction\n",
"\n",
"Now visualize the data on a scatter plot to show what our truth (actual) values are compared to the predicted values \n",
"from the trained model that was returned."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"from matplotlib import pyplot as plt\n",
"import numpy as np\n",
"import itertools\n",
"\n",
"cf =confusion_matrix(y_test,y_pred)\n",
"plt.imshow(cf,cmap=plt.cm.Blues,interpolation='nearest')\n",
"plt.colorbar()\n",
"plt.title('Confusion Matrix')\n",
"plt.xlabel('Predicted')\n",
"plt.ylabel('Actual')\n",
"class_labels = ['False','True']\n",
"tick_marks = np.arange(len(class_labels))\n",
"plt.xticks(tick_marks,class_labels)\n",
"plt.yticks([-0.5,0,1,1.5],['','False','True',''])\n",
"# plotting text value inside cells\n",
"thresh = cf.max() / 2.\n",
"for i,j in itertools.product(range(cf.shape[0]),range(cf.shape[1])):\n",
" plt.text(j,i,format(cf[i,j],'d'),horizontalalignment='center',color='white' if cf[i,j] >thresh else 'black')\n",
"plt.show()"
"# Uncomment below to test the model on test data\n",
"# y_pred = model.predict(X_test)"
]
},
{
@@ -713,9 +707,9 @@
"hash": "adb464b67752e4577e3dc163235ced27038d19b7d88def00d75d1975bde5d9ab"
},
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {

View File

@@ -1,4 +0,0 @@
name: custom-model-training-from-autofeaturization-run
dependencies:
- pip:
- azureml-sdk

View File

@@ -1,21 +1,15 @@
name: azure_automl_experimental
dependencies:
# The python interpreter version.
# Currently Azure ML only supports 3.6.0 and later.
- pip<=20.2.4
- python>=3.6.0,<3.9
- cython==0.29.14
- urllib3==1.26.7
- PyJWT < 2.0.0
- numpy==1.21.6
- pywin32==227
- cryptography<37.0.0
# Currently Azure ML only supports 3.7.0 and later.
- pip<=22.3.1
- python>=3.7.0,<3.11
- pip:
# Required packages for AzureML execution, history, and data preparation.
- azure-core==1.24.1
- azure-identity==1.7.0
- azureml-defaults
- azureml-sdk
- azureml-widgets
- azureml-mlflow
- pandas
- mlflow

View File

@@ -4,14 +4,13 @@ channels:
- main
dependencies:
# The python interpreter version.
# Currently Azure ML only supports 3.6.0 and later.
# Currently Azure ML only supports 3.7.0 and later.
- pip<=20.2.4
- nomkl
- python>=3.6.0,<3.9
- python>=3.7.0,<3.11
- urllib3==1.26.7
- PyJWT < 2.0.0
- numpy>=1.21.6,<=1.22.3
- cryptography<37.0.0
- pip:
# Required packages for AzureML execution, history, and data preparation.
@@ -20,4 +19,6 @@ dependencies:
- azureml-defaults
- azureml-sdk
- azureml-widgets
- azureml-mlflow
- pandas
- mlflow

View File

@@ -1,420 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/experimental/classification-credit-card-fraud/auto-ml-classification-credit-card-fraud.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automated Machine Learning\n",
"_**Classification of credit card fraudulent transactions on local managed compute **_\n",
"\n",
"## Contents\n",
"1. [Introduction](#Introduction)\n",
"1. [Setup](#Setup)\n",
"1. [Train](#Train)\n",
"1. [Results](#Results)\n",
"1. [Test](#Test)\n",
"1. [Acknowledgements](#Acknowledgements)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"In this example we use the associated credit card dataset to showcase how you can use AutoML for a simple classification problem. The goal is to predict if a credit card transaction is considered a fraudulent charge.\n",
"\n",
"This notebook is using local managed compute to train the model.\n",
"\n",
"If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the [configuration](../../../configuration.ipynb) notebook first if you haven't already to establish your connection to the AzureML Workspace. \n",
"\n",
"In this notebook you will learn how to:\n",
"1. Create an experiment using an existing workspace.\n",
"2. Configure AutoML using `AutoMLConfig`.\n",
"3. Train the model using local managed compute.\n",
"4. Explore the results.\n",
"5. Test the fitted model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"\n",
"import pandas as pd\n",
"\n",
"import azureml.core\n",
"from azureml.core.compute_target import LocalTarget\n",
"from azureml.core.experiment import Experiment\n",
"from azureml.core.workspace import Workspace\n",
"from azureml.core.dataset import Dataset\n",
"from azureml.train.automl import AutoMLConfig"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This sample notebook may use features that are not available in previous versions of the Azure ML SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.44.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# choose a name for experiment\n",
"experiment_name = 'automl-local-managed'\n",
"\n",
"experiment=Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Experiment Name'] = experiment.name\n",
"pd.set_option('display.max_colwidth', None)\n",
"outputDf = pd.DataFrame(data = output, index = [''])\n",
"outputDf.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Determine if local docker is configured for Linux images\n",
"\n",
"Local managed runs will leverage a Linux docker container to submit the run to. Due to this, the docker needs to be configured to use Linux containers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check if Docker is installed and Linux containers are enabled\n",
"import subprocess\n",
"from subprocess import CalledProcessError\n",
"try:\n",
" assert subprocess.run(\"docker -v\", shell=True).returncode == 0, 'Local Managed runs require docker to be installed.'\n",
" out = subprocess.check_output(\"docker system info\", shell=True).decode('ascii')\n",
" assert \"OSType: linux\" in out, 'Docker engine needs to be configured to use Linux containers.' \\\n",
" 'https://docs.docker.com/docker-for-windows/#switch-between-windows-and-linux-containers'\n",
"except CalledProcessError as ex:\n",
" raise Exception('Local Managed runs require docker to be installed.') from ex"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Data\n",
"\n",
"Load the credit card dataset from a csv file containing both training features and labels. The features are inputs to the model, while the training labels represent the expected output of the model. Next, we'll split the data using random_split and extract the training data for the model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = \"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard.csv\"\n",
"dataset = Dataset.Tabular.from_delimited_files(data)\n",
"training_data, validation_data = dataset.random_split(percentage=0.8, seed=223)\n",
"label_column_name = 'Class'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train\n",
"\n",
"Instantiate a AutoMLConfig object. This defines the settings and data used to run the experiment.\n",
"\n",
"|Property|Description|\n",
"|-|-|\n",
"|**task**|classification or regression|\n",
"|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|\n",
"|**enable_early_stopping**|Stop the run if the metric score is not showing improvement.|\n",
"|**n_cross_validations**|Number of cross validation splits.|\n",
"|**training_data**|Input dataset, containing both features and label column.|\n",
"|**label_column_name**|The name of the label column.|\n",
"|**enable_local_managed**|Enable the experimental local-managed scenario.|\n",
"\n",
"**_You can find more information about primary metrics_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#primary-metric)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_settings = {\n",
" \"n_cross_validations\": 3,\n",
" \"primary_metric\": 'average_precision_score_weighted',\n",
" \"enable_early_stopping\": True,\n",
" \"experiment_timeout_hours\": 0.3, #for real scenarios we recommend a timeout of at least one hour \n",
" \"verbosity\": logging.INFO,\n",
"}\n",
"\n",
"automl_config = AutoMLConfig(task = 'classification',\n",
" debug_log = 'automl_errors.log',\n",
" compute_target = LocalTarget(),\n",
" enable_local_managed = True,\n",
" training_data = training_data,\n",
" label_column_name = label_column_name,\n",
" **automl_settings\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Call the `submit` method on the experiment object and pass the run configuration. Depending on the data and the number of iterations this can run for a while. Validation errors and current status will be shown when setting `show_output=True` and the execution will be synchronous."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"parent_run = experiment.submit(automl_config, show_output = True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If you need to retrieve a run that already started, use the following code\n",
"#from azureml.train.automl.run import AutoMLRun\n",
"#parent_run = AutoMLRun(experiment = experiment, run_id = '<replace with your run id>')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"parent_run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Explain model\n",
"\n",
"Automated ML models can be explained and visualized using the SDK Explainability library. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyze results\n",
"\n",
"### Retrieve the Best Child Run\n",
"\n",
"Below we select the best pipeline from our iterations. The `get_best_child` method returns the best run. Overloads on `get_best_child` allow you to retrieve the best run for *any* logged metric."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run = parent_run.get_best_child()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test the fitted model\n",
"\n",
"Now that the model is trained, split the data in the same way the data was split for training (The difference here is the data is being split locally) and then run the test data through the trained model to get the predicted values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_test_df = validation_data.drop_columns(columns=[label_column_name])\n",
"y_test_df = validation_data.keep_columns(columns=[label_column_name], validate=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Creating ModelProxy for submitting prediction runs to the training environment.\n",
"We will create a ModelProxy for the best child run, which will allow us to submit a run that does the prediction in the training environment. Unlike the local client, which can have different versions of some libraries, the training environment will have all the compatible libraries for the model already."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.train.automl.model_proxy import ModelProxy\n",
"best_model_proxy = ModelProxy(best_run)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# call the predict functions on the model proxy\n",
"y_pred = best_model_proxy.predict(X_test_df).to_pandas_dataframe()\n",
"y_pred"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Acknowledgements"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This Credit Card fraud Detection dataset is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/ and is available at: https://www.kaggle.com/mlg-ulb/creditcardfraud\n",
"\n",
"\n",
"The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Universit\u00c3\u0192\u00c2\u00a9 Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project\n",
"Please cite the following works: \n",
"\u00c3\u00a2\u00e2\u201a\u00ac\u00c2\u00a2\tAndrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015\n",
"\u00c3\u00a2\u00e2\u201a\u00ac\u00c2\u00a2\tDal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon\n",
"\u00c3\u00a2\u00e2\u201a\u00ac\u00c2\u00a2\tDal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE\n",
"o\tDal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)\n",
"\u00c3\u00a2\u00e2\u201a\u00ac\u00c2\u00a2\tCarcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-A\u00c3\u0192\u00c2\u00abl; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier\n",
"\u00c3\u00a2\u00e2\u201a\u00ac\u00c2\u00a2\tCarcillo, Fabrizio; Le Borgne, Yann-A\u00c3\u0192\u00c2\u00abl; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing"
]
}
],
"metadata": {
"authors": [
{
"name": "sekrupa"
}
],
"category": "tutorial",
"compute": [
"AML Compute"
],
"datasets": [
"Creditcard"
],
"deployment": [
"None"
],
"exclude_from_index": false,
"file_extension": ".py",
"framework": [
"None"
],
"friendly_name": "Classification of credit card fraudulent transactions using Automated ML",
"index_order": 5,
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"tags": [
"AutomatedML"
],
"task": "Classification",
"version": "3.6.7"
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,4 +0,0 @@
name: auto-ml-classification-credit-card-fraud-local-managed
dependencies:
- pip:
- azureml-sdk

View File

@@ -91,7 +91,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.44.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.59.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
@@ -448,9 +448,9 @@
"automated-machine-learning"
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {

View File

@@ -1,4 +0,0 @@
name: auto-ml-regression-model-proxy
dependencies:
- pip:
- azureml-sdk

View File

@@ -122,7 +122,10 @@ def calculate_scores_and_build_plots(
input_dir: str, output_dir: str, automl_settings: Dict[str, Any]
):
os.makedirs(output_dir, exist_ok=True)
grains = automl_settings.get(constants.TimeSeries.TIME_SERIES_ID_COLUMN_NAMES)
grains = automl_settings.get(
constants.TimeSeries.TIME_SERIES_ID_COLUMN_NAMES,
automl_settings.get(constants.TimeSeries.GRAIN_COLUMN_NAMES, None),
)
time_column_name = automl_settings.get(constants.TimeSeries.TIME_COLUMN_NAME)
if grains is None:
grains = []
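
Note on the hunk above: the new lookup prefers the newer `time_series_id_column_names` setting and falls back to the legacy `grain_column_names` setting only when the newer key is absent. A minimal sketch of that behaviour follows, assuming the two key strings shown here (the real values live in `constants.TimeSeries` and are not visible in this diff) and using a hypothetical helper name `resolve_grains`:

```python
from typing import Any, Dict, List, Optional

# Assumed key strings; the actual values come from constants.TimeSeries.
TIME_SERIES_ID_COLUMN_NAMES = "time_series_id_column_names"
GRAIN_COLUMN_NAMES = "grain_column_names"


def resolve_grains(automl_settings: Dict[str, Any]) -> List[str]:
    """Hypothetical helper mirroring the nested dict.get fallback in the hunk."""
    grains: Optional[List[str]] = automl_settings.get(
        TIME_SERIES_ID_COLUMN_NAMES,
        automl_settings.get(GRAIN_COLUMN_NAMES, None),
    )
    # As in the original function, a missing value becomes an empty list.
    if grains is None:
        grains = []
    return grains


if __name__ == "__main__":
    print(resolve_grains({GRAIN_COLUMN_NAMES: ["store"]}))
    print(resolve_grains({}))
```

One consequence of using nested `dict.get` calls: if the newer key is present but explicitly set to `None`, the legacy key is not consulted and the result is an empty list.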

View File

@@ -13,7 +13,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-hierarchical-timeseries/auto-ml-forecasting-hierarchical-timeseries.png)"
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-backtest-many-models/auto-ml-forecasting-backtest-many-models.png)"
]
},
{
@@ -33,6 +33,7 @@
"For this notebook we are using a synthetic dataset to demonstrate the back testing in many model scenario. This allows us to check historical performance of AutoML on a historical data. To do that we step back on the backtesting period by the data set several times and split the data to train and test sets. Then these data sets are used for training and evaluation of model.<br>\n",
"\n",
"Thus, it is a quick way of evaluating AutoML as if it was in production. Here, we do not test historical performance of a particular model, for this see the [notebook](../forecasting-backtest-single-model/auto-ml-forecasting-backtest-single-model.ipynb). Instead, the best model for every backtest iteration can be different since AutoML chooses the best model for a given training set.\n",
"\n",
"![Backtesting](Backtesting.png)\n",
"\n",
"**NOTE: There are limits on how many runs we can do in parallel per workspace, and we currently recommend to set the parallelism to maximum of 320 runs per experiment per workspace. If users want to have more parallelism and increase this limit they might encounter Too Many Requests errors (HTTP 429).**"
@@ -43,7 +44,7 @@
"metadata": {},
"source": [
"### Prerequisites\n",
"You'll need to create a compute Instance by following the instructions in the [EnvironmentSetup.md](../Setup_Resources/EnvironmentSetup.md)."
"You'll need to create a compute Instance by following [these](https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-create-manage-compute-instance?tabs=python) instructions."
]
},
{
@@ -313,21 +314,37 @@
"source": [
"### Set up training parameters\n",
"\n",
"This dictionary defines the AutoML and many models settings. For this forecasting task we need to define several settings including the name of the time column, the maximum forecast horizon, and the partition column name definition. Please note, that in this case we are setting grain_column_names to be the time series ID column plus iteration, because we want to train a separate model for each time series and iteration.\n",
"We need to provide ``ForecastingParameters``, ``AutoMLConfig`` and ``ManyModelsTrainParameters`` objects. For the forecasting task we also need to define several settings including the name of the time column, the maximum forecast horizon, and the partition column name(s) definition.\n",
"\n",
"#### ``ForecastingParameters`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
"| **time_column_name** | The name of your time column. |\n",
"| **time_series_id_column_names** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
"| **cv_step_size** | Number of periods between two consecutive cross-validation folds. The default value is \\\"auto\\\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value. |\n",
"\n",
"#### ``AutoMLConfig`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **task** | forecasting |\n",
"| **primary_metric** | This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>normalized_root_mean_squared_error</i><br><i>normalized_mean_absolute_error</i> |\n",
"| **primary_metric** | This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i> |\n",
"| **blocked_models** | Blocked models won't be used by AutoML. |\n",
"| **iteration_timeout_minutes** | Maximum amount of time in minutes that the model can train. This is optional but provides customers with greater control on exit criteria. |\n",
"| **iterations** | Number of models to train. This is optional but provides customers with greater control on exit criteria. |\n",
"| **experiment_timeout_hours** | Maximum amount of time in hours that the experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |\n",
"| **experiment_timeout_hours** | Maximum amount of time in hours that each experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. **It does not control the overall timeout for the pipeline run, instead controls the timeout for each training run per partitioned time series.** |\n",
"| **label_column_name** | The name of the label column. |\n",
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
"| **n_cross_validations** | Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"| **time_column_name** | The name of your time column. |\n",
"| **time_series_id_column_names** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
"| **n_cross_validations** | Number of cross validation splits. The default value is \\\"auto\\\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"| **enable_early_stopping** | Flag to enable early termination if the primary metric is no longer improving. |\n",
"| **enable_engineered_explanations** | Engineered feature explanations will be downloaded if enable_engineered_explanations flag is set to True. By default it is set to False to save storage space. |\n",
"| **track_child_runs** | Flag to disable tracking of child runs. Only best run is tracked if the flag is set to False (this includes the model and metrics of the run). |\n",
"| **pipeline_fetch_max_batch_size** | Determines how many pipelines (training algorithms) to fetch at a time for training, this helps reduce throttling when training at large scale. |\n",
"\n",
"\n",
"#### ``ManyModelsTrainParameters`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **automl_settings** | The ``AutoMLConfig`` object defined above. |\n",
"| **partition_column_names** | The names of columns used to group your models. For timeseries, the groups must not split up individual time-series. That is, each group must contain one or more whole time-series. |"
]
},
@@ -344,21 +361,30 @@
"from azureml.train.automl.runtime._many_models.many_models_parameters import (\n",
" ManyModelsTrainParameters,\n",
")\n",
"from azureml.automl.core.forecasting_parameters import ForecastingParameters\n",
"from azureml.train.automl.automlconfig import AutoMLConfig\n",
"\n",
"partition_column_names = [TIME_SERIES_ID_COLNAME, \"backtest_iteration\"]\n",
"automl_settings = {\n",
" \"task\": \"forecasting\",\n",
" \"primary_metric\": \"normalized_root_mean_squared_error\",\n",
" \"iteration_timeout_minutes\": 10, # This needs to be changed based on the dataset. We ask customer to explore how long training is taking before settings this value\n",
" \"iterations\": 15,\n",
" \"experiment_timeout_hours\": 0.25, # This also needs to be changed based on the dataset. For larger data set this number needs to be bigger.\n",
" \"label_column_name\": TARGET_COLNAME,\n",
" \"n_cross_validations\": 3,\n",
" \"time_column_name\": TIME_COLNAME,\n",
" \"forecast_horizon\": 6,\n",
" \"time_series_id_column_names\": partition_column_names,\n",
" \"track_child_runs\": False,\n",
"}\n",
"\n",
"forecasting_parameters = ForecastingParameters(\n",
" time_column_name=TIME_COLNAME,\n",
" forecast_horizon=6,\n",
" time_series_id_column_names=partition_column_names,\n",
" cv_step_size=\"auto\",\n",
")\n",
"\n",
"automl_settings = AutoMLConfig(\n",
" task=\"forecasting\",\n",
" primary_metric=\"normalized_root_mean_squared_error\",\n",
" iteration_timeout_minutes=10,\n",
" iterations=15,\n",
" experiment_timeout_hours=0.25,\n",
" label_column_name=TARGET_COLNAME,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" track_child_runs=False,\n",
" forecasting_parameters=forecasting_parameters,\n",
")\n",
"\n",
"\n",
"mm_paramters = ManyModelsTrainParameters(\n",
" automl_settings=automl_settings, partition_column_names=partition_column_names\n",
@@ -385,8 +411,16 @@
"| **node_count** | The number of compute nodes to be used for running the user script. We recommend to start with 3 and increase the node_count if the training time is taking too long. |\n",
"| **process_count_per_node** | Process count per node, we recommend 2:1 ratio for number of cores: number of processes per node. eg. If node has 16 cores then configure 8 or less process count per node or optimal performance. |\n",
"| **train_pipeline_parameters** | The set of configuration parameters defined in the previous section. |\n",
"| **run_invocation_timeout** | Maximum amount of time in seconds that the ``ParallelRunStep`` class is allowed. This is optional but provides customers with greater control on exit criteria. This must be greater than ``experiment_timeout_hours`` by at least 300 seconds. |\n",
"\n",
"Calling this method will create a new aggregated dataset which is generated dynamically on pipeline execution."
"Calling this method will create a new aggregated dataset which is generated dynamically on pipeline execution.\n",
"\n",
"**Note**: Total time taken for the **training step** in the pipeline to complete = $ \\frac{t}{ p \\times n } \\times ts $\n",
"where,\n",
"- $ t $ is time taken for training one partition (can be viewed in the training logs)\n",
"- $ p $ is ``process_count_per_node``\n",
"- $ n $ is ``node_count``\n",
"- $ ts $ is total number of partitions in time series based on ``partition_column_names``"
]
},
{
@@ -404,7 +438,7 @@
" compute_target=compute_target,\n",
" node_count=2,\n",
" process_count_per_node=2,\n",
" run_invocation_timeout=920,\n",
" run_invocation_timeout=1200,\n",
" train_pipeline_parameters=mm_paramters,\n",
")"
]
@@ -489,25 +523,31 @@
"source": [
"For many models we need to provide the ManyModelsInferenceParameters object.\n",
"\n",
"#### ManyModelsInferenceParameters arguments\n",
"#### ``ManyModelsInferenceParameters`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **partition_column_names** | List of column names that identifies groups. |\n",
"| **target_column_name** | \\[Optional\\] Column name only if the inference dataset has the target. |\n",
"| **time_column_name** | Column name only if it is timeseries. |\n",
"| **many_models_run_id** | \\[Optional\\] Many models pipeline run id where models were trained. |\n",
"| **partition_column_names** | List of column names that identifies groups. |\n",
"| **target_column_name** | \\[Optional] Column name only if the inference dataset has the target. |\n",
"| **time_column_name** | \\[Optional] Time column name only if it is timeseries. |\n",
"| **inference_type** | \\[Optional] Which inference method to use on the model. Possible values are 'forecast', 'predict_proba', and 'predict'. |\n",
"| **forecast_mode** | \\[Optional] The type of forecast to be used, either 'rolling' or 'recursive'; defaults to 'recursive'. |\n",
"| **step** | \\[Optional] Number of periods to advance the forecasting window in each iteration **(for rolling forecast only)**; defaults to 1. |\n",
"\n",
"#### get_many_models_batch_inference_steps arguments\n",
"#### ``get_many_models_batch_inference_steps`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **experiment** | The experiment used for inference run. |\n",
"| **inference_data** | The data to use for inferencing. It should be the same schema as used for training.\n",
"| **compute_target** | The compute target that runs the inference pipeline.|\n",
"| **compute_target** | The compute target that runs the inference pipeline. |\n",
"| **node_count** | The number of compute nodes to be used for running the user script. We recommend to start with the number of cores per node (varies by compute sku). |\n",
"| **process_count_per_node** | The number of processes per node.\n",
"| **train_run_id** | \\[Optional\\] The run id of the hierarchy training, by default it is the latest successful training many model run in the experiment. |\n",
"| **train_experiment_name** | \\[Optional\\] The train experiment that contains the train pipeline. This one is only needed when the train pipeline is not in the same experiement as the inference pipeline. |\n",
"| **process_count_per_node** | \\[Optional\\] The number of processes per node, by default it's 4. |"
"| **process_count_per_node** | \\[Optional] The number of processes per node. By default it's 2 (should be at most half of the number of cores in a single node of the compute cluster that will be used for the experiment).\n",
"| **inference_pipeline_parameters** | \\[Optional] The ``ManyModelsInferenceParameters`` object defined above. |\n",
"| **append_row_file_name** | \\[Optional] The name of the output file (optional, default value is 'parallel_run_step.txt'). Supports 'txt' and 'csv' file extension. A 'txt' file extension generates the output in 'txt' format with space as separator without column names. A 'csv' file extension generates the output in 'csv' format with comma as separator and with column names. |\n",
"| **train_run_id** | \\[Optional] The run id of the **training pipeline**. By default it is the latest successful training pipeline run in the experiment. |\n",
"| **train_experiment_name** | \\[Optional] The train experiment that contains the train pipeline. This one is only needed when the train pipeline is not in the same experiement as the inference pipeline. |\n",
"| **run_invocation_timeout** | \\[Optional] Maximum amount of time in seconds that the ``ParallelRunStep`` class is allowed. This is optional but provides customers with greater control on exit criteria. |\n",
"| **output_datastore** | \\[Optional] The ``Datastore`` or ``OutputDatasetConfig`` to be used for output. If specified any pipeline output will be written to that location. If unspecified the default datastore will be used. |\n",
"| **arguments** | \\[Optional] Arguments to be passed to inference script. Possible argument is '--forecast_quantiles' followed by quantile values. |"
]
},
{
@@ -527,6 +567,8 @@
" target_column_name=TARGET_COLNAME,\n",
")\n",
"\n",
"output_file_name = \"parallel_run_step.csv\"\n",
"\n",
"inference_steps = AutoMLPipelineBuilder.get_many_models_batch_inference_steps(\n",
" experiment=experiment,\n",
" inference_data=test_data,\n",
@@ -538,6 +580,7 @@
" train_run_id=training_run.id,\n",
" train_experiment_name=training_run.experiment.name,\n",
" inference_pipeline_parameters=mm_parameters,\n",
" append_row_file_name=output_file_name,\n",
")"
]
},
@@ -585,18 +628,21 @@
"source": [
"from azureml.contrib.automl.pipeline.steps.utilities import get_output_from_mm_pipeline\n",
"\n",
"PREDICTION_COLNAME = \"Predictions\"\n",
"forecasting_results_name = \"forecasting_results\"\n",
"forecasting_output_name = \"many_models_inference_output\"\n",
"forecast_file = get_output_from_mm_pipeline(\n",
" inference_run, forecasting_results_name, forecasting_output_name\n",
" inference_run, forecasting_results_name, forecasting_output_name, output_file_name\n",
")\n",
"df = pd.read_csv(forecast_file, delimiter=\" \", header=None, parse_dates=[0])\n",
"df.columns = list(X_train.columns) + [\"predicted_level\"]\n",
"df = pd.read_csv(forecast_file, parse_dates=[0])\n",
"print(\n",
" \"Prediction has \", df.shape[0], \" rows. Here the first 10 rows are being displayed.\"\n",
")\n",
"# Save the scv file with header to read it in the next step.\n",
"df.rename(columns={TARGET_COLNAME: \"actual_level\"}, inplace=True)\n",
"# Save the csv file to read it in the next step.\n",
"df.rename(\n",
" columns={TARGET_COLNAME: \"actual_level\", PREDICTION_COLNAME: \"predicted_level\"},\n",
" inplace=True,\n",
")\n",
"df.to_csv(os.path.join(forecasting_results_name, \"forecast.csv\"), index=False)\n",
"df.head(10)"
]
@@ -620,7 +666,9 @@
"backtesting_results = \"backtesting_mm_results\"\n",
"os.makedirs(backtesting_results, exist_ok=True)\n",
"calculate_scores_and_build_plots(\n",
" forecasting_results_name, backtesting_results, automl_settings\n",
" forecasting_results_name,\n",
" backtesting_results,\n",
" automl_settings.as_serializable_dict(),\n",
")\n",
"pd.DataFrame({\"File\": os.listdir(backtesting_results)})"
]
@@ -704,9 +752,9 @@
"automated-machine-learning"
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
@@ -718,7 +766,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.5"
},
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,

View File

@@ -1,4 +0,0 @@
name: auto-ml-forecasting-backtest-many-models
dependencies:
- pip:
- azureml-sdk

View File

@@ -43,11 +43,20 @@ def init():
global output_dir
global automl_settings
global model_uid
global forecast_quantiles
logger.info("Initialization of the run.")
parser = argparse.ArgumentParser("Parsing input arguments.")
parser.add_argument("--output-dir", dest="out", required=True)
parser.add_argument("--model-name", dest="model", default=None)
parser.add_argument("--model-uid", dest="model_uid", default=None)
parser.add_argument(
"--forecast_quantiles",
nargs="*",
type=float,
help="forecast quantiles list",
default=None,
)
parsed_args, _ = parser.parse_known_args()
model_name = parsed_args.model
@@ -55,6 +64,7 @@ def init():
target_column_name = automl_settings.get("label_column_name")
output_dir = parsed_args.out
model_uid = parsed_args.model_uid
forecast_quantiles = parsed_args.forecast_quantiles
os.makedirs(output_dir, exist_ok=True)
os.environ["AUTOML_IGNORE_PACKAGE_VERSION_INCOMPATIBILITIES".lower()] = "True"
@@ -126,23 +136,18 @@ def run_backtest(data_input_name: str, file_name: str, experiment: Experiment):
)
print(f"The model {best_run.properties['model_name']} was registered.")
_, x_pred = fitted_model.forecast(X_test)
x_pred.reset_index(inplace=True, drop=False)
columns = [automl_settings[constants.TimeSeries.TIME_COLUMN_NAME]]
if automl_settings.get(constants.TimeSeries.GRAIN_COLUMN_NAMES):
# We know that fitted_model.grain_column_names is a list.
columns.extend(fitted_model.grain_column_names)
columns.append(constants.TimeSeriesInternal.DUMMY_TARGET_COLUMN)
# Remove featurized columns.
x_pred = x_pred[columns]
x_pred.rename(
{constants.TimeSeriesInternal.DUMMY_TARGET_COLUMN: "predicted_level"},
axis=1,
inplace=True,
)
# By default we will have forecast quantiles of 0.5, which is our target
if forecast_quantiles:
if 0.5 not in forecast_quantiles:
forecast_quantiles.append(0.5)
fitted_model.quantiles = forecast_quantiles
x_pred = fitted_model.forecast_quantiles(X_test)
x_pred["actual_level"] = y_test
x_pred["backtest_iteration"] = f"iteration_{last_training_date}"
x_pred.rename({0.5: "predicted_level"}, axis=1, inplace=True)
date_safe = RE_INVALID_SYMBOLS.sub("_", last_training_date)
x_pred.to_csv(os.path.join(output_dir, f"iteration_{date_safe}.csv"), index=False)
return x_pred

View File

@@ -7,7 +7,7 @@
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License.\n",
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/automl-forecasting-function.png)"
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/auto-ml-forecasting-backtest-single-model.png)"
]
},
{
@@ -283,7 +283,8 @@
"| **experiment_timeout_hours** | Maximum amount of time in hours that the experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |\n",
"| **label_column_name** | The name of the label column. |\n",
"| **max_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
"| **n_cross_validations** | Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"| **n_cross_validations** | Number of cross validation splits. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value.\n",
"| **time_column_name** | The name of your time column. |\n",
"| **grain_column_names** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |"
]
@@ -301,7 +302,8 @@
" \"iterations\": 15,\n",
" \"experiment_timeout_hours\": 1, # This also needs to be changed based on the dataset. For larger data set this number needs to be bigger.\n",
" \"label_column_name\": LABEL_COLUMN_NAME,\n",
" \"n_cross_validations\": 3,\n",
" \"n_cross_validations\": \"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" \"cv_step_size\": \"auto\",\n",
" \"time_column_name\": TIME_COLUMN_NAME,\n",
" \"max_horizon\": FORECAST_HORIZON,\n",
" \"track_child_runs\": False,\n",
@@ -363,6 +365,7 @@
" step_size=BACKTESTING_PERIOD,\n",
" step_number=NUMBER_OF_BACKTESTS,\n",
" model_uid=model_uid,\n",
" forecast_quantiles=[0.025, 0.975], # Optional\n",
")"
]
},
@@ -588,6 +591,7 @@
" step_size=BACKTESTING_PERIOD,\n",
" step_number=NUMBER_OF_BACKTESTS,\n",
" model_name=model_name,\n",
" forecast_quantiles=[0.025, 0.975],\n",
")"
]
},
@@ -698,9 +702,9 @@
"Azure ML AutoML"
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
@@ -712,7 +716,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.5"
},
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,

View File

@@ -1,4 +0,0 @@
name: auto-ml-forecasting-backtest-single-model
dependencies:
- pip:
- azureml-sdk

View File

@@ -31,6 +31,7 @@ def get_backtest_pipeline(
step_number: int,
model_name: Optional[str] = None,
model_uid: Optional[str] = None,
forecast_quantiles: Optional[list] = None,
) -> Pipeline:
"""
:param experiment: The experiment used to run the pipeline.
@@ -44,6 +45,7 @@ def get_backtest_pipeline(
:param step_size: The number of periods to step back in backtesting.
:param step_number: The number of backtesting iterations.
:param model_uid: The uid to mark models from this run of the experiment.
:param forecast_quantiles: The forecast quantiles that are required in the inference.
:return: The pipeline to be used for model retraining.
**Note:** The output will be uploaded in the pipeline output
called 'score'.
@@ -135,6 +137,9 @@ def get_backtest_pipeline(
if model_uid is not None:
prs_args.append("--model-uid")
prs_args.append(model_uid)
if forecast_quantiles:
prs_args.append("--forecast_quantiles")
prs_args.extend(forecast_quantiles)
backtest_prs = ParallelRunStep(
name=parallel_step_name,
parallel_run_config=back_test_config,

View File

@@ -16,6 +16,13 @@
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-bike-share/auto-ml-forecasting-bike-share.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=\"red\" size=\"5\"><strong>!Important!</strong> </br>This notebook is outdated and is not supported by the AutoML Team. Please use the supported version ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/automl-standalone-jobs/automl-forecasting-task-bike-share)).</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -42,7 +49,7 @@
"\n",
"AutoML highlights here include built-in holiday featurization, accessing engineered feature names, and working with the `forecast` function. Please also look at the additional forecasting notebooks, which document lagging, rolling windows, forecast quantiles, other ways to use the forecast function, and forecaster deployment.\n",
"\n",
"Make sure you have executed the [configuration notebook](../../../configuration.ipynb) before running this notebook.\n",
"Make sure you have executed the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) before running this notebook.\n",
"\n",
"Notebook synopsis:\n",
"1. Creating an Experiment in an existing Workspace\n",
@@ -61,7 +68,11 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"gather": {
"logged": 1680248038565
}
},
"outputs": [],
"source": [
"import json\n",
@@ -170,25 +181,6 @@
"source": [
"## Data\n",
"\n",
"The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace) is paired with the storage account, which contains the default data store. We will use it to upload the bike share data and create [tabular dataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into tabular representation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datastore = ws.get_default_datastore()\n",
"datastore.upload_files(\n",
" files=[\"./bike-no.csv\"], target_path=\"dataset/\", overwrite=True, show_progress=True\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's set up what we know about the dataset. \n",
"\n",
"**Target column** is what we want to forecast.\n",
@@ -206,25 +198,50 @@
"time_column_name = \"date\""
]
},
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"You are now ready to load the historical bike share data. We will load the CSV file into a plain pandas DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"dataset = Dataset.Tabular.from_delimited_files(\n",
" path=[(datastore, \"dataset/bike-no.csv\")]\n",
").with_timestamp_columns(fine_grain_timestamp=time_column_name)\n",
"all_data = pd.read_csv(\"bike-no.csv\", parse_dates=[time_column_name])\n",
"\n",
"# Drop the columns 'casual' and 'registered' as these columns are a breakdown of the total and therefore a leak.\n",
"dataset = dataset.drop_columns(columns=[\"casual\", \"registered\"])\n",
"\n",
"dataset.take(5).to_pandas_dataframe().reset_index(drop=True)"
"all_data.drop([\"casual\", \"registered\"], axis=1, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"### Split the data\n",
"\n",
@@ -234,22 +251,63 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"gather": {
"logged": 1680247376789
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"# select data that occurs before a specified date\n",
"train = dataset.time_before(datetime(2012, 8, 31), include_boundary=True)\n",
"train.to_pandas_dataframe().tail(5).reset_index(drop=True)"
"train = all_data[all_data[time_column_name] <= pd.Timestamp(\"2012-08-31\")].copy()\n",
"test = all_data[all_data[time_column_name] >= pd.Timestamp(\"2012-09-01\")].copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Upload data to datastore\n",
"\n",
"The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace) is paired with the storage account, which contains the default data store. We will use it to upload the bike share data and create [tabular dataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into tabular representation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"test = dataset.time_after(datetime(2012, 9, 1), include_boundary=True)\n",
"test.to_pandas_dataframe().head(5).reset_index(drop=True)"
"from azureml.data.dataset_factory import TabularDatasetFactory\n",
"\n",
"datastore = ws.get_default_datastore()\n",
"\n",
"train_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
" train, target=(datastore, \"dataset/\"), name=\"bike_no_train\"\n",
")\n",
"\n",
"test_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
" test, target=(datastore, \"dataset/\"), name=\"bike_no_test\"\n",
")"
]
},
{
@@ -265,7 +323,8 @@
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
"|**country_or_region_for_holidays**|The country/region used to generate holiday features. These should be ISO 3166 two-letter country/region codes (i.e. 'US', 'GB').|\n",
"|**target_lags**|The target_lags specifies how far back we will construct the lags of the target variable.|\n",
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n",
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value."
]
},
{
@@ -285,7 +344,7 @@
"|**training_data**|Input dataset, containing both features and label column.|\n",
"|**label_column_name**|The name of the label column.|\n",
"|**compute_target**|The remote compute for training.|\n",
"|**n_cross_validations**|Number of cross validation splits.|\n",
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value.\n",
"|**enable_early_stopping**|If early stopping is on, training will stop when the primary metric is no longer improving.|\n",
"|**forecasting_parameters**|A class that holds all the forecasting related parameters.|\n",
"\n",
@@ -350,6 +409,7 @@
" country_or_region_for_holidays=\"US\", # set country_or_region will trigger holiday featurizer\n",
" target_lags=\"auto\", # use heuristic based lag setting\n",
" freq=\"D\", # Set the forecast frequency to be daily\n",
" cv_step_size=\"auto\",\n",
")\n",
"\n",
"automl_config = AutoMLConfig(\n",
@@ -358,11 +418,11 @@
" featurization=featurization_config,\n",
" blocked_models=[\"ExtremeRandomTrees\"],\n",
" experiment_timeout_hours=0.3,\n",
" training_data=train,\n",
" training_data=train_dataset,\n",
" label_column_name=target_column_name,\n",
" compute_target=compute_target,\n",
" enable_early_stopping=True,\n",
" n_cross_validations=3,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" max_concurrent_iterations=4,\n",
" max_cores_per_iteration=-1,\n",
" verbosity=logging.INFO,\n",
@@ -544,7 +604,7 @@
"from run_forecast import run_rolling_forecast\n",
"\n",
"remote_run = run_rolling_forecast(\n",
" test_experiment, compute_target, best_run, test, target_column_name\n",
" test_experiment, compute_target, best_run, test_dataset, target_column_name\n",
")\n",
"remote_run"
]
@@ -573,7 +633,32 @@
"outputs": [],
"source": [
"remote_run.download_file(\"outputs/predictions.csv\", \"predictions.csv\")\n",
"df_all = pd.read_csv(\"predictions.csv\")"
"fcst_df = pd.read_csv(\"predictions.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the rolling forecast can contain multiple predictions for each date, each from a different forecast origin. For example, consider 2012-09-05:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fcst_df[fcst_df.date == \"2012-09-05\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, the forecast origin refers to the latest date of actuals available for a given forecast. The earliest origin in the rolling forecast, 2012-08-31, is the last day in the training data. For origin date 2012-09-01, the forecasts use actual recorded counts from the training data *and* the actual count recorded on 2012-09-01. Note that the model is not retrained for origin dates later than 2012-08-31, but the values for model features, such as lagged values of daily count, are updated.\n",
"\n",
"Let's calculate the metrics over all rolling forecasts:"
]
},
{
@@ -585,29 +670,17 @@
"from azureml.automl.core.shared import constants\n",
"from azureml.automl.runtime.shared.score import scoring\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error\n",
"from matplotlib import pyplot as plt\n",
"\n",
"# use automl metrics module\n",
"scores = scoring.score_regression(\n",
" y_test=df_all[target_column_name],\n",
" y_pred=df_all[\"predicted\"],\n",
" y_test=fcst_df[target_column_name],\n",
" y_pred=fcst_df[\"predicted\"],\n",
" metrics=list(constants.Metric.SCALAR_REGRESSION_SET),\n",
")\n",
"\n",
"print(\"[Test data scores]\\n\")\n",
"for key, value in scores.items():\n",
" print(\"{}: {:.3f}\".format(key, value))\n",
"\n",
"# Plot outputs\n",
"%matplotlib inline\n",
"test_pred = plt.scatter(df_all[target_column_name], df_all[\"predicted\"], color=\"b\")\n",
"test_test = plt.scatter(\n",
" df_all[target_column_name], df_all[target_column_name], color=\"g\"\n",
")\n",
"plt.legend(\n",
" (test_pred, test_test), (\"prediction\", \"truth\"), loc=\"upper left\", fontsize=8\n",
")\n",
"plt.show()"
" print(\"{}: {:.3f}\".format(key, value))"
]
},
{
@@ -616,36 +689,15 @@
"source": [
"For more details on what metrics are included and how they are calculated, please refer to [supported metrics](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#regressionforecasting-metrics). You could also calculate residuals, like described [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#residuals).\n",
"\n",
"\n",
"Since we did a rolling evaluation on the test set, we can analyze the predictions by their forecast horizon relative to the rolling origin. The model was initially trained at a forecast horizon of 14, so each prediction from the model is associated with a horizon value from 1 to 14. The horizon values are in a column named, \"horizon_origin,\" in the prediction set. For example, we can calculate some of the error metrics grouped by the horizon:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from metrics_helper import MAPE, APE\n",
"\n",
"df_all.groupby(\"horizon_origin\").apply(\n",
" lambda df: pd.Series(\n",
" {\n",
" \"MAPE\": MAPE(df[target_column_name], df[\"predicted\"]),\n",
" \"RMSE\": np.sqrt(\n",
" mean_squared_error(df[target_column_name], df[\"predicted\"])\n",
" ),\n",
" \"MAE\": mean_absolute_error(df[target_column_name], df[\"predicted\"]),\n",
" }\n",
" )\n",
")"
"The rolling forecast metric values are very high in comparison to the validation metrics reported by the AutoML job. What's going on here? We will investigate in the following cells!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To drill down more, we can look at the distributions of APE (absolute percentage error) by horizon. From the chart, it is clear that the overall MAPE is being skewed by one particular point where the actual value is of small absolute value."
"### Forecast versus actuals plot\n",
"We will plot predictions and actuals on a time series plot. Since there are many forecasts for each date, we select the 14-day-ahead forecast from each forecast origin for our comparison."
]
},
{
@@ -654,21 +706,55 @@
"metadata": {},
"outputs": [],
"source": [
"df_all_APE = df_all.assign(APE=APE(df_all[target_column_name], df_all[\"predicted\"]))\n",
"APEs = [\n",
" df_all_APE[df_all[\"horizon_origin\"] == h].APE.values\n",
" for h in range(1, forecast_horizon + 1)\n",
"]\n",
"from matplotlib import pyplot as plt\n",
"\n",
"%matplotlib inline\n",
"plt.boxplot(APEs)\n",
"plt.yscale(\"log\")\n",
"plt.xlabel(\"horizon\")\n",
"plt.ylabel(\"APE (%)\")\n",
"plt.title(\"Absolute Percentage Errors by Forecast Horizon\")\n",
"\n",
"fcst_df_h14 = (\n",
" fcst_df.groupby(\"forecast_origin\", as_index=False)\n",
" .last()\n",
" .drop(columns=[\"forecast_origin\"])\n",
")\n",
"fcst_df_h14.set_index(time_column_name, inplace=True)\n",
"plt.plot(fcst_df_h14[[target_column_name, \"predicted\"]])\n",
"plt.xticks(rotation=45)\n",
"plt.title(f\"Predicted vs. Actuals\")\n",
"plt.legend([\"actual\", \"14-day-ahead forecast\"])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the plot, there are two clear issues:\n",
"1. An anomalously low count value on October 29th, 2012.\n",
"2. End-of-year holidays (Thanksgiving and Christmas) in late November and late December.\n",
"\n",
"What happened on Oct. 29th, 2012? That day, Hurricane Sandy brought severe storm surge flooding to the east coast of the United States, particularly around New York City. This is certainly an anomalous event that the model did not account for!\n",
"\n",
"As for the late year holidays, the model apparently did not learn to account for the full reduction of bike share rentals on these major holidays. The training data covers 2011 and early 2012, so the model fit only had access to a single occurrence of these holidays. This makes it challenging to resolve holiday effects; however, a larger AutoML model search may result in a better model that is more holiday-aware.\n",
"\n",
"If we filter the predictions prior to the Thanksgiving holiday and remove the anomalous day of 2012-10-29, the metrics are closer to validation levels:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"date_filter = (fcst_df.date != \"2012-10-29\") & (fcst_df.date < \"2012-11-22\")\n",
"scores = scoring.score_regression(\n",
" y_test=fcst_df[date_filter][target_column_name],\n",
" y_pred=fcst_df[date_filter][\"predicted\"],\n",
" metrics=list(constants.Metric.SCALAR_REGRESSION_SET),\n",
")\n",
"\n",
"print(\"[Test data scores (filtered)]\\n\")\n",
"for key, value in scores.items():\n",
" print(\"{}: {:.3f}\".format(key, value))"
]
}
],
"metadata": {
@@ -694,10 +780,13 @@
],
"friendly_name": "Forecasting BikeShare Demand",
"index_order": 1,
"kernel_info": {
"name": "python38-azureml"
},
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
@@ -709,17 +798,30 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
"version": "3.8.10"
},
"microsoft": {
"ms_spell_check": {
"ms_spell_check_language": "en"
}
},
"mimetype": "text/x-python",
"name": "python",
"npconvert_exporter": "python",
"nteract": {
"version": "nteract-front-end@1.0.0"
},
"pygments_lexer": "ipython3",
"tags": [
"Forecasting"
],
"task": "Forecasting",
"version": 3
"version": 3,
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,
"nbformat_minor": 4

View File

@@ -1,4 +0,0 @@
name: auto-ml-forecasting-bike-share
dependencies:
- pip:
- azureml-sdk

View File

@@ -1,6 +1,6 @@
import argparse
from azureml.core import Dataset, Run
from sklearn.externals import joblib
import joblib
parser = argparse.ArgumentParser()
parser.add_argument(
@@ -36,18 +36,18 @@ y_test_df = (
fitted_model = joblib.load("model.pkl")
y_pred, X_trans = fitted_model.rolling_evaluation(X_test_df, y_test_df.values)
X_rf = fitted_model.rolling_forecast(X_test_df, y_test_df.values, step=1)
# Add predictions, actuals, and horizon relative to rolling origin to the test feature data
assign_dict = {
"horizon_origin": X_trans["horizon_origin"].values,
"predicted": y_pred,
target_column_name: y_test_df[target_column_name].values,
fitted_model.forecast_origin_column_name: "forecast_origin",
fitted_model.forecast_column_name: "predicted",
fitted_model.actual_column_name: target_column_name,
}
df_all = X_test_df.assign(**assign_dict)
X_rf.rename(columns=assign_dict, inplace=True)
file_name = "outputs/predictions.csv"
export_csv = df_all.to_csv(file_name, header=True)
export_csv = X_rf.to_csv(file_name, header=True)
# Upload the predictions into artifacts
run.upload_file(name=file_name, path_or_stream=file_name)

View File

@@ -16,6 +16,13 @@
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand/auto-ml-forecasting-energy-demand.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=\"red\" size=\"5\"><strong>!Important!</strong> </br>This notebook is outdated and is not supported by the AutoML Team. Please use the supported version ([link](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/automl-standalone-jobs/automl-forecasting-task-energy-demand/automl-forecasting-task-energy-demand-advanced-mlflow.ipynb)).</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -43,7 +50,7 @@
"\n",
"In this example we use the associated New York City energy demand dataset to showcase how you can use AutoML for a simple forecasting problem and explore the results. The goal is predict the energy demand for the next 48 hours based on historic time-series data.\n",
"\n",
"If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) first, if you haven't already, to establish your connection to the AzureML Workspace.\n",
"If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) first, if you haven't already, to establish your connection to the AzureML Workspace.\n",
"\n",
"In this notebook you will learn how to:\n",
"1. Creating an Experiment using an existing Workspace\n",
@@ -260,8 +267,12 @@
"outputs": [],
"source": [
"# split into train based on time\n",
"train = dataset.time_before(datetime(2017, 8, 8, 5), include_boundary=True)\n",
"train.to_pandas_dataframe().reset_index(drop=True).sort_values(time_column_name).tail(5)"
"train = (\n",
" dataset.time_before(datetime(2017, 8, 8, 5), include_boundary=True)\n",
" .to_pandas_dataframe()\n",
" .reset_index(drop=True)\n",
")\n",
"train.sort_values(time_column_name).tail(5)"
]
},
{
@@ -271,8 +282,39 @@
"outputs": [],
"source": [
"# split into test based on time\n",
"test = dataset.time_between(datetime(2017, 8, 8, 6), datetime(2017, 8, 10, 5))\n",
"test.to_pandas_dataframe().reset_index(drop=True).head(5)"
"test = (\n",
" dataset.time_between(datetime(2017, 8, 8, 6), datetime(2017, 8, 10, 5))\n",
" .to_pandas_dataframe()\n",
" .reset_index(drop=True)\n",
")\n",
"test.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"jupyter": {
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"# register the splitted train and test data in workspace storage\n",
"from azureml.data.dataset_factory import TabularDatasetFactory\n",
"\n",
"datastore = ws.get_default_datastore()\n",
"train_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
" train, target=(datastore, \"dataset/\"), name=\"nyc_energy_train\"\n",
")\n",
"test_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
" test, target=(datastore, \"dataset/\"), name=\"nyc_energy_test\"\n",
")"
]
},
{
@@ -308,7 +350,8 @@
"|-|-|\n",
"|**time_column_name**|The name of your time column.|\n",
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n",
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoML determines the cross-validation step size automatically if a validation set is not provided. Alternatively, specify an integer value."
]
},
{
@@ -328,7 +371,7 @@
"|**training_data**|The training data to be used within the experiment.|\n",
"|**label_column_name**|The name of the label column.|\n",
"|**compute_target**|The remote compute for training.|\n",
"|**n_cross_validations**|Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way.|\n",
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection. The default value is \"auto\", in which case AutoML determines the number of cross-validations automatically if a validation set is not provided. Alternatively, specify an integer value.|\n",
"|**enable_early_stopping**|Flag to enable early termination if the score is not improving in the short term.|\n",
"|**forecasting_parameters**|A class that holds all the forecasting-related parameters.|\n"
]
@@ -352,6 +395,7 @@
" time_column_name=time_column_name,\n",
" forecast_horizon=forecast_horizon,\n",
" freq=\"H\", # Set the forecast frequency to be hourly\n",
" cv_step_size=\"auto\",\n",
")\n",
"\n",
"automl_config = AutoMLConfig(\n",
@@ -359,11 +403,11 @@
" primary_metric=\"normalized_root_mean_squared_error\",\n",
" blocked_models=[\"ExtremeRandomTrees\", \"AutoArima\", \"Prophet\"],\n",
" experiment_timeout_hours=0.3,\n",
" training_data=train,\n",
" training_data=train_dataset,\n",
" label_column_name=target_column_name,\n",
" compute_target=compute_target,\n",
" enable_early_stopping=True,\n",
" n_cross_validations=3,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" verbosity=logging.INFO,\n",
" forecasting_parameters=forecasting_parameters,\n",
")"
@@ -519,7 +563,7 @@
" test_experiment=test_experiment,\n",
" compute_target=compute_target,\n",
" train_run=best_run,\n",
" test_dataset=test,\n",
" test_dataset=test_dataset,\n",
" target_column_name=target_column_name,\n",
")\n",
"remote_run_infer.wait_for_completion(show_output=False)\n",
@@ -609,6 +653,7 @@
" forecast_horizon=forecast_horizon,\n",
" target_lags=12,\n",
" target_rolling_window_size=4,\n",
" cv_step_size=\"auto\",\n",
")\n",
"\n",
"automl_config = AutoMLConfig(\n",
@@ -624,11 +669,11 @@
" \"Prophet\",\n",
" ], # These models are blocked for tutorial purposes, remove this for real use cases.\n",
" experiment_timeout_hours=0.3,\n",
" training_data=train,\n",
" training_data=train_dataset,\n",
" label_column_name=target_column_name,\n",
" compute_target=compute_target,\n",
" enable_early_stopping=True,\n",
" n_cross_validations=3,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" verbosity=logging.INFO,\n",
" forecasting_parameters=advanced_forecasting_parameters,\n",
")"
@@ -695,7 +740,7 @@
" test_experiment=test_experiment_advanced,\n",
" compute_target=compute_target,\n",
" train_run=best_run_lags,\n",
" test_dataset=test,\n",
" test_dataset=test_dataset,\n",
" target_column_name=target_column_name,\n",
" inference_folder=\"./forecast_advanced\",\n",
")\n",
@@ -763,10 +808,13 @@
"how-to-use-azureml",
"automated-machine-learning"
],
"kernel_info": {
"name": "python3"
},
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
@@ -778,9 +826,22 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.10"
},
"microsoft": {
"ms_spell_check": {
"ms_spell_check_language": "en"
}
},
"nteract": {
"version": "nteract-front-end@1.0.0"
},
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}
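
The `n_cross_validations` and `cv_step_size` settings introduced in this notebook describe rolling origin validation: each fold validates on the `forecast_horizon` periods after an origin, and consecutive origins sit `cv_step_size` periods apart. The sketch below is only an illustration of that idea in plain Python. The function name `rolling_origin_folds` and its index convention are assumptions made for this example, and it is not how AutoML computes folds internally or how it resolves `"auto"`.

```python
def rolling_origin_folds(n_obs, horizon, n_folds, step):
    """Return (train_end, valid_end) pairs, oldest fold first.

    Hypothetical helper for illustration only. For each pair, the training
    rows are positions [0, train_end) and the validation rows are
    positions [train_end, valid_end) of a series sorted by time.
    """
    folds = []
    for k in range(n_folds):
        # The newest fold ends at the last observation; each older fold
        # is shifted back by `step` periods.
        valid_end = n_obs - k * step
        train_end = valid_end - horizon
        if train_end <= 0:
            raise ValueError("series too short for the requested folds")
        folds.append((train_end, valid_end))
    return folds[::-1]


# 100 hourly observations, a 10-period horizon, 3 folds spaced 5 periods apart.
folds = rolling_origin_folds(n_obs=100, horizon=10, n_folds=3, step=5)
for train_end, valid_end in folds:
    print(f"train [0, {train_end})  validate [{train_end}, {valid_end})")
```

Every fold trains only on data that precedes its validation window, which is what keeps the split temporally consistent.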

View File

@@ -1,4 +0,0 @@
name: auto-ml-forecasting-energy-demand
dependencies:
- pip:
- azureml-sdk

View File

@@ -6,7 +6,7 @@ compute instance.
import argparse
from azureml.core import Dataset, Run
from sklearn.externals import joblib
import joblib
from pandas.tseries.frequencies import to_offset
parser = argparse.ArgumentParser()

View File

@@ -52,7 +52,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Please make sure you have followed the `configuration.ipynb` notebook so that your ML workspace information is saved in the config file."
"Please make sure you have followed the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) so that your ML workspace information is saved in the config file."
]
},
{
@@ -335,7 +335,8 @@
" forecast_horizon=forecast_horizon,\n",
" time_series_id_column_names=[TIME_SERIES_ID_COLUMN_NAME],\n",
" target_lags=lags,\n",
" freq=\"H\", # Set the forecast frequency to be hourly\n",
" freq=\"H\", # Set the forecast frequency to be hourly\n",
" cv_step_size=\"auto\",\n",
")"
]
},
@@ -365,7 +366,7 @@
" enable_early_stopping=True,\n",
" training_data=train_data,\n",
" compute_target=compute_target,\n",
" n_cross_validations=3,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" verbosity=logging.INFO,\n",
" max_concurrent_iterations=4,\n",
" max_cores_per_iteration=-1,\n",
@@ -757,7 +758,15 @@
"metadata": {},
"source": [
"## Forecasting farther than the forecast horizon <a id=\"recursive forecasting\"></a>\n",
"When the forecast destination, or the latest date in the prediction data frame, is farther into the future than the specified forecast horizon, the `forecast()` function will still make point predictions out to the later date using a recursive operation mode. Internally, the method recursively applies the regular forecaster to generate context so that we can forecast further into the future. \n",
"When the forecast destination, or the latest date in the prediction data frame, is farther into the future than the specified forecast horizon, the forecaster must be iteratively applied. Here, we advance the forecast origin on each iteration over the prediction window, predicting `max_horizon` periods ahead on each iteration. There are two choices for the context data to use as the forecaster advances into the prediction window:\n",
"\n",
"1. We can use forecasted values from previous iterations (recursive forecast),\n",
"2. We can use known, actual values of the target if they are available (rolling forecast).\n",
"\n",
"The first method is useful in a true forecasting scenario, when we do not yet know the actual target values, while the second is useful in an evaluation scenario, where we want to compute accuracy metrics for the `max_horizon`-period-ahead forecaster over a long test set. We refer to the first as a **recursive forecast**, since we apply the forecaster recursively over the prediction window, and the second as a **rolling forecast**, since we roll forward over known actuals.\n",
"\n",
"### Recursive forecasting\n",
"By default, the `forecast()` function will make point predictions out to the later date using a recursive operation mode. Internally, the method recursively applies the regular forecaster to generate context so that we can forecast further into the future. \n",
"\n",
"To illustrate the use-case and operation of recursive forecasting, we'll consider an example with a single time-series where the forecasting period directly follows the training period and is twice as long as the forecasting horizon given at training time.\n",
"\n",
@@ -817,6 +826,35 @@
"np.array_equal(y_pred_all, y_pred_long)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rolling forecasts\n",
"A rolling forecast is a similar concept to the recursive forecasts described above except that we use known actual values of the target for our context data. We have provided a different, public method for this called `rolling_forecast`. In addition to test data and actuals (`X_test` and `y_test`), `rolling_forecast` also accepts an optional `step` parameter that controls how far the origin advances on each iteration. The recursive forecast mode uses a fixed step of `max_horizon` while `rolling_forecast` defaults to a step size of 1, but can be set to any integer from 1 to `max_horizon`, inclusive.\n",
"\n",
"Let's see what the rolling forecast looks like on the long test set with the step set to 1:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_rf = fitted_model.rolling_forecast(X_test_long, y_test_long, step=1)\n",
"X_rf.head(n=12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that `rolling_forecast` has returned a single DataFrame containing all results and has generated some new columns: `_automl_forecast_origin`, `_automl_forecast_y`, and `_automl_actual_y`. These are the origin date for each forecast, the forecasted value and the actual value, respectively. Note that \"y\" in the forecast and actual column names will generally be replaced by the target column name supplied to AutoML.\n",
"\n",
"The output above shows forecasts for two prediction windows, the first with origin at the end of the training set and the second including the first observation in the test set (2000-01-01 06:00:00). Since the forecast windows overlap, there are multiple forecasts for most dates which are associated with different origin dates."
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -865,9 +903,9 @@
"friendly_name": "Forecasting away from training data",
"index_order": 3,
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
@@ -879,14 +917,19 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.7.13"
},
"tags": [
"Forecasting",
"Confidence Intervals"
],
"task": "Forecasting"
"task": "Forecasting",
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}
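
The notebook text added above distinguishes recursive forecasts (earlier forecasts become context) from rolling forecasts (known actuals become context). The sketch below illustrates that difference with a deliberately trivial forecaster that repeats the last context value. All names here (`naive_forecast`, `toy_recursive_forecast`, `toy_rolling_forecast`) are hypothetical and written for this example; they are not the AutoML `forecast()` or `rolling_forecast()` implementations, and origins are expressed as offsets into the test set rather than as dates.

```python
def naive_forecast(context, horizon):
    """Toy forecaster: repeat the last observed value for `horizon` periods."""
    return [context[-1]] * horizon


def toy_recursive_forecast(history, n_periods, horizon):
    """Forecast `n_periods` ahead, feeding earlier forecasts back in as context."""
    context = list(history)
    preds = []
    while len(preds) < n_periods:
        block = naive_forecast(context, horizon)[: n_periods - len(preds)]
        preds.extend(block)
        context.extend(block)  # forecasts, not actuals, extend the context
    return preds


def toy_rolling_forecast(history, actuals, horizon, step=1):
    """Return (origin, target, predicted, actual) rows over the test set."""
    rows = []
    for origin in range(0, len(actuals), step):
        # Known actuals before the origin extend the context.
        context = list(history) + list(actuals[:origin])
        for h, pred in enumerate(naive_forecast(context, horizon)):
            target = origin + h
            if target < len(actuals):
                rows.append((origin, target, pred, actuals[target]))
    return rows


history = [1, 2, 3]
actuals = [4, 5, 6, 7]
print(toy_recursive_forecast(history, n_periods=4, horizon=2))
for row in toy_rolling_forecast(history, actuals, horizon=2, step=1):
    print(row)
```

With a step of 1 the forecast windows overlap, so most target positions appear in more than one row, each tied to a different origin. That mirrors the multiple forecasts per date described in the rolling forecast output above.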

View File

@@ -1,4 +0,0 @@
name: auto-ml-forecasting-function
dependencies:
- pip:
- azureml-sdk

View File

@@ -19,7 +19,14 @@
"hidePrompt": false
},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-beer-remote/auto-ml-forecasting-beer-remote.png)"
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-github-dau/auto-ml-forecasting-github-dau.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=\"red\" size=\"5\"><strong>!Important!</strong> <br/>This notebook is outdated and is not supported by the AutoML Team. Please use the supported version ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/automl-standalone-jobs/automl-forecasting-github-dau)).</font>"
]
},
{
@@ -52,7 +59,7 @@
"\n",
"AutoML highlights here include using Deep Learning forecasts, Arima, Prophet, Remote Execution and Remote Inferencing, and working with the `forecast` function. Please also look at the additional forecasting notebooks, which document lagging, rolling windows, forecast quantiles, other ways to use the forecast function, and forecaster deployment.\n",
"\n",
"Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n",
"Make sure you have executed the [configuration](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) before running this notebook.\n",
"\n",
"Notebook synopsis:\n",
"\n",
@@ -325,7 +332,7 @@
"source": [
"### Setting forecaster maximum horizon \n",
"\n",
"The forecast horizon is the number of periods into the future that the model should predict. Here, we set the horizon to 12 periods (i.e. 12 months). Notice that this is much shorter than the number of months in the test set; we will need to use a rolling test to evaluate the performance on the whole test set. For more discussion of forecast horizons and guiding principles for setting them, please see the [energy demand notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand). "
"The forecast horizon is the number of periods into the future that the model should predict. Here, we set the horizon to 14 periods (i.e. 14 days). Notice that this is much shorter than the number of days in the test set; we will need to use a rolling test to evaluate the performance on the whole test set. For more discussion of forecast horizons and guiding principles for setting them, please see the [energy demand notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand). "
]
},
{
@@ -337,7 +344,7 @@
},
"outputs": [],
"source": [
"forecast_horizon = 12"
"forecast_horizon = 14"
]
},
{
@@ -382,7 +389,7 @@
"automl_config = AutoMLConfig(\n",
" task=\"forecasting\",\n",
" primary_metric=\"normalized_root_mean_squared_error\",\n",
" experiment_timeout_hours=1,\n",
" experiment_timeout_hours=1.5,\n",
" training_data=train_dataset,\n",
" label_column_name=target_column_name,\n",
" validation_data=valid_dataset,\n",
@@ -681,9 +688,9 @@
],
"hide_code_all_hidden": false,
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
@@ -695,9 +702,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

View File

@@ -1,4 +0,0 @@
name: auto-ml-forecasting-github-dau
dependencies:
- pip:
- azureml-sdk

View File

@@ -4,8 +4,7 @@ import os
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset
from sklearn.externals import joblib
import joblib
from sklearn.metrics import mean_absolute_error, mean_squared_error
from azureml.automl.runtime.shared.score import scoring, constants
@@ -19,219 +18,8 @@ except ImportError:
_torch_present = False
def align_outputs(
y_predicted,
X_trans,
X_test,
y_test,
predicted_column_name="predicted",
horizon_colname="horizon_origin",
):
"""
Demonstrates how to get the output aligned to the inputs
using pandas indexes. Helps understand what happened if
the output's shape differs from the input shape, or if
the data got re-sorted by time and grain during forecasting.
Typical causes of misalignment are:
* we predicted some periods that were missing in actuals -> drop from eval
* model was asked to predict past max_horizon -> increase max horizon
* data at start of X_test was needed for lags -> provide previous periods
"""
if horizon_colname in X_trans:
df_fcst = pd.DataFrame(
{
predicted_column_name: y_predicted,
horizon_colname: X_trans[horizon_colname],
}
)
else:
df_fcst = pd.DataFrame({predicted_column_name: y_predicted})
# y and X outputs are aligned by forecast() function contract
df_fcst.index = X_trans.index
# align original X_test to y_test
X_test_full = X_test.copy()
X_test_full[target_column_name] = y_test
# X_test_full's index does not include origin, so reset for merge
df_fcst.reset_index(inplace=True)
X_test_full = X_test_full.reset_index().drop(columns="index")
together = df_fcst.merge(X_test_full, how="right")
# drop rows where prediction or actuals are nan
# happens because of missing actuals
# or at edges of time due to lags/rolling windows
clean = together[
together[[target_column_name, predicted_column_name]].notnull().all(axis=1)
]
return clean
def do_rolling_forecast_with_lookback(
fitted_model, X_test, y_test, max_horizon, X_lookback, y_lookback, freq="D"
):
"""
Produce forecasts on a rolling origin over the given test set.
Each iteration makes a forecast for the next 'max_horizon' periods
with respect to the current origin, then advances the origin by the
horizon time duration. The prediction context for each forecast is set so
that the forecaster uses the actual target values prior to the current
origin time for constructing lag features.
This function returns a concatenated DataFrame of rolling forecasts.
"""
print("Using lookback of size: ", y_lookback.size)
df_list = []
origin_time = X_test[time_column_name].min()
X = X_lookback.append(X_test)
y = np.concatenate((y_lookback, y_test), axis=0)
while origin_time <= X_test[time_column_name].max():
# Set the horizon time - end date of the forecast
horizon_time = origin_time + max_horizon * to_offset(freq)
# Extract test data from an expanding window up-to the horizon
expand_wind = X[time_column_name] < horizon_time
X_test_expand = X[expand_wind]
y_query_expand = np.zeros(len(X_test_expand)).astype(float)
y_query_expand.fill(np.NaN)
if origin_time != X[time_column_name].min():
# Set the context by including actuals up-to the origin time
test_context_expand_wind = X[time_column_name] < origin_time
context_expand_wind = X_test_expand[time_column_name] < origin_time
y_query_expand[context_expand_wind] = y[test_context_expand_wind]
# Print some debug info
print(
"Horizon_time:",
horizon_time,
" origin_time: ",
origin_time,
" max_horizon: ",
max_horizon,
" freq: ",
freq,
)
print("expand_wind: ", expand_wind)
print("y_query_expand")
print(y_query_expand)
print("X_test")
print(X)
print("X_test_expand")
print(X_test_expand)
print("Type of X_test_expand: ", type(X_test_expand))
print("Type of y_query_expand: ", type(y_query_expand))
print("y_query_expand")
print(y_query_expand)
# Make a forecast out to the maximum horizon
# y_fcst, X_trans = y_query_expand, X_test_expand
y_fcst, X_trans = fitted_model.forecast(X_test_expand, y_query_expand)
print("y_fcst")
print(y_fcst)
# Align forecast with test set for dates within
# the current rolling window
trans_tindex = X_trans.index.get_level_values(time_column_name)
trans_roll_wind = (trans_tindex >= origin_time) & (trans_tindex < horizon_time)
test_roll_wind = expand_wind & (X[time_column_name] >= origin_time)
df_list.append(
align_outputs(
y_fcst[trans_roll_wind],
X_trans[trans_roll_wind],
X[test_roll_wind],
y[test_roll_wind],
)
)
# Advance the origin time
origin_time = horizon_time
return pd.concat(df_list, ignore_index=True)
def do_rolling_forecast(fitted_model, X_test, y_test, max_horizon, freq="D"):
"""
Produce forecasts on a rolling origin over the given test set.
Each iteration makes a forecast for the next 'max_horizon' periods
with respect to the current origin, then advances the origin by the
horizon time duration. The prediction context for each forecast is set so
that the forecaster uses the actual target values prior to the current
origin time for constructing lag features.
This function returns a concatenated DataFrame of rolling forecasts.
"""
df_list = []
origin_time = X_test[time_column_name].min()
while origin_time <= X_test[time_column_name].max():
# Set the horizon time - end date of the forecast
horizon_time = origin_time + max_horizon * to_offset(freq)
# Extract test data from an expanding window up-to the horizon
expand_wind = X_test[time_column_name] < horizon_time
X_test_expand = X_test[expand_wind]
y_query_expand = np.zeros(len(X_test_expand)).astype(float)
y_query_expand.fill(np.NaN)
if origin_time != X_test[time_column_name].min():
# Set the context by including actuals up-to the origin time
test_context_expand_wind = X_test[time_column_name] < origin_time
context_expand_wind = X_test_expand[time_column_name] < origin_time
y_query_expand[context_expand_wind] = y_test[test_context_expand_wind]
# Print some debug info
print(
"Horizon_time:",
horizon_time,
" origin_time: ",
origin_time,
" max_horizon: ",
max_horizon,
" freq: ",
freq,
)
print("expand_wind: ", expand_wind)
print("y_query_expand")
print(y_query_expand)
print("X_test")
print(X_test)
print("X_test_expand")
print(X_test_expand)
print("Type of X_test_expand: ", type(X_test_expand))
print("Type of y_query_expand: ", type(y_query_expand))
print("y_query_expand")
print(y_query_expand)
# Make a forecast out to the maximum horizon
y_fcst, X_trans = fitted_model.forecast(X_test_expand, y_query_expand)
print("y_fcst")
print(y_fcst)
# Align forecast with test set for dates within the
# current rolling window
trans_tindex = X_trans.index.get_level_values(time_column_name)
trans_roll_wind = (trans_tindex >= origin_time) & (trans_tindex < horizon_time)
test_roll_wind = expand_wind & (X_test[time_column_name] >= origin_time)
df_list.append(
align_outputs(
y_fcst[trans_roll_wind],
X_trans[trans_roll_wind],
X_test[test_roll_wind],
y_test[test_roll_wind],
)
)
# Advance the origin time
origin_time = horizon_time
return pd.concat(df_list, ignore_index=True)
def map_location_cuda(storage, loc):
return storage.cuda()
def APE(actual, pred):
@@ -254,10 +42,6 @@ def MAPE(actual, pred):
return np.mean(APE(actual_safe, pred_safe))
def map_location_cuda(storage, loc):
return storage.cuda()
parser = argparse.ArgumentParser()
parser.add_argument(
"--max_horizon",
@@ -303,7 +87,6 @@ print(model_path)
run = Run.get_context()
# get input dataset by name
test_dataset = run.input_datasets["test_data"]
lookback_dataset = run.input_datasets["lookback_data"]
grain_column_names = []
@@ -312,15 +95,8 @@ df = test_dataset.to_pandas_dataframe()
print("Read df")
print(df)
X_test_df = test_dataset.drop_columns(columns=[target_column_name])
y_test_df = test_dataset.with_timestamp_columns(None).keep_columns(
columns=[target_column_name]
)
X_lookback_df = lookback_dataset.drop_columns(columns=[target_column_name])
y_lookback_df = lookback_dataset.with_timestamp_columns(None).keep_columns(
columns=[target_column_name]
)
X_test_df = df
y_test = df.pop(target_column_name).to_numpy()
_, ext = os.path.splitext(model_path)
if ext == ".pt":
@@ -336,37 +112,20 @@ else:
# Load the sklearn pipeline.
fitted_model = joblib.load(model_path)
if hasattr(fitted_model, "get_lookback"):
lookback = fitted_model.get_lookback()
df_all = do_rolling_forecast_with_lookback(
fitted_model,
X_test_df.to_pandas_dataframe(),
y_test_df.to_pandas_dataframe().values.T[0],
max_horizon,
X_lookback_df.to_pandas_dataframe()[-lookback:],
y_lookback_df.to_pandas_dataframe().values.T[0][-lookback:],
freq,
)
else:
df_all = do_rolling_forecast(
fitted_model,
X_test_df.to_pandas_dataframe(),
y_test_df.to_pandas_dataframe().values.T[0],
max_horizon,
freq,
)
X_rf = fitted_model.rolling_forecast(X_test_df, y_test, step=1)
assign_dict = {
fitted_model.forecast_origin_column_name: "forecast_origin",
fitted_model.forecast_column_name: "predicted",
fitted_model.actual_column_name: target_column_name,
}
X_rf.rename(columns=assign_dict, inplace=True)
print(df_all)
print("target values:::")
print(df_all[target_column_name])
print("predicted values:::")
print(df_all["predicted"])
print(X_rf.head())
# Use the AutoML scoring module
regression_metrics = list(constants.REGRESSION_SCALAR_SET)
y_test = np.array(df_all[target_column_name])
y_pred = np.array(df_all["predicted"])
y_test = np.array(X_rf[target_column_name])
y_pred = np.array(X_rf["predicted"])
scores = scoring.score_regression(y_test, y_pred, regression_metrics)
print("scores:")
@@ -376,11 +135,11 @@ for key, value in scores.items():
run.log(key, value)
print("Simple forecasting model")
rmse = np.sqrt(mean_squared_error(df_all[target_column_name], df_all["predicted"]))
rmse = np.sqrt(mean_squared_error(X_rf[target_column_name], X_rf["predicted"]))
print("[Test Data] \nRoot Mean squared error: %.2f" % rmse)
mae = mean_absolute_error(df_all[target_column_name], df_all["predicted"])
mae = mean_absolute_error(X_rf[target_column_name], X_rf["predicted"])
print("mean_absolute_error score: %.2f" % mae)
print("MAPE: %.2f" % MAPE(df_all[target_column_name], df_all["predicted"]))
print("MAPE: %.2f" % MAPE(X_rf[target_column_name], X_rf["predicted"]))
run.log("rmse", rmse)
run.log("mae", mae)
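
For reference, the three error metrics this script prints can be written out in a few lines of plain Python. This is a minimal sketch for readers of the diff, not the script's own code: the script uses `sklearn.metrics` and its own `APE`/`MAPE` helpers, whose bodies are only partly visible in this hunk. In particular, skipping rows whose actual value is zero in `mape` is an assumption of this example, not necessarily what the script's `MAPE` does.

```python
import math


def rmse(actual, pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))


def mae(actual, pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)


def mape(actual, pred):
    """Mean absolute percentage error, in percent.

    Assumption for this sketch: rows whose actual value is zero are skipped
    to avoid division by zero.
    """
    pairs = [(a, p) for a, p in zip(actual, pred) if a != 0]
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print("rmse: %.2f" % rmse(actual, predicted))
print("mae:  %.2f" % mae(actual, predicted))
print("mape: %.2f" % mape(actual, predicted))
```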

View File

@@ -16,6 +16,13 @@
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-hierarchical-timeseries/auto-ml-forecasting-hierarchical-timeseries.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=\"red\" size=\"5\"><strong>!Important!</strong> <br/>This notebook is outdated and is not supported by the AutoML Team. Please use the supported version ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1k_demand_forecasting_with_pipeline_components/automl-forecasting-demand-hierarchical-timeseries-in-pipeline)).</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -40,7 +47,7 @@
"metadata": {},
"source": [
"### Prerequisites\n",
"You'll need to create a compute Instance by following the instructions in the [EnvironmentSetup.md](../Setup_Resources/EnvironmentSetup.md)."
"You'll need to create a compute Instance by following [these](https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-create-manage-compute-instance?tabs=python) instructions."
]
},
{
@@ -251,8 +258,17 @@
"source": [
"### Set up training parameters\n",
"\n",
"This dictionary defines the AutoML and hierarchy settings. For this forecasting task we need to define several settings inncluding the name of the time column, the maximum forecast horizon, the hierarchy definition, and the level of the hierarchy at which to train.\n",
"We need to provide ``ForecastingParameters``, ``AutoMLConfig`` and ``HTSTrainParameters`` objects. For the forecasting task we need to define several settings including the name of the time column, the maximum forecast horizon, the hierarchy definition, and the level of the hierarchy at which to train.\n",
"\n",
"#### ``ForecastingParameters`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
"| **time_column_name** | The name of your time column. |\n",
"| **time_series_id_column_names** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
"| **cv_step_size** | Number of periods between two consecutive cross-validation folds. The default value is \\\"auto\\\", in which case AutoML determines the cross-validation step size automatically if a validation set is not provided. Alternatively, specify an integer value. |\n",
"\n",
"#### ``AutoMLConfig`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **task** | forecasting |\n",
@@ -260,19 +276,22 @@
"| **blocked_models** | Blocked models won't be used by AutoML. |\n",
"| **iteration_timeout_minutes** | Maximum amount of time in minutes that the model can train. This is optional but provides customers with greater control on exit criteria. |\n",
"| **iterations** | Number of models to train. This is optional but provides customers with greater control on exit criteria. |\n",
"| **experiment_timeout_hours** | Maximum amount of time in hours that the experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |\n",
"| **experiment_timeout_hours** | Maximum amount of time in hours that each experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. **It does not control the overall timeout for the pipeline run; instead, it controls the timeout for each training run per partitioned time series.** |\n",
"| **label_column_name** | The name of the label column. |\n",
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
"| **n_cross_validations** | Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"| **enable_early_stopping** | Flag to enable early termination if the score is not improving in the short term. |\n",
"| **time_column_name** | The name of your time column. |\n",
"| **hierarchy_column_names** | The names of columns that define the hierarchical structure of the data from highest level to most granular. |\n",
"| **training_level** | The level of the hierarchy to be used for training models. |\n",
"| **n_cross_validations** | Number of cross-validation splits. The default value is \\\"auto\\\", in which case AutoML determines the number of cross-validations automatically if a validation set is not provided. Alternatively, specify an integer value. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"| **enable_early_stopping** | Flag to enable early termination if the primary metric is no longer improving. |\n",
"| **enable_engineered_explanations** | Engineered feature explanations will be downloaded if enable_engineered_explanations flag is set to True. By default it is set to False to save storage space. |\n",
"| **time_series_id_column_name** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
"| **track_child_runs** | Flag to disable tracking of child runs. Only best run is tracked if the flag is set to False (this includes the model and metrics of the run). |\n",
"| **pipeline_fetch_max_batch_size** | Determines how many pipelines (training algorithms) to fetch at a time for training, this helps reduce throttling when training at large scale. |\n",
"| **model_explainability** | Flag to disable explaining the best automated ML model at the end of all training iterations. The default is True and will block non-explainable models which may impact the forecast accuracy. For more information, see [Interpretability: model explanations in automated machine learning](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-machine-learning-interpretability-automl). |"
"| **model_explainability** | Flag to disable explaining the best automated ML model at the end of all training iterations. The default is True and will block non-explainable models which may impact the forecast accuracy. For more information, see [Interpretability: model explanations in automated machine learning](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-machine-learning-interpretability-automl). |\n",
"\n",
"#### ``HTSTrainParameters`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **automl_settings** | The ``AutoMLConfig`` object defined above. |\n",
"| **hierarchy_column_names** | The names of columns that define the hierarchical structure of the data from highest level to most granular. |\n",
"| **training_level** | The level of the hierarchy to be used for training models. |\n",
"| **enable_engineered_explanations** | The switch controls engineered explanations. |"
]
},
{
@@ -286,6 +305,9 @@
"outputs": [],
"source": [
"from azureml.train.automl.runtime._hts.hts_parameters import HTSTrainParameters\n",
"from azureml.automl.core.forecasting_parameters import ForecastingParameters\n",
"from azureml.train.automl.automlconfig import AutoMLConfig\n",
"\n",
"\n",
"model_explainability = True\n",
"\n",
@@ -299,23 +321,26 @@
"label_column_name = \"quantity\"\n",
"forecast_horizon = 7\n",
"\n",
"forecasting_parameters = ForecastingParameters(\n",
" time_column_name=time_column_name,\n",
" forecast_horizon=forecast_horizon,\n",
")\n",
"\n",
"automl_settings = {\n",
" \"task\": \"forecasting\",\n",
" \"primary_metric\": \"normalized_root_mean_squared_error\",\n",
" \"label_column_name\": label_column_name,\n",
" \"time_column_name\": time_column_name,\n",
" \"forecast_horizon\": forecast_horizon,\n",
" \"hierarchy_column_names\": hierarchy,\n",
" \"hierarchy_training_level\": training_level,\n",
" \"track_child_runs\": False,\n",
" \"pipeline_fetch_max_batch_size\": 15,\n",
" \"model_explainability\": model_explainability,\n",
"automl_settings = AutoMLConfig(\n",
" task=\"forecasting\",\n",
" primary_metric=\"normalized_root_mean_squared_error\",\n",
" experiment_timeout_hours=1,\n",
" label_column_name=label_column_name,\n",
" track_child_runs=False,\n",
" forecasting_parameters=forecasting_parameters,\n",
" pipeline_fetch_max_batch_size=15,\n",
" model_explainability=model_explainability,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" cv_step_size=\"auto\",\n",
" # The following settings are specific to this sample and should be adjusted according to your own needs.\n",
" \"iteration_timeout_minutes\": 10,\n",
" \"iterations\": 10,\n",
" \"n_cross_validations\": 2,\n",
"}\n",
" iteration_timeout_minutes=10,\n",
" iterations=15,\n",
")\n",
"\n",
"hts_parameters = HTSTrainParameters(\n",
" automl_settings=automl_settings,\n",
@@ -336,15 +361,25 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Parallel run step is leveraged to train the hierarchy. To configure the ParallelRunConfig you will need to determine the appropriate number of workers and nodes for your use case. The `process_count_per_node` is based off the number of cores of the compute VM. The node_count will determine the number of master nodes to use, increasing the node count will speed up the training process.\n",
"Parallel run step is leveraged to train multiple models at once. To configure the ParallelRunConfig you will need to determine the appropriate number of workers and nodes for your use case. The ``process_count_per_node`` is based off the number of cores of the compute VM. The node_count will determine the number of master nodes to use, increasing the node count will speed up the training process.\n",
"\n",
"* **experiment:** The experiment used for training.\n",
"* **train_data:** The tabular dataset to be used as input to the training run.\n",
"* **node_count:** The number of compute nodes to be used for running the user script. We recommend to start with 3 and increase the node_count if the training time is taking too long.\n",
"* **process_count_per_node:** Process count per node, we recommend 2:1 ratio for number of cores: number of processes per node. eg. If node has 16 cores then configure 8 or less process count per node or optimal performance.\n",
"* **train_pipeline_parameters:** The set of configuration parameters defined in the previous section. \n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **experiment** | The experiment used for training. |\n",
"| **train_data** | The file dataset to be used as input to the training run. |\n",
"| **node_count** | The number of compute nodes to be used for running the user script. We recommend to start with 3 and increase the node_count if the training time is taking too long. |\n",
"| **process_count_per_node** | Process count per node, we recommend 2:1 ratio for number of cores: number of processes per node. eg. If node has 16 cores then configure 8 or less process count per node for optimal performance. |\n",
"| **train_pipeline_parameters** | The set of configuration parameters defined in the previous section. |\n",
"| **run_invocation_timeout** | Maximum amount of time in seconds that the ``ParallelRunStep`` class is allowed. This is optional but provides customers with greater control on exit criteria. This must be greater than ``experiment_timeout_hours`` by at least 300 seconds. |\n",
"\n",
"Calling this method will create a new aggregated dataset which is generated dynamically on pipeline execution."
"Calling this method will create a new aggregated dataset which is generated dynamically on pipeline execution.\n",
"\n",
"**Note**: Total time taken for the **training step** in the pipeline to complete = $ \\frac{t}{ p \\times n } \\times ts $\n",
"where,\n",
"- $ t $ is time taken for training one partition (can be viewed in the training logs)\n",
"- $ p $ is ``process_count_per_node``\n",
"- $ n $ is ``node_count``\n",
"- $ ts $ is total number of partitions in time series based on ``partition_column_names``"
]
},
{
@@ -363,6 +398,7 @@
" node_count=2,\n",
" process_count_per_node=8,\n",
" train_pipeline_parameters=hts_parameters,\n",
" run_invocation_timeout=3900,\n",
")"
]
},
@@ -507,19 +543,24 @@
"source": [
"## 5.0 Forecasting\n",
"For hierarchical forecasting we need to provide the HTSInferenceParameters object.\n",
"#### HTSInferenceParameters arguments\n",
"* **hierarchy_forecast_level:** The default level of the hierarchy to produce prediction/forecast on.\n",
"* **allocation_method:** \\[Optional] The disaggregation method to use if the hierarchy forecast level specified is below the define hierarchy training level. <br><i>(average historical proportions) 'average_historical_proportions'</i><br><i>(proportions of the historical averages) 'proportions_of_historical_average'</i>\n",
"#### ``HTSInferenceParameters`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **hierarchy_forecast_level:** | The default level of the hierarchy to produce prediction/forecast on. |\n",
"| **allocation_method:** | \\[Optional] The disaggregation method to use if the hierarchy forecast level specified is below the define hierarchy training level. <br><i>(average historical proportions) 'average_historical_proportions'</i><br><i>(proportions of the historical averages) 'proportions_of_historical_average'</i> |\n",
"\n",
"#### get_many_models_batch_inference_steps arguments\n",
"* **experiment:** The experiment used for inference run.\n",
"* **inference_data:** The data to use for inferencing. It should be the same schema as used for training.\n",
"* **compute_target:** The compute target that runs the inference pipeline.\n",
"* **node_count:** The number of compute nodes to be used for running the user script. We recommend to start with the number of cores per node (varies by compute sku).\n",
"* **process_count_per_node:** The number of processes per node.\n",
"* **train_run_id:** \\[Optional] The run id of the hierarchy training, by default it is the latest successful training hts run in the experiment.\n",
"* **train_experiment_name:** \\[Optional] The train experiment that contains the train pipeline. This one is only needed when the train pipeline is not in the same experiement as the inference pipeline.\n",
"* **process_count_per_node:** \\[Optional] The number of processes per node, by default it's 4."
"#### ``get_many_models_batch_inference_steps`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **experiment** | The experiment used for inference run. |\n",
"| **inference_data** | The data to use for inferencing. It should be the same schema as used for training.\n",
"| **compute_target** | The compute target that runs the inference pipeline. |\n",
"| **node_count** | The number of compute nodes to be used for running the user script. We recommend to start with the number of cores per node (varies by compute sku). |\n",
"| **process_count_per_node** | \\[Optional] The number of processes per node. By default it's 2 (should be at most half of the number of cores in a single node of the compute cluster that will be used for the experiment).\n",
"| **inference_pipeline_parameters** | \\[Optional] The ``HTSInferenceParameters`` object defined above. |\n",
"| **train_run_id** | \\[Optional] The run id of the **training pipeline**. By default it is the latest successful training pipeline run in the experiment. |\n",
"| **train_experiment_name** | \\[Optional] The train experiment that contains the train pipeline. This one is only needed when the train pipeline is not in the same experiement as the inference pipeline. |\n",
"| **run_invocation_timeout** | \\[Optional] Maximum amount of time in seconds that the ``ParallelRunStep`` class is allowed. This is optional but provides customers with greater control on exit criteria. |"
]
},
{
@@ -618,9 +659,9 @@
"automated-machine-learning"
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
@@ -632,7 +673,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.8.10"
}
},
"nbformat": 4,

View File

@@ -1,4 +0,0 @@
name: auto-ml-forecasting-hierarchical-timeseries
dependencies:
- pip:
- azureml-sdk

View File

@@ -16,6 +16,13 @@
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-hierarchical-timeseries/auto-ml-forecasting-hierarchical-timeseries.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=\"red\" size=\"5\"><strong>!Important!</strong> </br>This notebook is outdated and is not supported by the AutoML Team. Please use the supported version ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1k_demand_forecasting_with_pipeline_components/automl-forecasting-demand-many-models-in-pipeline)).</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -40,7 +47,7 @@
"metadata": {},
"source": [
"### Prerequisites\n",
"You'll need to create a compute Instance by following the instructions in the [EnvironmentSetup.md](../Setup_Resources/EnvironmentSetup.md)."
"You'll need to create a compute Instance by following [these](https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-create-manage-compute-instance?tabs=python) instructions."
]
},
{
@@ -306,7 +313,7 @@
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"\n",
"# Name your cluster\n",
"compute_name = \"mm-compute\"\n",
"compute_name = \"mm-compute-v1\"\n",
"\n",
"\n",
"if compute_name in ws.compute_targets:\n",
@@ -316,7 +323,7 @@
"else:\n",
" print(\"Creating a new compute target...\")\n",
" provisioning_config = AmlCompute.provisioning_configuration(\n",
" vm_size=\"STANDARD_D16S_V3\", max_nodes=20\n",
" vm_size=\"STANDARD_D14_V2\", max_nodes=20\n",
" )\n",
" # Create the compute target\n",
" compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)\n",
@@ -359,7 +366,7 @@
"USE_CURATED_ENV = True\n",
"if USE_CURATED_ENV:\n",
" curated_environment = Environment.get(\n",
" workspace=ws, name=\"AzureML-sklearn-0.24-ubuntu18.04-py37-cpu\"\n",
" workspace=ws, name=\"AzureML-sklearn-1.5\"\n",
" )\n",
" aml_run_config.environment = curated_environment\n",
"else:\n",
@@ -379,8 +386,17 @@
"source": [
"### Set up training parameters\n",
"\n",
"This dictionary defines the AutoML and many models settings. For this forecasting task we need to define several settings inncluding the name of the time column, the maximum forecast horizon, and the partition column name definition.\n",
"We need to provide ``ForecastingParameters``, ``AutoMLConfig`` and ``ManyModelsTrainParameters`` objects. For the forecasting task we also need to define several settings including the name of the time column, the maximum forecast horizon, and the partition column name(s) definition.\n",
"\n",
"#### ``ForecastingParameters`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
"| **time_column_name** | The name of your time column. |\n",
"| **time_series_id_column_names** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
"| **cv_step_size** | Number of periods between two consecutive cross-validation folds. The default value is \\\"auto\\\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value. |\n",
"\n",
"#### ``AutoMLConfig`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **task** | forecasting |\n",
@@ -388,16 +404,19 @@
"| **blocked_models** | Blocked models won't be used by AutoML. |\n",
"| **iteration_timeout_minutes** | Maximum amount of time in minutes that the model can train. This is optional but provides customers with greater control on exit criteria. |\n",
"| **iterations** | Number of models to train. This is optional but provides customers with greater control on exit criteria. |\n",
"| **experiment_timeout_hours** | Maximum amount of time in hours that the experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |\n",
"| **experiment_timeout_hours** | Maximum amount of time in hours that each experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. **It does not control the overall timeout for the pipeline run, instead controls the timeout for each training run per partitioned time series.** |\n",
"| **label_column_name** | The name of the label column. |\n",
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
"| **n_cross_validations** | Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"| **enable_early_stopping** | Flag to enable early termination if the score is not improving in the short term. |\n",
"| **time_column_name** | The name of your time column. |\n",
"| **n_cross_validations** | Number of cross validation splits. The default value is \\\"auto\\\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"| **enable_early_stopping** | Flag to enable early termination if the primary metric is no longer improving. |\n",
"| **enable_engineered_explanations** | Engineered feature explanations will be downloaded if enable_engineered_explanations flag is set to True. By default it is set to False to save storage space. |\n",
"| **time_series_id_column_names** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
"| **track_child_runs** | Flag to disable tracking of child runs. Only best run is tracked if the flag is set to False (this includes the model and metrics of the run). |\n",
"| **pipeline_fetch_max_batch_size** | Determines how many pipelines (training algorithms) to fetch at a time for training, this helps reduce throttling when training at large scale. |\n",
"\n",
"\n",
"#### ``ManyModelsTrainParameters`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **automl_settings** | The ``AutoMLConfig`` object defined above. |\n",
"| **partition_column_names** | The names of columns used to group your models. For timeseries, the groups must not split up individual time-series. That is, each group must contain one or more whole time-series. |"
]
},
@@ -414,22 +433,29 @@
"from azureml.train.automl.runtime._many_models.many_models_parameters import (\n",
" ManyModelsTrainParameters,\n",
")\n",
"from azureml.automl.core.forecasting_parameters import ForecastingParameters\n",
"from azureml.train.automl.automlconfig import AutoMLConfig\n",
"\n",
"partition_column_names = [\"Store\", \"Brand\"]\n",
"automl_settings = {\n",
" \"task\": \"forecasting\",\n",
" \"primary_metric\": \"normalized_root_mean_squared_error\",\n",
" \"iteration_timeout_minutes\": 10, # This needs to be changed based on the dataset. We ask customer to explore how long training is taking before settings this value\n",
" \"iterations\": 15,\n",
" \"experiment_timeout_hours\": 0.25,\n",
" \"label_column_name\": \"Quantity\",\n",
" \"n_cross_validations\": 3,\n",
" \"time_column_name\": \"WeekStarting\",\n",
" \"drop_column_names\": \"Revenue\",\n",
" \"forecast_horizon\": 6,\n",
" \"time_series_id_column_names\": partition_column_names,\n",
" \"track_child_runs\": False,\n",
"}\n",
"\n",
"forecasting_parameters = ForecastingParameters(\n",
" time_column_name=\"WeekStarting\",\n",
" forecast_horizon=6,\n",
" time_series_id_column_names=partition_column_names,\n",
" cv_step_size=\"auto\",\n",
")\n",
"\n",
"automl_settings = AutoMLConfig(\n",
" task=\"forecasting\",\n",
" primary_metric=\"normalized_root_mean_squared_error\",\n",
" iteration_timeout_minutes=10,\n",
" iterations=15,\n",
" experiment_timeout_hours=0.25,\n",
" label_column_name=\"Quantity\",\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" track_child_runs=False,\n",
" forecasting_parameters=forecasting_parameters,\n",
")\n",
"\n",
"mm_paramters = ManyModelsTrainParameters(\n",
" automl_settings=automl_settings, partition_column_names=partition_column_names\n",
@@ -449,7 +475,9 @@
"\n",
"Reuse of previous results (``allow_reuse``) is key when using pipelines in a collaborative environment since eliminating unnecessary reruns offers agility. Reuse is the default behavior when the ``script_name``, ``inputs``, and the parameters of a step remain the same. When reuse is allowed, results from the previous run are immediately sent to the next step. If ``allow_reuse`` is set to False, a new run will always be generated for this step during pipeline execution.\n",
"\n",
"> Note that we only support partitioned FileDataset and TabularDataset without partition when using such output as input."
"> Note that we only support partitioned FileDataset and TabularDataset without partition when using such output as input.\n",
"\n",
"> Note that we **drop column** \"Revenue\" from the dataset in this step to avoid information leak as \"Quantity\" = \"Revenue\" / \"Price\". **Please modify the logic based on your data**."
]
},
{
@@ -487,17 +515,25 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Parallel run step is leveraged to train multiple models at once. To configure the ParallelRunConfig you will need to determine the appropriate number of workers and nodes for your use case. The process_count_per_node is based off the number of cores of the compute VM. The node_count will determine the number of master nodes to use, increasing the node count will speed up the training process.\n",
"Parallel run step is leveraged to train multiple models at once. To configure the ParallelRunConfig you will need to determine the appropriate number of workers and nodes for your use case. The ``process_count_per_node`` is based off the number of cores of the compute VM. The node_count will determine the number of master nodes to use, increasing the node count will speed up the training process.\n",
"\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **experiment** | The experiment used for training. |\n",
"| **train_data** | The file dataset to be used as input to the training run. |\n",
"| **node_count** | The number of compute nodes to be used for running the user script. We recommend to start with 3 and increase the node_count if the training time is taking too long. |\n",
"| **process_count_per_node** | Process count per node, we recommend 2:1 ratio for number of cores: number of processes per node. eg. If node has 16 cores then configure 8 or less process count per node or optimal performance. |\n",
"| **process_count_per_node** | Process count per node, we recommend 2:1 ratio for number of cores: number of processes per node. eg. If node has 16 cores then configure 8 or less process count per node for optimal performance. |\n",
"| **train_pipeline_parameters** | The set of configuration parameters defined in the previous section. |\n",
"| **run_invocation_timeout** | Maximum amount of time in seconds that the ``ParallelRunStep`` class is allowed. This is optional but provides customers with greater control on exit criteria. This must be greater than ``experiment_timeout_hours`` by at least 300 seconds. |\n",
"\n",
"Calling this method will create a new aggregated dataset which is generated dynamically on pipeline execution."
"Calling this method will create a new aggregated dataset which is generated dynamically on pipeline execution.\n",
"\n",
"**Note**: Total time taken for the **training step** in the pipeline to complete = $ \\frac{t}{ p \\times n } \\times ts $\n",
"where,\n",
"- $ t $ is time taken for training one partition (can be viewed in the training logs)\n",
"- $ p $ is ``process_count_per_node``\n",
"- $ n $ is ``node_count``\n",
"- $ ts $ is total number of partitions in time series based on ``partition_column_names``"
]
},
{
@@ -515,7 +551,7 @@
" compute_target=compute_target,\n",
" node_count=2,\n",
" process_count_per_node=8,\n",
" run_invocation_timeout=920,\n",
" run_invocation_timeout=1200,\n",
" train_pipeline_parameters=mm_paramters,\n",
")"
]
@@ -596,7 +632,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7.2 Schedule the pipeline\n",
"### 5.2 Schedule the pipeline\n",
"You can also [schedule the pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-schedule-pipelines) to run on a time-based or change-based schedule. This could be used to automatically retrain models every month or based on another trigger such as data drift."
]
},
@@ -652,25 +688,31 @@
"source": [
"For many models we need to provide the ManyModelsInferenceParameters object.\n",
"\n",
"#### ManyModelsInferenceParameters arguments\n",
"#### ``ManyModelsInferenceParameters`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **partition_column_names** | List of column names that identifies groups. |\n",
"| **partition_column_names** | List of column names that identifies groups. |\n",
"| **target_column_name** | \\[Optional] Column name only if the inference dataset has the target. |\n",
"| **time_column_name** | \\[Optional] Column name only if it is timeseries. |\n",
"| **many_models_run_id** | \\[Optional] Many models run id where models were trained. |\n",
"| **time_column_name** | \\[Optional] Time column name only if it is timeseries. |\n",
"| **inference_type** | \\[Optional] Which inference method to use on the model. Possible values are 'forecast', 'predict_proba', and 'predict'. |\n",
"| **forecast_mode** | \\[Optional] The type of forecast to be used, either 'rolling' or 'recursive'; defaults to 'recursive'. |\n",
"| **step** | \\[Optional] Number of periods to advance the forecasting window in each iteration **(for rolling forecast only)**; defaults to 1. |\n",
"\n",
"#### get_many_models_batch_inference_steps arguments\n",
"#### ``get_many_models_batch_inference_steps`` arguments\n",
"| Property | Description|\n",
"| :--------------- | :------------------- |\n",
"| **experiment** | The experiment used for inference run. |\n",
"| **inference_data** | The data to use for inferencing. It should be the same schema as used for training.\n",
"| **compute_target** The compute target that runs the inference pipeline.|\n",
"| **compute_target** | The compute target that runs the inference pipeline. |\n",
"| **node_count** | The number of compute nodes to be used for running the user script. We recommend to start with the number of cores per node (varies by compute sku). |\n",
"| **process_count_per_node** The number of processes per node.\n",
"| **train_run_id** | \\[Optional] The run id of the hierarchy training, by default it is the latest successful training many model run in the experiment. |\n",
"| **process_count_per_node** | \\[Optional] The number of processes per node. By default it's 2 (should be at most half of the number of cores in a single node of the compute cluster that will be used for the experiment).\n",
"| **inference_pipeline_parameters** | \\[Optional] The ``ManyModelsInferenceParameters`` object defined above. |\n",
"| **append_row_file_name** | \\[Optional] The name of the output file (optional, default value is 'parallel_run_step.txt'). Supports 'txt' and 'csv' file extension. A 'txt' file extension generates the output in 'txt' format with space as separator without column names. A 'csv' file extension generates the output in 'csv' format with comma as separator and with column names. |\n",
"| **train_run_id** | \\[Optional] The run id of the **training pipeline**. By default it is the latest successful training pipeline run in the experiment. |\n",
"| **train_experiment_name** | \\[Optional] The train experiment that contains the train pipeline. This one is only needed when the train pipeline is not in the same experiement as the inference pipeline. |\n",
"| **process_count_per_node** | \\[Optional] The number of processes per node, by default it's 4. |"
"| **run_invocation_timeout** | \\[Optional] Maximum amount of time in seconds that the ``ParallelRunStep`` class is allowed. This is optional but provides customers with greater control on exit criteria. |\n",
"| **output_datastore** | \\[Optional] The ``Datastore`` or ``OutputDatasetConfig`` to be used for output. If specified any pipeline output will be written to that location. If unspecified the default datastore will be used. |\n",
"| **arguments** | \\[Optional] Arguments to be passed to inference script. Possible argument is '--forecast_quantiles' followed by quantile values. |"
]
},
{
@@ -690,6 +732,8 @@
" target_column_name=\"Quantity\",\n",
")\n",
"\n",
"output_file_name = \"parallel_run_step.csv\"\n",
"\n",
"inference_steps = AutoMLPipelineBuilder.get_many_models_batch_inference_steps(\n",
" experiment=experiment,\n",
" inference_data=inference_ds_small,\n",
@@ -701,6 +745,8 @@
" train_run_id=training_run.id,\n",
" train_experiment_name=training_run.experiment.name,\n",
" inference_pipeline_parameters=mm_parameters,\n",
" append_row_file_name=output_file_name,\n",
" arguments=[\"--forecast_quantiles\", 0.1, 0.9],\n",
")"
]
},
@@ -735,7 +781,7 @@
"\n",
"The following code snippet:\n",
"1. Downloads the contents of the output folder that is passed in the parallel run step \n",
"2. Reads the parallel_run_step.txt file that has the predictions as pandas dataframe and \n",
"2. Reads the output file that has the predictions as pandas dataframe and \n",
"3. Displays the top 10 rows of the predictions"
]
},
@@ -750,19 +796,9 @@
"forecasting_results_name = \"forecasting_results\"\n",
"forecasting_output_name = \"many_models_inference_output\"\n",
"forecast_file = get_output_from_mm_pipeline(\n",
" inference_run, forecasting_results_name, forecasting_output_name\n",
" inference_run, forecasting_results_name, forecasting_output_name, output_file_name\n",
")\n",
"df = pd.read_csv(forecast_file, delimiter=\" \", header=None)\n",
"df.columns = [\n",
" \"Week Starting\",\n",
" \"Store\",\n",
" \"Brand\",\n",
" \"Quantity\",\n",
" \"Advert\",\n",
" \"Price\",\n",
" \"Revenue\",\n",
" \"Predicted\",\n",
"]\n",
"df = pd.read_csv(forecast_file)\n",
"print(\n",
" \"Prediction has \", df.shape[0], \" rows. Here the first 10 rows are being displayed.\"\n",
")\n",
@@ -835,9 +871,9 @@
"automated-machine-learning"
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
@@ -849,7 +885,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.8.10"
},
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,

View File

@@ -1,4 +0,0 @@
name: auto-ml-forecasting-many-models
dependencies:
- pip:
- azureml-sdk

View File

@@ -11,6 +11,12 @@ def main(args):
    dataset = run_context.input_datasets["train_10_models"]
    df = dataset.to_pandas_dataframe()
    # Drop the column "Revenue" from the dataset to avoid information leak as
    # "Quantity" = "Revenue" / "Price". Please modify the logic based on your data.
    drop_column_name = "Revenue"
    if drop_column_name in df.columns:
        df.drop(drop_column_name, axis=1, inplace=True)
    # Apply any data pre-processing techniques here
    df.to_parquet(output / "data_prepared_result.parquet", compression=None)

View File

@@ -16,6 +16,13 @@
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-orange-juice-sales/auto-ml-forecasting-orange-juice-sales.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=\"red\" size=\"5\"><strong>!Important!</strong> </br>This notebook is outdated and is not supported by the AutoML Team. Please use the supported version ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/automl-standalone-jobs/automl-forecasting-orange-juice-sales)).</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -40,7 +47,7 @@
"## Introduction<a id=\"introduction\"></a>\n",
"In this example, we use AutoML to train, select, and operationalize a time-series forecasting model for multiple time-series.\n",
"\n",
"Make sure you have executed the [configuration notebook](../../../configuration.ipynb) before running this notebook.\n",
"Make sure you have executed the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) before running this notebook.\n",
"\n",
"The examples in the follow code samples use the University of Chicago's Dominick's Finer Foods dataset to forecast orange juice sales. Dominick's was a grocery chain in the Chicago metropolitan area."
]
@@ -242,7 +249,9 @@
" time_series_id_column_names, group_keys=False\n",
" )\n",
" df_head = df_grouped.apply(lambda dfg: dfg.iloc[:-n])\n",
" df_head.reset_index(inplace=True, drop=True)\n",
" df_tail = df_grouped.apply(lambda dfg: dfg.iloc[-n:])\n",
" df_tail.reset_index(inplace=True, drop=True)\n",
" return df_head, df_tail\n",
"\n",
"\n",
@@ -368,7 +377,8 @@
"|**time_column_name**|The name of your time column.|\n",
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
"|**time_series_id_column_names**|The column names used to uniquely identify the time series in data that has multiple rows with the same timestamp. If the time series identifiers are not defined, the data set is assumed to be one time series.|\n",
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n",
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value."
]
},
{
@@ -390,7 +400,7 @@
"In the first case, AutoML loops over all time-series in your dataset and trains one model (e.g. AutoArima or Prophet, as the case may be) for each series. This can result in long runtimes to train these models if there are a lot of series in the data. One way to mitigate this problem is to fit models for different series in parallel if you have multiple compute cores available. To enable this behavior, set the `max_cores_per_iteration` parameter in your AutoMLConfig as shown in the example in the next cell. \n",
"\n",
"\n",
"Finally, a note about the cross-validation (CV) procedure for time-series data. AutoML uses out-of-sample error estimates to select a best pipeline/model, so it is important that the CV fold splitting is done correctly. Time-series can violate the basic statistical assumptions of the canonical K-Fold CV strategy, so AutoML implements a [rolling origin validation](https://robjhyndman.com/hyndsight/tscv/) procedure to create CV folds for time-series data. To use this procedure, you just need to specify the desired number of CV folds in the AutoMLConfig object. It is also possible to bypass CV and use your own validation set by setting the *validation_data* parameter of AutoMLConfig.\n",
"Finally, a note about the cross-validation (CV) procedure for time-series data. AutoML uses out-of-sample error estimates to select a best pipeline/model, so it is important that the CV fold splitting is done correctly. Time-series can violate the basic statistical assumptions of the canonical K-Fold CV strategy, so AutoML implements a [rolling origin validation](https://robjhyndman.com/hyndsight/tscv/) procedure to create CV folds for time-series data. To use this procedure, you could specify the desired number of CV folds and the number of periods between two consecutive folds in the AutoMLConfig object, or AutoMl could set them automatically if you don't specify them. It is also possible to bypass CV and use your own validation set by setting the *validation_data* parameter of AutoMLConfig.\n",
"\n",
"Here is a summary of AutoMLConfig parameters used for training the OJ model:\n",
"\n",
@@ -403,7 +413,7 @@
"|**training_data**|Input dataset, containing both features and label column.|\n",
"|**label_column_name**|The name of the label column.|\n",
"|**compute_target**|The remote compute for training.|\n",
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection|\n",
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value.\n",
"|**enable_voting_ensemble**|Allow AutoML to create a Voting ensemble of the best performing models|\n",
"|**enable_stack_ensemble**|Allow AutoML to create a Stack ensemble of the best performing models|\n",
"|**debug_log**|Log file path for writing debugging information|\n",
@@ -424,6 +434,7 @@
" forecast_horizon=n_test_periods,\n",
" time_series_id_column_names=time_series_id_column_names,\n",
" freq=\"W-THU\", # Set the forecast frequency to be weekly (start on each Thursday)\n",
" cv_step_size=\"auto\",\n",
")\n",
"\n",
"automl_config = AutoMLConfig(\n",
@@ -436,7 +447,7 @@
" compute_target=compute_target,\n",
" enable_early_stopping=True,\n",
" featurization=featurization_config,\n",
" n_cross_validations=3,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" verbosity=logging.INFO,\n",
" max_cores_per_iteration=-1,\n",
" forecasting_parameters=forecasting_parameters,\n",
@@ -713,7 +724,7 @@
" description=\"Automl forecasting sample service\",\n",
")\n",
"\n",
"aci_service_name = \"automl-oj-forecast-01\"\n",
"aci_service_name = \"automl-oj-forecast-03\"\n",
"print(aci_service_name)\n",
"aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)\n",
"aci_service.wait_for_deployment(True)\n",
@@ -790,7 +801,7 @@
"metadata": {},
"outputs": [],
"source": [
"serv = Webservice(ws, \"automl-oj-forecast-01\")\n",
"serv = Webservice(ws, \"automl-oj-forecast-03\")\n",
"serv.delete() # don't do it accidentally"
]
}
@@ -819,9 +830,9 @@
"friendly_name": "Forecasting orange juice sales with deployment",
"index_order": 1,
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
@@ -833,12 +844,17 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.10"
},
"tags": [
"None"
],
"task": "Forecasting"
"task": "Forecasting",
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,
"nbformat_minor": 4

View File

@@ -1,4 +0,0 @@
name: auto-ml-forecasting-orange-juice-sales
dependencies:
- pip:
- azureml-sdk

View File

@@ -6,7 +6,7 @@ compute instance.
import argparse
from azureml.core import Dataset, Run
from sklearn.externals import joblib
import joblib
from pandas.tseries.frequencies import to_offset
parser = argparse.ArgumentParser()

View File

@@ -1,5 +1,37 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-pipelines/auto-ml-forecasting-pipelines.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=\"red\" size=\"5\"><strong>!Important!</strong> </br>This notebook is outdated and is not supported by the AutoML Team. Please use the supported version ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1h_automl_in_pipeline/automl-forecasting-in-pipeline)).</font>\n",
"</br>\n",
"</br>\n",
"<font color=\"red\" size=\"5\">\n",
"For examples illustrating how to build pipelines with components, please use the following links:</font>\n",
"<ul>\n",
" <li><a href=\"https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1k_demand_forecasting_with_pipeline_components/automl-forecasting-demand-many-models-in-pipeline\">Many Models</a></li>\n",
" <li><a href=\"https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1k_demand_forecasting_with_pipeline_components/automl-forecasting-demand-hierarchical-timeseries-in-pipeline\">Hierarchical Time Series</a></li>\n",
" <li><a href=\"https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/automl-standalone-jobs/automl-forecasting-distributed-tcn\">Distributed TCN</a></li>\n",
"</ul>"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -13,7 +45,7 @@
"source": [
"## Introduction\n",
"\n",
"In this notebook, we demonstrate how to use piplines to train and inference on AutoML Forecasting model. Two pipelines will be created: one for training AutoML model, and the other is for inference on AutoML model. We'll also demonstrate how to schedule the inference pipeline so you can get inference results periodically (with refreshed test dataset). Make sure you have executed the configuration notebook before running this notebook. In this notebook you will learn how to:\n",
"In this notebook, we demonstrate how to use piplines to train and inference on AutoML Forecasting model. Two pipelines will be created: one for training AutoML model, and the other is for inference on AutoML model. We'll also demonstrate how to schedule the inference pipeline so you can get inference results periodically (with refreshed test dataset). Make sure you have executed the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) before running this notebook. In this notebook you will learn how to:\n",
"\n",
"- Configure AutoML using AutoMLConfig for forecasting tasks using pipeline AutoMLSteps.\n",
"- Create and register an AutoML model using AzureML pipeline.\n",
@@ -39,7 +71,6 @@
"import logging\n",
"import os\n",
"\n",
"from matplotlib import pyplot as plt\n",
"import pandas as pd\n",
"\n",
"import azureml.core\n",
@@ -292,7 +323,8 @@
"|**time_column_name**|The name of your time column.|\n",
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
"|**time_series_id_column_names**|The column names used to uniquely identify the time series in data that has multiple rows with the same timestamp. If the time series identifiers are not defined, the data set is assumed to be one time series.|\n",
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n",
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value."
]
},
{
@@ -307,7 +339,8 @@
" time_column_name=time_column_name,\n",
" forecast_horizon=n_test_periods,\n",
" time_series_id_column_names=time_series_id_column_names,\n",
" freq=\"W-THU\", # Set the forecast frequency to be weekly (start on each Thursday)\n",
" freq=\"W-THU\", # Set the forecast frequency to be weekly (start on each Thursday),\n",
" cv_step_size=\"auto\",\n",
")\n",
"\n",
"automl_config = AutoMLConfig(\n",
@@ -319,7 +352,7 @@
" label_column_name=target_column_name,\n",
" compute_target=compute_target,\n",
" enable_early_stopping=True,\n",
" n_cross_validations=5,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" verbosity=logging.INFO,\n",
" max_cores_per_iteration=-1,\n",
" forecasting_parameters=forecasting_parameters,\n",
@@ -423,8 +456,6 @@
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.core import PipelineData\n",
"\n",
"# The model name with which to register the trained model in the workspace.\n",
"model_name_str = \"ojmodel\"\n",
"model_name = PipelineParameter(\"model_name\", default_value=model_name_str)\n",
@@ -553,40 +584,15 @@
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Model\n",
"from azureml.train.automl.run import AutoMLRun\n",
"\n",
"model = Model(ws, model_name_str)\n",
"download_path = model.download(model_name_str, exist_ok=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After all the files are downloaded, we can generate the run config for inference runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Environment, RunConfiguration\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"for step in training_pipeline_run.get_steps():\n",
" if step.properties.get(\"StepType\") == \"AutoMLStep\":\n",
" automl_run = AutoMLRun(experiment, step.id)\n",
" break\n",
"\n",
"env_file = os.path.join(download_path, \"conda_env_v_1_0_0.yml\")\n",
"inference_env = Environment(\"oj-inference-env\")\n",
"inference_env.python.conda_dependencies = CondaDependencies(\n",
" conda_dependencies_file_path=env_file\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Optional] The enviroment can also be assessed from the training run using `get_environment()` API."
"best_run = automl_run.get_best_child()\n",
"inference_env = best_run.get_environment()"
]
},
{
@@ -797,9 +803,9 @@
"friendly_name": "Forecasting orange juice sales with deployment",
"index_order": 1,
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
@@ -811,12 +817,17 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.5"
},
"tags": [
"None"
],
"task": "Forecasting"
"task": "Forecasting",
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,
"nbformat_minor": 4

View File

@@ -1,4 +0,0 @@
name: auto-ml-forecasting-pipelines
dependencies:
- pip:
- azureml-sdk

View File

@@ -6,7 +6,7 @@ import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset
from sklearn.externals import joblib
import joblib
from sklearn.metrics import mean_absolute_error, mean_squared_error
from azureml.data.dataset_factory import TabularDatasetFactory
@@ -30,7 +30,7 @@ def infer_forecasting_dataset_tcn(
run = Run.get_context()
registered_train = TabularDatasetFactory.register_pandas_dataframe(
TabularDatasetFactory.register_pandas_dataframe(
df_all,
target=(
run.experiment.workspace.get_default_datastore(),

View File

@@ -20,7 +20,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook we will explore the univaraite time-series data to determine the settings for an automated ML experiment. We will follow the thought process depicted in the following diagram:<br/>\n",
"<font color=\"red\" size=\"5\"><strong>!Important!</strong> </br>This notebook is outdated and is not supported by the AutoML Team. Please use the supported version ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/automl-standalone-jobs/automl-forecasting-recipes-univariate)).</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook we will explore the univariate time-series data to determine the settings for an automated ML experiment. We will follow the thought process depicted in the following diagram:<br/>\n",
"![Forecasting after training](figures/univariate_settings_map_20210408.jpg)\n",
"\n",
"The objective is to answer the following questions:\n",
@@ -32,11 +39,11 @@
" </ul>\n",
" <li>Is the data stationary? </li>\n",
" <ul style=\"margin-top:-1px; list-style-type:none\"> \n",
" <li> Importance: In the absense of features that capture trend behavior, ML models (regression and tree based) are not well equiped to predict stochastic trends. Working with stationary data solves this problem. </li>\n",
" <li> Importance: In the absence of features that capture trend behavior, ML models (regression and tree based) are not well equipped to predict stochastic trends. Working with stationary data solves this problem. </li>\n",
" </ul>\n",
" <li>Is there a detectable auto-regressive pattern in the stationary data? </li>\n",
" <ul style=\"margin-top:-1px; list-style-type:none\"> \n",
" <li> Importance: The accuracy of ML models can be improved if serial correlation is modeled by including lags of the dependent/target varaible as features. Including target lags in every experiment by default will result in a regression in accuracy scores if such setting is not warranted. </li>\n",
" <li> Importance: The accuracy of ML models can be improved if serial correlation is modeled by including lags of the dependent/target variable as features. Including target lags in every experiment by default will result in a regression in accuracy scores if such setting is not warranted. </li>\n",
" </ul>\n",
"</ol>\n",
"\n",
@@ -109,7 +116,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The graph plots the alcohol sales in the United States. Because the data is trending, it can be difficult to see cycles, seasonality or other interestng behaviors due to the scaling issues. For example, if there is a seasonal pattern, which we will discuss later, we cannot see them on the trending data. In such case, it is worth plotting the same data in first differences."
"The graph plots the alcohol sales in the United States. Because the data is trending, it can be difficult to see cycles, seasonality or other interesting behaviors due to the scaling issues. For example, if there is a seasonal pattern, which we will discuss later, we cannot see them on the trending data. In such case, it is worth plotting the same data in first differences."
]
},
{
@@ -335,8 +342,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3 Check if there is a clear autoregressive pattern\n",
"We need to determine if we should include lags of the target variable as features in order to improve forecast accuracy. To do this, we will examine the ACF and partial ACF (PACF) plots of the stationary series. In our case, it is a series in first diffrences.\n",
"# 3 Check if there is a clear auto-regressive pattern\n",
"We need to determine if we should include lags of the target variable as features in order to improve forecast accuracy. To do this, we will examine the ACF and partial ACF (PACF) plots of the stationary series. In our case, it is a series in first differences.\n",
"\n",
"<ul>\n",
" <li> Question: What is an Auto-regressive pattern? What are we looking for? </li>\n",
@@ -418,11 +425,11 @@
" </li>\n",
" where $\\sigma_{xzy}$ is the covariance between two random variables $X$ and $Z$; $\\sigma_x$ and $\\sigma_z$ is the variance for $X$ and $Z$, respectively. The correlation coefficient measures the strength of linear relationship between two random variables. This metric can take any value from -1 to 1. <li/>\n",
" <br/>\n",
" <li> The auto-correlation coefficient $\\rho_{Y_{t} Y_{t-k}}$ is the time series equivalent of the correlation coefficient, except instead of measuring linear association between two random variables $X$ and $Z$, it measures the strength of a linear relationship between a random variable $Y_t$ and its lag $Y_{t-k}$ for any positive interger value of $k$. </li> \n",
" <li> The auto-correlation coefficient $\\rho_{Y_{t} Y_{t-k}}$ is the time series equivalent of the correlation coefficient, except instead of measuring linear association between two random variables $X$ and $Z$, it measures the strength of a linear relationship between a random variable $Y_t$ and its lag $Y_{t-k}$ for any positive integer value of $k$. </li> \n",
" <br />\n",
" <li> To visualize the ACF for a particular lag, say lag 2, plot the second lag of a series $y_{t-2}$ on the x-axis, and plot the series itself $y_t$ on the y-axis. The autocorrelation coefficient is the slope of the best fitted regression line and can be interpreted as follows. A one unit increase in the lag of a variable one period ago leads to a $\\rho_{Y_{t} Y_{t-2}}$ units change in the variable in the current period. This interpreation can be applied to any lag. </li> \n",
" <li> To visualize the ACF for a particular lag, say lag 2, plot the second lag of a series $y_{t-2}$ on the x-axis, and plot the series itself $y_t$ on the y-axis. The autocorrelation coefficient is the slope of the best fitted regression line and can be interpreted as follows. A one unit increase in the lag of a variable one period ago leads to a $\\rho_{Y_{t} Y_{t-2}}$ units change in the variable in the current period. This interpretation can be applied to any lag. </li> \n",
" <br />\n",
" <li> In the interpretation posted above we need to be careful not to confuse the word \"leads\" with \"causes\" since these are not the same thing. We do not know the lagged value of the varaible causes it to change. Afterall, there are probably many other features that may explain the movement in $Y_t$. All we are trying to do in this section is to identify situations when the variable contains the strong auto-regressive components that needs to be included in the model to improve forecast accuracy. </li>\n",
" <li> In the interpretation posted above we need to be careful not to confuse the word \"leads\" with \"causes\" since these are not the same thing. We do not know the lagged value of the variable causes it to change. After all, there are probably many other features that may explain the movement in $Y_t$. All we are trying to do in this section is to identify situations when the variable contains the strong auto-regressive components that needs to be included in the model to improve forecast accuracy. </li>\n",
" </ul>\n",
"</ul>"
]
@@ -434,7 +441,7 @@
"<ul>\n",
" <li> Question: What is the PACF? </li>\n",
" <ul style=\"list-style-type:none;\">\n",
" <li> When describing the ACF we essentially running a regression between a partigular lag of a series, say, lag 4, and the series itself. What this implies is the regression coefficient for lag 4 captures the impact of everything that happens in lags 1, 2 and 3. In other words, if lag 1 is the most important lag and we exclude it from the regression, naturally, the regression model will assign the importance of the 1st lag to the 4th one. Partial auto-correlation function fixes this problem since it measures the contribution of each lag accounting for the information added by the intermediary lags. If we were to illustrate ACF and PACF for the fourth lag using the regression analogy, the difference is a follows: \n",
" <li> When describing the ACF we essentially running a regression between a particular lag of a series, say, lag 4, and the series itself. What this implies is the regression coefficient for lag 4 captures the impact of everything that happens in lags 1, 2 and 3. In other words, if lag 1 is the most important lag and we exclude it from the regression, naturally, the regression model will assign the importance of the 1st lag to the 4th one. Partial auto-correlation function fixes this problem since it measures the contribution of each lag accounting for the information added by the intermediary lags. If we were to illustrate ACF and PACF for the fourth lag using the regression analogy, the difference is a follows: \n",
" \\begin{align}\n",
" Y_{t} &= a_{0} + a_{4} Y_{t-4} + e_{t} \\\\\n",
" Y_{t} &= b_{0} + b_{1} Y_{t-1} + b_{2} Y_{t-2} + b_{3} Y_{t-3} + b_{4} Y_{t-4} + \\varepsilon_{t} \\\\\n",
@@ -442,7 +449,7 @@
" </li>\n",
" <br/>\n",
" <li>\n",
" Here, you can think of $a_4$ and $b_{4}$ as the auto- and partial auto-correlation coefficients for lag 4. Notice, in the second equation we explicitely accounting for the intermediate lags by adding them as regrerssors.\n",
" Here, you can think of $a_4$ and $b_{4}$ as the auto- and partial auto-correlation coefficients for lag 4. Notice, in the second equation we explicitly accounting for the intermediate lags by adding them as regressors.\n",
" </li>\n",
" </ul>\n",
"</ul>"
@@ -455,11 +462,11 @@
"<ul>\n",
" <li> Question: Auto-regressive pattern? What are we looking for? </li>\n",
" <ul style=\"list-style-type:none;\">\n",
" <li> We are looking for a classical profiles for an AR(p) process such as an exponential decay of an ACF and a the first $p$ significant lags of the PACF. Let's examine the ACF/PACF profiles of the same simulated AR(2) shown in Section 3, and check if the ACF/PACF explanation are refelcted in these plots. <li/>\n",
" <li> We are looking for a classical profiles for an AR(p) process such as an exponential decay of an ACF and a the first $p$ significant lags of the PACF. Let's examine the ACF/PACF profiles of the same simulated AR(2) shown in Section 3, and check if the ACF/PACF explanation are reflected in these plots. <li/>\n",
" <li><img src=\"figures/ACF_PACF_for_AR2.png\" class=\"img_class\">\n",
" <li> The autocorrelation coefficient for the 3rd lag is 0.6, which can be interpreted that a one unit increase in the value of the target varaible three periods ago leads to 0.6 units increase in the current period. However, the PACF plot shows that the partial autocorrealtion coefficient is zero (from a statistical point of view since it lies within the shaded region). This is happening because the 1st and 2nd lags are good predictors of the target variable. Ommiting these two lags from the regression results in the misleading conclusion that the third lag is a good prediciton. <li/>\n",
" <li> The autocorrelation coefficient for the 3rd lag is 0.6, which can be interpreted that a one unit increase in the value of the target variable three periods ago leads to 0.6 units increase in the current period. However, the PACF plot shows that the partial autocorrelation coefficient is zero (from a statistical point of view since it lies within the shaded region). This is happening because the 1st and 2nd lags are good predictors of the target variable. Omitting these two lags from the regression results in the misleading conclusion that the third lag is a good prediction. <li/>\n",
" <br/>\n",
" <li> This is why it is important to examine both the ACF and the PACF plots when tring to determine the auto regressive order for the variable in question. <li/>\n",
" <li> This is why it is important to examine both the ACF and the PACF plots when trying to determine the auto regressive order for the variable in question. <li/>\n",
" </ul>\n",
"</ul> "
]
@@ -471,10 +478,13 @@
"name": "vlbejan"
}
],
"kernel_info": {
"name": "python38-azureml"
},
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
@@ -486,7 +496,15 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.10"
},
"microsoft": {
"ms_spell_check": {
"ms_spell_check_language": "en"
}
},
"nteract": {
"version": "nteract-front-end@1.0.0"
}
},
"nbformat": 4,

View File

@@ -1,4 +0,0 @@
name: auto-ml-forecasting-univariate-recipe-experiment-settings
dependencies:
- pip:
- azureml-sdk

View File

@@ -16,6 +16,13 @@
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-recipes-univariate/2_run_experiment.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=\"red\" size=\"5\"><strong>!Important!</strong> </br>This notebook is outdated and is not supported by the AutoML Team. Please use the supported version ([link](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/automl-standalone-jobs/automl-forecasting-recipes-univariate)).</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -300,27 +307,14 @@
"df_train.to_csv(\"train.csv\", index=False)\n",
"df_test.to_csv(\"test.csv\", index=False)\n",
"\n",
"from azureml.data.dataset_factory import TabularDatasetFactory\n",
"\n",
"datastore = ws.get_default_datastore()\n",
"datastore.upload_files(\n",
" files=[\"./train.csv\"],\n",
" target_path=\"uni-recipe-dataset/tabular/\",\n",
" overwrite=True,\n",
" show_progress=True,\n",
"train_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
" df_train, target=(datastore, \"dataset/\"), name=\"train\"\n",
")\n",
"datastore.upload_files(\n",
" files=[\"./test.csv\"],\n",
" target_path=\"uni-recipe-dataset/tabular/\",\n",
" overwrite=True,\n",
" show_progress=True,\n",
")\n",
"\n",
"from azureml.core import Dataset\n",
"\n",
"train_dataset = Dataset.Tabular.from_delimited_files(\n",
" path=[(datastore, \"uni-recipe-dataset/tabular/train.csv\")]\n",
")\n",
"test_dataset = Dataset.Tabular.from_delimited_files(\n",
" path=[(datastore, \"uni-recipe-dataset/tabular/test.csv\")]\n",
"test_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
" df_test, target=(datastore, \"dataset/\"), name=\"test\"\n",
")\n",
"\n",
"# print the first 5 rows of the Dataset\n",
@@ -358,7 +352,8 @@
" enable_early_stopping=True,\n",
" training_data=train_dataset,\n",
" label_column_name=TARGET_COLNAME,\n",
" n_cross_validations=5,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" cv_step_size=\"auto\",\n",
" verbosity=logging.INFO,\n",
" max_cores_per_iteration=-1,\n",
" compute_target=compute_target,\n",
@@ -570,10 +565,13 @@
"name": "vlbejan"
}
],
"kernel_info": {
"name": "python3"
},
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
@@ -585,7 +583,20 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.10"
},
"microsoft": {
"ms_spell_check": {
"ms_spell_check_language": "en"
}
},
"nteract": {
"version": "nteract-front-end@1.0.0"
},
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,

View File

@@ -1,4 +0,0 @@
name: auto-ml-forecasting-univariate-recipe-run-experiment
dependencies:
- pip:
- azureml-sdk

View File

@@ -7,7 +7,7 @@ compute instance.
import argparse
from azureml.core import Dataset, Run
from azureml.automl.core.shared.constants import TimeSeriesInternal
from sklearn.externals import joblib
import joblib
parser = argparse.ArgumentParser()
parser.add_argument(

View File

@@ -1,5 +1,21 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/local-run-classification-credit-card-fraud/auto-ml-classification-credit-card-fraud-local.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -554,273 +570,6 @@
"automl_run.upload_file(\"outputs/scoring_explainer.pkl\", scoring_explainer_file_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Deploying the scoring and explainer models to a web service to Azure Kubernetes Service (AKS)\n",
"\n",
"We use the TreeScoringExplainer from azureml.interpret package to create the scoring explainer which will be used to compute the raw and engineered feature importances at the inference time. In the cell below, we register the AutoML model and the scoring explainer with the Model Management Service."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Register trained automl model present in the 'outputs' folder in the artifacts\n",
"original_model = automl_run.register_model(\n",
" model_name=\"automl_model\", model_path=\"outputs/model.pkl\"\n",
")\n",
"scoring_explainer_model = automl_run.register_model(\n",
" model_name=\"scoring_explainer\", model_path=\"outputs/scoring_explainer.pkl\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create the conda dependencies for setting up the service\n",
"\n",
"We need to download the conda dependencies using the automl_run object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.automl.core.shared import constants\n",
"from azureml.core.environment import Environment\n",
"\n",
"automl_run.download_file(constants.CONDA_ENV_FILE_PATH, \"myenv.yml\")\n",
"myenv = Environment.from_conda_specification(name=\"myenv\", file_path=\"myenv.yml\")\n",
"myenv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Write the Entry Script\n",
"Write the script that will be used to predict on your model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile score.py\n",
"import joblib\n",
"import pandas as pd\n",
"from azureml.core.model import Model\n",
"from azureml.train.automl.runtime.automl_explain_utilities import (\n",
" automl_setup_model_explanations,\n",
")\n",
"\n",
"\n",
"def init():\n",
" global automl_model\n",
" global scoring_explainer\n",
"\n",
" # Retrieve the path to the model file using the model name\n",
" # Assume original model is named original_prediction_model\n",
" automl_model_path = Model.get_model_path(\"automl_model\")\n",
" scoring_explainer_path = Model.get_model_path(\"scoring_explainer\")\n",
"\n",
" automl_model = joblib.load(automl_model_path)\n",
" scoring_explainer = joblib.load(scoring_explainer_path)\n",
"\n",
"\n",
"def run(raw_data):\n",
" data = pd.read_json(raw_data, orient=\"records\")\n",
" # Make prediction\n",
" predictions = automl_model.predict(data)\n",
" # Setup for inferencing explanations\n",
" automl_explainer_setup_obj = automl_setup_model_explanations(\n",
" automl_model, X_test=data, task=\"classification\"\n",
" )\n",
" # Retrieve model explanations for engineered explanations\n",
" engineered_local_importance_values = scoring_explainer.explain(\n",
" automl_explainer_setup_obj.X_test_transform\n",
" )\n",
" # Retrieve model explanations for raw explanations\n",
" raw_local_importance_values = scoring_explainer.explain(\n",
" automl_explainer_setup_obj.X_test_transform, get_raw=True\n",
" )\n",
" # You can return any data type as long as it is JSON-serializable\n",
" return {\n",
" \"predictions\": predictions.tolist(),\n",
" \"engineered_local_importance_values\": engineered_local_importance_values,\n",
" \"raw_local_importance_values\": raw_local_importance_values,\n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create the InferenceConfig \n",
"Create the inference config that will be used when deploying the model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.model import InferenceConfig\n",
"\n",
"inf_config = InferenceConfig(entry_script=\"score.py\", environment=myenv)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Provision the AKS Cluster\n",
"This is a one-time setup. You can reuse this cluster for multiple deployments after it has been created. If you delete the cluster or the resource group that contains it, you will have to recreate it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AksCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# Choose a name for your cluster.\n",
"aks_name = \"scoring-explain\"\n",
"\n",
"# Verify that cluster does not exist already\n",
"try:\n",
" aks_target = ComputeTarget(workspace=ws, name=aks_name)\n",
" print(\"Found existing cluster, use it.\")\n",
"except ComputeTargetException:\n",
" prov_config = AksCompute.provisioning_configuration(vm_size=\"STANDARD_D3_V2\")\n",
" aks_target = ComputeTarget.create(\n",
" workspace=ws, name=aks_name, provisioning_configuration=prov_config\n",
" )\n",
"aks_target.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Deploy web service to AKS"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set the web service configuration (using default here)\n",
"from azureml.core.webservice import AksWebservice\n",
"from azureml.core.model import Model\n",
"\n",
"aks_config = AksWebservice.deploy_configuration()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"aks_service_name = \"model-scoring-local-aks\"\n",
"\n",
"aks_service = Model.deploy(\n",
" workspace=ws,\n",
" name=aks_service_name,\n",
" models=[scoring_explainer_model, original_model],\n",
" inference_config=inf_config,\n",
" deployment_config=aks_config,\n",
" deployment_target=aks_target,\n",
")\n",
"\n",
"aks_service.wait_for_deployment(show_output=True)\n",
"print(aks_service.state)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### View the service logs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"aks_service.get_logs()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Consume the web service using the run method to do the scoring and explanation of scoring.\n",
"We test the web service by passing data. The run() method retrieves API keys behind the scenes to make sure that the call is authenticated."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Serialize the first row of the test data into json\n",
"X_test_json = X_test_df[:1].to_json(orient=\"records\")\n",
"print(X_test_json)\n",
"\n",
"# Call the service to get the predictions and the engineered and raw explanations\n",
"output = aks_service.run(X_test_json)\n",
"\n",
"# Print the predicted value\n",
"print(\"predictions:\\n{}\\n\".format(output[\"predictions\"]))\n",
"# Print the engineered feature importances for the predicted value\n",
"print(\n",
" \"engineered_local_importance_values:\\n{}\\n\".format(\n",
" output[\"engineered_local_importance_values\"]\n",
" )\n",
")\n",
"# Print the raw feature importances for the predicted value\n",
"print(\n",
" \"raw_local_importance_values:\\n{}\\n\".format(output[\"raw_local_importance_values\"])\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Clean up\n",
"Delete the service."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"aks_service.delete()"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -870,9 +619,9 @@
"friendly_name": "Classification of credit card fraudulent transactions using Automated ML",
"index_order": 5,
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {

View File

@@ -1,4 +0,0 @@
name: auto-ml-classification-credit-card-fraud-local
dependencies:
- pip:
- azureml-sdk

View File

@@ -1,5 +1,27 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/regression-explanation-featurization/auto-ml-regression-explanation-featurization.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -495,6 +517,30 @@
"#### Create conda configuration for model explanations experiment from automl_run object"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from azureml.core import Environment\n",
"\n",
"\n",
"def get_environment_safe(parent_run):\n",
" \"\"\"Get the environment from parent run\"\"\"\n",
" try:\n",
" return parent_run.get_environment()\n",
" except BaseException:\n",
" run_details = parent_run.get_details()\n",
" run_def = run_details.get(\"runDefinition\")\n",
" env = run_def.get(\"environment\")\n",
" if env is None:\n",
" raise\n",
" json.dump(env, open(\"azureml_environment.json\", \"w\"))\n",
" return Environment.load_from_directory(\".\")"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -502,8 +548,6 @@
"outputs": [],
"source": [
"from azureml.core.runconfig import RunConfiguration\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"import pkg_resources\n",
"\n",
"# create a new RunConfig object\n",
"conda_run_config = RunConfiguration(framework=\"python\")\n",
@@ -513,7 +557,7 @@
"conda_run_config.environment.docker.enabled = True\n",
"\n",
"# specify CondaDependencies obj\n",
"conda_run_config.environment = automl_run.get_environment()"
"conda_run_config.environment = get_environment_safe(automl_run)"
]
},
{
@@ -686,7 +730,7 @@
" description=\"Get local explanations for Machine test data\",\n",
")\n",
"\n",
"myenv = automl_run.get_environment()\n",
"myenv = get_environment_safe(automl_run)\n",
"inference_config = InferenceConfig(entry_script=\"score_explain.py\", environment=myenv)\n",
"\n",
"# Use configs and models generated above\n",
@@ -859,8 +903,8 @@
"outputs": [],
"source": [
"%matplotlib inline\n",
"test_pred = plt.scatter(y_test, y_pred_test, color=\"\")\n",
"test_test = plt.scatter(y_test, y_test, color=\"g\")\n",
"test_pred = plt.scatter(y_test, y_pred_test, c=[\"b\"])\n",
"test_test = plt.scatter(y_test, y_test, c=[\"g\"])\n",
"plt.legend(\n",
" (test_pred, test_test), (\"prediction\", \"truth\"), loc=\"upper left\", fontsize=8\n",
")\n",
@@ -895,9 +939,9 @@
"friendly_name": "Automated ML run with featurization and model explainability.",
"index_order": 5,
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
@@ -909,7 +953,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
"version": "3.8.7"
},
"tags": [
"featurization",

View File

@@ -1,4 +0,0 @@
name: auto-ml-regression-explanation-featurization
dependencies:
- pip:
- azureml-sdk

View File

@@ -1,5 +1,21 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/regression/auto-ml-regression.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -422,8 +438,8 @@
"outputs": [],
"source": [
"%matplotlib inline\n",
"test_pred = plt.scatter(y_test, y_pred_test, color=\"\")\n",
"test_test = plt.scatter(y_test, y_test, color=\"g\")\n",
"test_pred = plt.scatter(y_test, y_pred_test, c=[\"b\"])\n",
"test_test = plt.scatter(y_test, y_test, c=[\"g\"])\n",
"plt.legend(\n",
" (test_pred, test_test), (\"prediction\", \"truth\"), loc=\"upper left\", fontsize=8\n",
")\n",
@@ -449,9 +465,9 @@
"automated-machine-learning"
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {

View File

@@ -1,4 +0,0 @@
name: auto-ml-regression
dependencies:
- pip:
- azureml-sdk

View File

@@ -429,9 +429,9 @@
}
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {

View File

@@ -557,9 +557,9 @@
}
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {

View File

@@ -161,9 +161,9 @@
}
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {

View File

@@ -215,9 +215,9 @@
}
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {

View File

@@ -482,9 +482,9 @@
}
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {

View File

@@ -302,9 +302,9 @@
}
],
"kernelspec": {
"display_name": "Python 3.6",
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python36"
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {

View File

@@ -1,217 +0,0 @@
NOTICES AND INFORMATION
Do Not Translate or Localize
This Azure Machine Learning service example notebooks repository includes material from the projects listed below.
1. SSD-Tensorflow (https://github.com/balancap/ssd-tensorflow)
%% SSD-Tensorflow NOTICES AND INFORMATION BEGIN HERE
=========================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
=========================================
END OF SSD-Tensorflow NOTICES AND INFORMATION

View File

@@ -1,104 +0,0 @@
# Notebooks for Microsoft Azure Machine Learning Hardware Accelerated Models SDK
Easily create and train a model using various deep neural networks (DNNs) as a featurizer for deployment to Azure or a Data Box Edge device for ultra-low latency inferencing using FPGAs. These models are currently available:
* ResNet 50
* ResNet 152
* DenseNet-121
* VGG-16
* SSD-VGG
To learn more about the azureml-accel-model classes, see the section [Model Classes](#model-classes) below or the [Azure ML Accel Models SDK documentation](https://docs.microsoft.com/en-us/python/api/azureml-accel-models/azureml.accel?view=azure-ml-py).
### Step 1: Create an Azure ML workspace
Follow [these instructions](https://docs.microsoft.com/en-us/azure/machine-learning/service/setup-create-workspace) to install the Azure ML SDK on your local machine, create an Azure ML workspace, and set up your notebook environment, which is required for the next step.
### Step 2: Check your FPGA quota
Use the Azure CLI to check whether you have quota.
```shell
az vm list-usage --location "eastus" -o table
```
The other locations are ``southeastasia``, ``westeurope``, and ``westus2``.
Under the "Name" column, look for "Standard PBS Family vCPUs" and ensure you have at least 6 vCPUs under "CurrentValue."
If you do not have quota, then submit a request form [here](https://aka.ms/accelerateAI).
### Step 3: Install the Azure ML Accelerated Models SDK
Once you have set up your environment, install the Azure ML Accel Models SDK. This package requires tensorflow >= 1.6,<2.0 to be installed.
If you already have tensorflow >= 1.6,<2.0 installed in your development environment, you can install the SDK package using:
```
pip install azureml-accel-models
```
If you do not have tensorflow >= 1.6,<2.0 and are using a CPU-only development environment, our SDK with tensorflow can be installed using:
```
pip install azureml-accel-models[cpu]
```
If your machine supports GPU (for example, on an [Azure DSVM](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/overview)), then you can leverage the tensorflow-gpu functionality using:
```
pip install azureml-accel-models[gpu]
```
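The `tensorflow >= 1.6,<2.0` requirement from this step can be checked before choosing an install command. This is a minimal sketch, assuming a plain dotted version string; the helper name `tf_version_supported` is hypothetical and not part of the SDK.

```python
import re


def tf_version_supported(version: str) -> bool:
    """Return True if `version` falls in the range tensorflow >= 1.6,<2.0.

    Hypothetical helper for illustration; only the leading "major.minor"
    numbers are compared, so pre-release suffixes such as "rc1" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison keeps 1.15 above 1.6, unlike a string comparison.
    return (1, 6) <= (major, minor) < (2, 0)


print(tf_version_supported("1.15.0"))
print(tf_version_supported("2.4.1"))
```

In an environment where tensorflow is already installed, the same check could be run on `tensorflow.__version__`.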
### Step 4: Follow our notebooks
We provide notebooks to walk through the following scenarios, linked below:
* [Quickstart](https://github.com/Azure/MachineLearningNotebooks/blob/33d6def8c30d3dd3a5bfbea50b9c727788185faf/how-to-use-azureml/deployment/accelerated-models/accelerated-models-quickstart.ipynb), deploy and inference a ResNet50 model trained on ImageNet
* [Object Detection](https://github.com/Azure/MachineLearningNotebooks/blob/33d6def8c30d3dd3a5bfbea50b9c727788185faf/how-to-use-azureml/deployment/accelerated-models/accelerated-models-object-detection.ipynb), deploy and inference an SSD-VGG model that can do object detection
* [Training models](https://github.com/Azure/MachineLearningNotebooks/blob/33d6def8c30d3dd3a5bfbea50b9c727788185faf/how-to-use-azureml/deployment/accelerated-models/accelerated-models-training.ipynb), train one of our accelerated models on the Kaggle Cats and Dogs dataset to see how to improve accuracy on custom datasets
**Note**: the above notebooks work only for tensorflow >= 1.6,<2.0.
<a name="model-classes"></a>
## Model Classes
As stated above, we support 5 Accelerated Models. Here's more information on their input and output tensors.
**Available models and output tensors**
The available models and the corresponding default classifier output tensors are below. This is the value that you would use during inferencing if you used the default classifier.
* Resnet50, QuantizedResnet50
```python
output_tensors = "classifier_1/resnet_v1_50/predictions/Softmax:0"
```
* Resnet152, QuantizedResnet152
```python
output_tensors = "classifier/resnet_v1_152/predictions/Softmax:0"
```
* Densenet121, QuantizedDensenet121
```python
output_tensors = "classifier/densenet121/predictions/Softmax:0"
```
* Vgg16, QuantizedVgg16
```python
output_tensors = "classifier/vgg_16/fc8/squeezed:0"
```
* SsdVgg, QuantizedSsdVgg
```python
output_tensors = ['ssd_300_vgg/block4_box/Reshape_1:0', 'ssd_300_vgg/block7_box/Reshape_1:0', 'ssd_300_vgg/block8_box/Reshape_1:0', 'ssd_300_vgg/block9_box/Reshape_1:0', 'ssd_300_vgg/block10_box/Reshape_1:0', 'ssd_300_vgg/block11_box/Reshape_1:0', 'ssd_300_vgg/block4_box/Reshape:0', 'ssd_300_vgg/block7_box/Reshape:0', 'ssd_300_vgg/block8_box/Reshape:0', 'ssd_300_vgg/block9_box/Reshape:0', 'ssd_300_vgg/block10_box/Reshape:0', 'ssd_300_vgg/block11_box/Reshape:0']
```
For more information, please reference the azureml.accel.models package in the [Azure ML Python SDK documentation](https://docs.microsoft.com/en-us/python/api/azureml-accel-models/azureml.accel.models?view=azure-ml-py).
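The defaults listed above can be kept in one lookup table so that inferencing code does not repeat the tensor names. This is only a sketch: the tensor strings are copied from the list above, while `DEFAULT_OUTPUT_TENSORS` and `default_output_tensors` are hypothetical names and not part of the azureml-accel-models package.

```python
from typing import List, Union

# SSD-VGG exposes one pair of tensors per feature block.
_SSD_BLOCKS = ("block4", "block7", "block8", "block9", "block10", "block11")

# Default classifier output tensors, keyed by the non-quantized model name.
DEFAULT_OUTPUT_TENSORS = {
    "Resnet50": "classifier_1/resnet_v1_50/predictions/Softmax:0",
    "Resnet152": "classifier/resnet_v1_152/predictions/Softmax:0",
    "Densenet121": "classifier/densenet121/predictions/Softmax:0",
    "Vgg16": "classifier/vgg_16/fc8/squeezed:0",
    "SsdVgg": (
        [f"ssd_300_vgg/{b}_box/Reshape_1:0" for b in _SSD_BLOCKS]
        + [f"ssd_300_vgg/{b}_box/Reshape:0" for b in _SSD_BLOCKS]
    ),
}


def default_output_tensors(model_name: str) -> Union[str, List[str]]:
    """Look up the default output tensor(s) for a model class name.

    Quantized variants (for example "QuantizedResnet50") share the
    defaults of their base model, as in the list above.
    """
    prefix = "Quantized"
    base = model_name[len(prefix):] if model_name.startswith(prefix) else model_name
    if base not in DEFAULT_OUTPUT_TENSORS:
        raise KeyError(f"no default output tensor recorded for {model_name!r}")
    return DEFAULT_OUTPUT_TENSORS[base]


print(default_output_tensors("QuantizedVgg16"))
```

If you replace the default classifier with your own, these values no longer apply and the tensor name of your classifier's output should be used instead.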
**Input tensors**
The input_tensors value defaults to "Placeholder:0" and is created in the [Image Preprocessing](#construct-model) step in the line:
```python
in_images = tf.placeholder(tf.string)
```
You can change the input_tensors name by doing this:
```python
in_images = tf.placeholder(tf.string, name="images")
```
## Resources
* [Read more about FPGAs](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-accelerate-with-fpgas)

View File

@@ -1,14 +0,0 @@
# Model Deployment with Azure ML service
You can use Azure Machine Learning to package, debug, validate and deploy inference containers to a variety of compute targets. This process is known as "MLOps" (ML operationalization).
For more information please check out this article: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-and-where
## Get Started
To begin, you will need an ML workspace.
For more information please check out this article: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-workspace
## Deploy to the cloud
You can deploy to the cloud using the Azure ML CLI or the Azure ML SDK.
- CLI example: https://aka.ms/azmlcli
- Notebook example: [model-register-and-deploy](./model-register-and-deploy.ipynb).
![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/deployment/deploy-multi-model/README.png)

View File

@@ -1,395 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/deployment/deploy-multi-model/multi-model-register-and-deploy.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Deploy Multiple Models as Webservice\n",
"\n",
"This example shows how to deploy a Webservice with multiple models in step-by-step fashion:\n",
"\n",
" 1. Register Models\n",
" 2. Deploy Models as Webservice"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the [configuration](../../../configuration.ipynb) Notebook first if you haven't."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check core SDK version number\n",
"import azureml.core\n",
"\n",
"print(\"SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Workspace\n",
"\n",
"Initialize a workspace object from persisted configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"create workspace"
]
},
"outputs": [],
"source": [
"from azureml.core import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Register Models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we will be using and registering two models. \n",
"\n",
"First we will train two simple models on the [diabetes dataset](https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset) included with scikit-learn, serializing them to files in the current directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import joblib\n",
"import sklearn\n",
"\n",
"from sklearn.datasets import load_diabetes\n",
"from sklearn.linear_model import BayesianRidge, Ridge\n",
"\n",
"x, y = load_diabetes(return_X_y=True)\n",
"\n",
"first_model = Ridge().fit(x, y)\n",
"second_model = BayesianRidge().fit(x, y)\n",
"\n",
"joblib.dump(first_model, \"first_model.pkl\")\n",
"joblib.dump(second_model, \"second_model.pkl\")\n",
"\n",
"print(\"Trained models using scikit-learn {}.\".format(sklearn.__version__))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have our trained models locally, we will register them as Models with the names `my_first_model` and `my_second_model` in the workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"register model from file"
]
},
"outputs": [],
"source": [
"from azureml.core.model import Model\n",
"\n",
"my_model_1 = Model.register(model_path=\"first_model.pkl\",\n",
" model_name=\"my_first_model\",\n",
" workspace=ws)\n",
"\n",
"my_model_2 = Model.register(model_path=\"second_model.pkl\",\n",
" model_name=\"my_second_model\",\n",
" workspace=ws)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Write the Entry Script\n",
"Write the script that the web service will use to run predictions with your models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model.get_model_path()\n",
"\n",
"To get the paths of your models, use the `Model.get_model_path(model_name, version=None, _workspace=None)` method. It finds the path to a model using the name under which the model is registered in the workspace.\n",
"\n",
"In this example, we do not use the optional arguments `version` and `_workspace`.\n",
"\n",
"#### Using environment variable AZUREML_MODEL_DIR\n",
"\n",
"In other [examples](../deploy-to-cloud/score.py) with a single model deployment, we use the environment variable `AZUREML_MODEL_DIR` and model file name to get the model path. \n",
"\n",
"For single model deployments, this environment variable is the path to the model folder (`./azureml-models/$MODEL_NAME/$VERSION`). When we deploy multiple models, the environment variable is set to the folder containing all models (`./azureml-models`).\n",
"\n",
"If you're using multiple models and you know the versions of the models you deploy, you can instead build the model path directly from the environment variable:\n",
"\n",
"```python\n",
"# Construct the model path using the registered model name, version, and model file name\n",
"model_1_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'my_first_model', '1', 'first_model.pkl')\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile score.py\n",
"import joblib\n",
"import json\n",
"import numpy as np\n",
"\n",
"from azureml.core.model import Model\n",
"\n",
"def init():\n",
"    global model_1, model_2\n",
"    # Here \"my_first_model\" is the name of the model registered under the workspace.\n",
"    # This call will return the path to the .pkl file on the local disk.\n",
"    model_1_path = Model.get_model_path(model_name='my_first_model')\n",
"    model_2_path = Model.get_model_path(model_name='my_second_model')\n",
"\n",
"    # Deserialize the model files back into scikit-learn models.\n",
"    model_1 = joblib.load(model_1_path)\n",
"    model_2 = joblib.load(model_2_path)\n",
"\n",
"# Note you can pass in multiple rows for scoring.\n",
"def run(raw_data):\n",
"    try:\n",
"        data = json.loads(raw_data)['data']\n",
"        data = np.array(data)\n",
"\n",
"        # Call predict() on each model\n",
"        result_1 = model_1.predict(data)\n",
"        result_2 = model_2.predict(data)\n",
"\n",
"        # You can return any JSON-serializable value.\n",
"        return {\"prediction1\": result_1.tolist(), \"prediction2\": result_2.tolist()}\n",
"    except Exception as e:\n",
"        result = str(e)\n",
"        return result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can now create and/or use an Environment object when deploying a Webservice. The Environment may already be registered with your Workspace; if it is not, it will be registered as part of the Webservice deployment. Note that your environment must include azureml-defaults with version >= 1.0.45 as a pip dependency, because it contains the functionality needed to host the model as a web service.\n",
"\n",
"More information can be found in our [using environments notebook](../training/using-environments/using-environments.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Environment\n",
"\n",
"env = Environment(\"deploytocloudenv\")\n",
"env.python.conda_dependencies.add_pip_package(\"joblib\")\n",
"env.python.conda_dependencies.add_pip_package(\"numpy\")\n",
"env.python.conda_dependencies.add_pip_package(\"scikit-learn=={}\".format(sklearn.__version__))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Inference Configuration\n",
"\n",
"A source directory is also supported: you can upload an entire folder from your local machine as dependencies for the Webservice.\n",
"Note: in that case, `entry_script` and the environment's `file_path` are paths relative to `source_directory`; `myenv.docker.base_dockerfile` is a string containing extra Docker steps or the contents of a Dockerfile.\n",
"\n",
"Sample code for using a source directory:\n",
"\n",
"```python\n",
"from azureml.core.environment import Environment\n",
"from azureml.core.model import InferenceConfig\n",
"\n",
"myenv = Environment.from_conda_specification(name='myenv', file_path='env/myenv.yml')\n",
"\n",
"# explicitly set base_image to None when setting base_dockerfile\n",
"myenv.docker.base_image = None\n",
"# add extra docker commands to execute\n",
"myenv.docker.base_dockerfile = \"FROM ubuntu\\n RUN echo \\\"hello\\\"\"\n",
"\n",
"inference_config = InferenceConfig(source_directory=\"C:/abc\",\n",
" entry_script=\"x/y/score.py\",\n",
" environment=myenv)\n",
"```\n",
"\n",
" - file_path: input parameter to `Environment.from_conda_specification`. Points to the conda specification file that manages conda and Python package dependencies.\n",
" - env.docker.base_dockerfile: any extra steps you want to inject into the Dockerfile\n",
" - source_directory: holds the source path as a string; the entire folder is added to the image, so it's easy to access any files within this folder or its subfolders\n",
" - entry_script: contains logic specific to initializing your model and running predictions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"create image"
]
},
"outputs": [],
"source": [
"from azureml.core.model import InferenceConfig\n",
"\n",
"inference_config = InferenceConfig(entry_script=\"score.py\", environment=env)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Deploy Model as Webservice on Azure Container Instance\n",
"\n",
"Note that the service creation can take a few minutes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"azuremlexception-remarks-sample"
]
},
"outputs": [],
"source": [
"from azureml.core.webservice import AciWebservice\n",
"\n",
"aci_service_name = \"aciservice-multimodel\"\n",
"\n",
"deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)\n",
"\n",
"service = Model.deploy(ws, aci_service_name, [my_model_1, my_model_2], inference_config, deployment_config, overwrite=True)\n",
"service.wait_for_deployment(True)\n",
"\n",
"print(service.state)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Test web service"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"test_sample = json.dumps({'data': x[0:2].tolist()})\n",
"\n",
"prediction = service.run(test_sample)\n",
"\n",
"print(prediction)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Delete ACI to clean up"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"deploy service",
"aci"
]
},
"outputs": [],
"source": [
"service.delete()"
]
}
],
"metadata": {
"authors": [
{
"name": "jenns"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@@ -1,6 +0,0 @@
name: multi-model-register-and-deploy
dependencies:
- pip:
- azureml-sdk
- numpy
- scikit-learn

@@ -1,12 +0,0 @@
# Model Deployment with Azure ML service
You can use Azure Machine Learning to package, debug, validate and deploy inference containers to a variety of compute targets. This process is known as "MLOps" (ML operationalization).
For more information please check out this article: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-and-where
## Get Started
To begin, you will need an ML workspace.
For more information please check out this article: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-workspace
## Deploy to the cloud
You can deploy to the cloud using the Azure ML CLI or the Azure ML SDK.
- CLI example: https://aka.ms/azmlcli
- Notebook example: [model-register-and-deploy](./model-register-and-deploy.ipynb).
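
The entry script used by the notebook example locates its model file through the `AZUREML_MODEL_DIR` environment variable. The sketch below illustrates how such a path could be assembled for a single-model and a multi-model deployment. `resolve_model_path` is a hypothetical helper (not part of the Azure ML SDK), and the variable is set by hand here only to simulate what the deployed service would provide.

```python
import os
import tempfile


def resolve_model_path(model_filename, model_name=None, version=None):
    """Build the path to a model file under AZUREML_MODEL_DIR.

    Hypothetical helper for illustration; not an Azure ML SDK function.
    """
    model_dir = os.environ["AZUREML_MODEL_DIR"]
    if model_name is None:
        # Single-model deployment: the variable already points at the model folder.
        return os.path.join(model_dir, model_filename)
    # Multi-model deployment: the variable points at the folder holding all models,
    # so the registered name and version are part of the path.
    return os.path.join(model_dir, model_name, str(version), model_filename)


# Assumed local stand-in for the directory the service would mount.
base_dir = os.path.join(tempfile.gettempdir(), "azureml-models")

# Simulate a single-model deployment.
os.environ["AZUREML_MODEL_DIR"] = os.path.join(base_dir, "my-sklearn-model", "1")
single_path = resolve_model_path("sklearn_regression_model.pkl")

# Simulate a multi-model deployment.
os.environ["AZUREML_MODEL_DIR"] = base_dir
multi_path = resolve_model_path("first_model.pkl", model_name="my_first_model", version=1)

print(single_path)
print(multi_path)
```

The helper only joins path segments; it does not check that the file exists, so a real entry script would still fail at load time if the name or version is wrong.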

@@ -1,597 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/deployment/deploy-to-cloud/model-register-and-deploy.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Register model and deploy as webservice in ACI\n",
"\n",
"Following this notebook, you will:\n",
"\n",
" - Learn how to register a model in your Azure Machine Learning Workspace.\n",
" - Deploy your model as a web service in an Azure Container Instance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the [configuration notebook](../../../configuration.ipynb) to install the Azure Machine Learning Python SDK and create a workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.core\n",
"\n",
"\n",
"# Check core SDK version number.\n",
"print('SDK version:', azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize workspace\n",
"\n",
"Create a [Workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace%28class%29?view=azure-ml-py) object from your persisted configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"create workspace"
]
},
"outputs": [],
"source": [
"from azureml.core import Workspace\n",
"\n",
"\n",
"ws = Workspace.from_config()\n",
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create trained model\n",
"\n",
"For this example, we will train a small model on scikit-learn's [diabetes dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import joblib\n",
"\n",
"from sklearn.datasets import load_diabetes\n",
"from sklearn.linear_model import Ridge\n",
"\n",
"\n",
"dataset_x, dataset_y = load_diabetes(return_X_y=True)\n",
"\n",
"model = Ridge().fit(dataset_x, dataset_y)\n",
"\n",
"joblib.dump(model, 'sklearn_regression_model.pkl')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Register input and output datasets\n",
"\n",
"Here, you will register the data used to create the model in your workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"from azureml.core import Dataset\n",
"\n",
"\n",
"np.savetxt('features.csv', dataset_x, delimiter=',')\n",
"np.savetxt('labels.csv', dataset_y, delimiter=',')\n",
"\n",
"datastore = ws.get_default_datastore()\n",
"datastore.upload_files(files=['./features.csv', './labels.csv'],\n",
" target_path='sklearn_regression/',\n",
" overwrite=True)\n",
"\n",
"input_dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, 'sklearn_regression/features.csv')])\n",
"output_dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, 'sklearn_regression/labels.csv')])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Register model\n",
"\n",
"Register a file or folder as a model by calling [Model.register()](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model.model?view=azure-ml-py#register-workspace--model-path--model-name--tags-none--properties-none--description-none--datasets-none--model-framework-none--model-framework-version-none--child-paths-none-).\n",
"\n",
"In addition to the content of the model file itself, your registered model will also store model metadata -- model description, tags, and framework information -- that will be useful when managing and deploying models in your workspace. Using tags, for instance, you can categorize your models and apply filters when listing models in your workspace. Also, marking this model with the scikit-learn framework will simplify deploying it as a web service, as we'll see later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"register model from file",
"sample-model-register"
]
},
"outputs": [],
"source": [
"import sklearn\n",
"\n",
"from azureml.core import Model\n",
"from azureml.core.resource_configuration import ResourceConfiguration\n",
"\n",
"\n",
"model = Model.register(workspace=ws,\n",
" model_name='my-sklearn-model', # Name of the registered model in your workspace.\n",
" model_path='./sklearn_regression_model.pkl', # Local file to upload and register as a model.\n",
" model_framework=Model.Framework.SCIKITLEARN, # Framework used to create the model.\n",
" model_framework_version=sklearn.__version__, # Version of scikit-learn used to create the model.\n",
" sample_input_dataset=input_dataset,\n",
" sample_output_dataset=output_dataset,\n",
" resource_configuration=ResourceConfiguration(cpu=1, memory_in_gb=0.5),\n",
" description='Ridge regression model to predict diabetes progression.',\n",
" tags={'area': 'diabetes', 'type': 'regression'})\n",
"\n",
"print('Name:', model.name)\n",
"print('Version:', model.version)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy model\n",
"\n",
"Deploy your model as a web service using [Model.deploy()](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model.model?view=azure-ml-py#deploy-workspace--name--models--inference-config--deployment-config-none--deployment-target-none-). Web services take one or more models, load them in an environment, and run them on one of several supported deployment targets. For more information on all your options when deploying models, see the [next steps](#Next-steps) section at the end of this notebook.\n",
"\n",
"For this example, we will deploy your scikit-learn model to an Azure Container Instance (ACI)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use a default environment (for supported models)\n",
"\n",
"The Azure Machine Learning service provides a default environment for supported model frameworks, including scikit-learn, based on the metadata you provided when registering your model. This is the easiest way to deploy your model.\n",
"\n",
"Even when you deploy your model to ACI with a default environment you can still customize the deploy configuration (i.e. the number of cores and amount of memory made available for the deployment) using the [AciWebservice.deploy_configuration()](https://docs.microsoft.com/python/api/azureml-core/azureml.core.webservice.aci.aciwebservice#deploy-configuration-cpu-cores-none--memory-gb-none--tags-none--properties-none--description-none--location-none--auth-enabled-none--ssl-enabled-none--enable-app-insights-none--ssl-cert-pem-file-none--ssl-key-pem-file-none--ssl-cname-none--dns-name-label-none--). Look at the \"Use a custom environment\" section of this notebook for more information on deploy configuration.\n",
"\n",
"**Note**: This step can take several minutes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"service_name = 'my-sklearn-service'\n",
"\n",
"service = Model.deploy(ws, service_name, [model], overwrite=True)\n",
"service.wait_for_deployment(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After your model is deployed, perform a call to the web service using [service.run()](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.webservice%28class%29?view=azure-ml-py#run-input-)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"\n",
"input_payload = json.dumps({\n",
" 'data': dataset_x[0:2].tolist(),\n",
" 'method': 'predict' # If you have a classification model, you can get probabilities by changing this to 'predict_proba'.\n",
"})\n",
"\n",
"output = service.run(input_payload)\n",
"\n",
"print(output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you are finished testing your service, clean up the deployment with [service.delete()](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.webservice%28class%29?view=azure-ml-py#delete--)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"service.delete()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use a custom environment\n",
"\n",
"If you want more control over how your model is run, if it uses another framework, or if it has special runtime requirements, you can instead specify your own environment and scoring method. Custom environments can be used for any model you want to deploy.\n",
"\n",
"Specify the model's runtime environment by creating an [Environment](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.environment%28class%29?view=azure-ml-py) object and providing the [CondaDependencies](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.conda_dependencies.condadependencies?view=azure-ml-py) needed by your model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Environment\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"\n",
"environment = Environment('my-sklearn-environment')\n",
"environment.python.conda_dependencies = CondaDependencies.create(conda_packages=[\n",
" 'pip==20.2.4'],\n",
" pip_packages=[\n",
" 'azureml-defaults',\n",
" 'inference-schema[numpy-support]',\n",
" 'joblib',\n",
" 'numpy',\n",
" 'scikit-learn=={}'.format(sklearn.__version__)\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When using a custom environment, you must also provide Python code for initializing and running your model. An example script is included with this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open('score.py') as f:\n",
" print(f.read())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Deploy your model in the custom environment by providing an [InferenceConfig](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model.inferenceconfig?view=azure-ml-py) object to [Model.deploy()](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model.model?view=azure-ml-py#deploy-workspace--name--models--inference-config--deployment-config-none--deployment-target-none-). In this case we are also using the [AciWebservice.deploy_configuration()](https://docs.microsoft.com/python/api/azureml-core/azureml.core.webservice.aci.aciwebservice#deploy-configuration-cpu-cores-none--memory-gb-none--tags-none--properties-none--description-none--location-none--auth-enabled-none--ssl-enabled-none--enable-app-insights-none--ssl-cert-pem-file-none--ssl-key-pem-file-none--ssl-cname-none--dns-name-label-none--) method to generate a custom deploy configuration.\n",
"\n",
"**Note**: This step can take several minutes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"azuremlexception-remarks-sample",
"sample-aciwebservice-deploy-config"
]
},
"outputs": [],
"source": [
"from azureml.core.model import InferenceConfig\n",
"from azureml.core.webservice import AciWebservice\n",
"\n",
"\n",
"service_name = 'my-custom-env-service'\n",
"\n",
"inference_config = InferenceConfig(entry_script='score.py', environment=environment)\n",
"aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)\n",
"\n",
"service = Model.deploy(workspace=ws,\n",
" name=service_name,\n",
" models=[model],\n",
" inference_config=inference_config,\n",
" deployment_config=aci_config,\n",
" overwrite=True)\n",
"service.wait_for_deployment(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After your model is deployed, make a call to the web service using [service.run()](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.webservice%28class%29?view=azure-ml-py#run-input-)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input_payload = json.dumps({\n",
" 'data': dataset_x[0:2].tolist()\n",
"})\n",
"\n",
"output = service.run(input_payload)\n",
"\n",
"print(output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you are finished testing your service, clean up the deployment with [service.delete()](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.webservice%28class%29?view=azure-ml-py#delete--)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"service.delete()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model Profiling\n",
"\n",
"Profile your model to understand how much CPU and memory the service, created as a result of its deployment, will need. Profiling returns information such as CPU usage, memory usage, and response latency. It also provides a CPU and memory recommendation based on the resource usage. You can profile your model (or more precisely the service built based on your model) on any CPU and/or memory combination where 0.1 <= CPU <= 3.5 and 0.1GB <= memory <= 15GB. If you do not provide a CPU and/or memory requirement, we will test it on the default configuration of 3.5 CPU and 15GB memory.\n",
"\n",
"In order to profile your model you will need:\n",
"- a registered model\n",
"- an entry script\n",
"- an inference configuration\n",
"- a single column tabular dataset, where each row contains a string representing sample request data sent to the service.\n",
"\n",
"Note that profiling is a long-running operation and can take up to 25 minutes depending on the size of the dataset.\n",
"\n",
"At this point we only support profiling of services that expect their request data to be a string, for example: string serialized json, text, string serialized image, etc. The content of each row of the dataset (string) will be put into the body of the HTTP request and sent to the service encapsulating the model for scoring.\n",
"\n",
"Below is an example of how you can construct an input dataset to profile a service which expects its incoming requests to contain serialized json. In this case we created a dataset based on one hundred instances of the same request data. In real world scenarios however, we suggest that you use larger datasets with various inputs, especially if your model resource usage/behavior is input dependent."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may want to register datasets using the register() method to your workspace so they can be shared with others, reused and referred to by name in your script.\n",
"You can try to get the dataset first to see if it's already registered."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Datastore\n",
"from azureml.core.dataset import Dataset\n",
"from azureml.data import dataset_type_definitions\n",
"\n",
"dataset_name = 'diabetes_sample_request_data'\n",
"\n",
"dataset_registered = False\n",
"try:\n",
"    sample_request_data = Dataset.get_by_name(workspace=ws, name=dataset_name)\n",
"    dataset_registered = True\n",
"except Exception:\n",
"    print(\"The dataset {} is not registered in workspace yet.\".format(dataset_name))\n",
"\n",
"if not dataset_registered:\n",
"    # create a string that can be utf-8 encoded and\n",
"    # put in the body of the request\n",
"    serialized_input_json = json.dumps({\n",
"        'data': [\n",
"            [ 0.03807591, 0.05068012, 0.06169621, 0.02187235, -0.0442235,\n",
"              -0.03482076, -0.04340085, -0.00259226, 0.01990842, -0.01764613]\n",
"        ]\n",
"    })\n",
"    dataset_content = []\n",
"    for i in range(100):\n",
"        dataset_content.append(serialized_input_json)\n",
"    dataset_content = '\\n'.join(dataset_content)\n",
"    file_name = \"{}.txt\".format(dataset_name)\n",
"    # write the sample requests to a local text file, one request per line\n",
"    with open(file_name, 'w') as f:\n",
"        f.write(dataset_content)\n",
"\n",
"    # upload the txt file created above to the Datastore and create a dataset from it\n",
"    data_store = Datastore.get_default(ws)\n",
"    data_store.upload_files(['./' + file_name], target_path='sample_request_data')\n",
"    datastore_path = [(data_store, 'sample_request_data' +'/' + file_name)]\n",
"    sample_request_data = Dataset.Tabular.from_delimited_files(\n",
"        datastore_path,\n",
"        separator='\\n',\n",
"        infer_column_types=True,\n",
"        header=dataset_type_definitions.PromoteHeadersBehavior.NO_HEADERS)\n",
"    sample_request_data = sample_request_data.register(workspace=ws,\n",
"                                                       name=dataset_name,\n",
"                                                       create_new_version=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have an input dataset we are ready to go ahead with profiling. In this case we are testing the previously introduced sklearn regression model on 1 CPU and 0.5 GB memory. The memory usage and recommendation presented in the result is measured in Gigabytes. The CPU usage and recommendation is measured in CPU cores."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"\n",
"environment = Environment('my-sklearn-environment')\n",
"environment.python.conda_dependencies = CondaDependencies.create(conda_packages=[\n",
" 'pip==20.2.4'],\n",
" pip_packages=[\n",
" 'azureml-defaults',\n",
" 'inference-schema[numpy-support]',\n",
" 'joblib',\n",
" 'numpy',\n",
" 'scikit-learn=={}'.format(sklearn.__version__)\n",
"])\n",
"inference_config = InferenceConfig(entry_script='score.py', environment=environment)\n",
"# if cpu and memory_in_gb parameters are not provided\n",
"# the model will be profiled on default configuration of\n",
"# 3.5CPU and 15GB memory\n",
"profile = Model.profile(ws,\n",
" 'rgrsn-%s' % datetime.now().strftime('%m%d%Y-%H%M%S'),\n",
" [model],\n",
" inference_config,\n",
" input_dataset=sample_request_data,\n",
" cpu=1.0,\n",
" memory_in_gb=0.5)\n",
"\n",
"# profiling is a long running operation and may take up to 25 min\n",
"profile.wait_for_completion(True)\n",
"details = profile.get_details()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model packaging\n",
"\n",
"If you want to build a Docker image that encapsulates your model and its dependencies, you can use the model packaging option. The output image will be pushed to your workspace's Azure Container Registry (ACR).\n",
"\n",
"You must include an Environment object in your inference configuration to use `Model.package()`.\n",
"\n",
"```python\n",
"package = Model.package(ws, [model], inference_config)\n",
"package.wait_for_creation(show_output=True) # Or show_output=False to hide the Docker build logs.\n",
"package.pull()\n",
"```\n",
"\n",
"Instead of a fully-built image, you can also generate a Dockerfile and download all the assets needed to build an image on top of your Environment.\n",
"\n",
"```python\n",
"package = Model.package(ws, [model], inference_config, generate_dockerfile=True)\n",
"package.wait_for_creation(show_output=True)\n",
"package.save(\"./local_context_dir\")\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
" - To run a production-ready web service, see the [notebook on deployment to Azure Kubernetes Service](../production-deploy-to-aks/production-deploy-to-aks.ipynb).\n",
" - To run a local web service, see the [notebook on deployment to a local Docker container](../deploy-to-local/register-model-deploy-local.ipynb).\n",
" - For more information on datasets, see the [notebook on training with datasets](../../work-with-data/datasets-tutorial/train-with-datasets/train-with-datasets.ipynb).\n",
" - For more information on environments, see the [notebook on using environments](../../training/using-environments/using-environments.ipynb).\n",
" - For information on all the available deployment targets, see [&ldquo;How and where to deploy models&rdquo;](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-and-where#choose-a-compute-target)."
]
}
],
"metadata": {
"authors": [
{
"name": "vaidyas"
}
],
"category": "deployment",
"compute": [
"None"
],
"datasets": [
"Diabetes"
],
"deployment": [
"Azure Container Instance"
],
"exclude_from_index": false,
"framework": [
"Scikit-learn"
],
"friendly_name": "Register model and deploy as webservice",
"index_order": 3,
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
},
"star_tag": [
"featured"
],
"tags": [
"None"
],
"task": "Deploy a model with Azure Machine Learning"
},
"nbformat": 4,
"nbformat_minor": 2
}

@@ -1,6 +0,0 @@
name: model-register-and-deploy
dependencies:
- pip:
- azureml-sdk
- numpy
- scikit-learn

@@ -1,38 +0,0 @@
import joblib
import numpy as np
import os

from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType


# The init() method is called once, when the web service starts up.
#
# Typically you would deserialize the model file, as shown here using joblib,
# and store it in a global variable so your run() method can access it later.
def init():
    global model

    # The AZUREML_MODEL_DIR environment variable indicates
    # a directory containing the model file you registered.
    model_filename = 'sklearn_regression_model.pkl'
    model_path = os.path.join(os.environ['AZUREML_MODEL_DIR'], model_filename)

    model = joblib.load(model_path)


# The run() method is called each time a request is made to the scoring API.
#
# Shown here are the optional input_schema and output_schema decorators
# from the inference-schema pip package. Using these decorators on your
# run() method parses and validates the incoming payload against
# the example input you provide here. This will also generate a Swagger
# API document for your web service.
@input_schema('data', NumpyParameterType(np.array([[0.1, 1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8, 8.9, 9.0]])))
@output_schema(NumpyParameterType(np.array([4429.929236457418])))
def run(data):
    # Use the model object loaded by init().
    result = model.predict(data)

    # You can return any JSON-serializable object.
    return result.tolist()

View File

@@ -1,12 +0,0 @@
# Model Deployment with Azure ML service
You can use Azure Machine Learning to package, debug, validate and deploy inference containers to a variety of compute targets. This process is known as "MLOps" (ML operationalization).
For more information please check out this article: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-and-where
## Get Started
To begin, you will need an ML workspace.
For more information please check out this article: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-workspace
## Deploy locally
You can deploy a model locally for testing & debugging using the Azure ML CLI or the Azure ML SDK.
- CLI example: https://aka.ms/azmlcli
- Notebook example: [register-model-deploy-local](./register-model-deploy-local.ipynb).
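
Before building a container, it can help to exercise the entry script's `init()`/`run()` contract in plain Python. The sketch below is not Azure ML code and makes several assumptions: `MeanModel` is a hypothetical stand-in for the registered scikit-learn model, and `call_run` is a hypothetical helper that mimics a request whose JSON body carries the rows under a `data` key, as the sample notebooks do.

```python
import json


# Hypothetical stand-in for a trained model; a real entry script would
# load the registered model file with joblib instead.
class MeanModel:
    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


model = None


def init():
    """Called once at start-up: store the model in a global for run()."""
    global model
    model = MeanModel()


def run(data):
    """Called per request: score the rows and return a JSON-serializable result."""
    try:
        return model.predict(data)
    except Exception as e:
        # Mirror the sample scripts, which return the error text to the caller.
        return str(e)


def call_run(payload):
    """Hypothetical helper: decode a JSON request body and pass its 'data' field to run()."""
    return run(json.loads(payload)["data"])


init()
sample_input = json.dumps({"data": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]})
print(call_run(sample_input))
```

Keeping `init()` and `run()` free of service-specific code makes this kind of quick check possible; the real deployment still needs the local Docker workflow described in the notebook.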

View File

@@ -1 +0,0 @@
RUN echo "this is test"

View File

@@ -1,495 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/deployment/deploy-to-local/register-model-deploy-local-advanced.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Register model and deploy locally with advanced usages\n",
"\n",
"This example shows how to deploy a web service in step-by-step fashion:\n",
"\n",
" 1. Register model\n",
" 2. Deploy the image as a web service in a local Docker container.\n",
" 3. Quickly test changes to your entry script by reloading the local service.\n",
" 4. Optionally, you can also make changes to model, conda or extra_docker_file_steps and update local service"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the [configuration](../../../configuration.ipynb) Notebook first if you haven't."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check core SDK version number\n",
"import azureml.core\n",
"\n",
"print(\"SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Workspace\n",
"\n",
"Initialize a workspace object from persisted configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"create workspace"
]
},
"outputs": [],
"source": [
"from azureml.core import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create trained model\n",
"\n",
"For this example, we will train a small model on scikit-learn's [diabetes dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import joblib\n",
"\n",
"from sklearn.datasets import load_diabetes\n",
"from sklearn.linear_model import Ridge\n",
"\n",
"dataset_x, dataset_y = load_diabetes(return_X_y=True)\n",
"\n",
"sk_model = Ridge().fit(dataset_x, dataset_y)\n",
"\n",
"joblib.dump(sk_model, \"sklearn_regression_model.pkl\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Register Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can add tags and descriptions to your models. Here we register the `sklearn_regression_model.pkl` file in the current directory as a model with the name `sklearn_regression_model` in the workspace.\n",
"\n",
"Using tags, you can track useful information such as the name and version of the machine learning library used to train the model, framework, category, target customer, etc. Note that tags must be alphanumeric."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"register model from file",
"sample-model-register"
]
},
"outputs": [],
"source": [
"from azureml.core.model import Model\n",
"\n",
"model = Model.register(model_path=\"sklearn_regression_model.pkl\",\n",
" model_name=\"sklearn_regression_model\",\n",
" tags={'area': \"diabetes\", 'type': \"regression\"},\n",
" description=\"Ridge regression model to predict diabetes\",\n",
" workspace=ws)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Manage your dependencies in a folder"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"source_directory = \"source_directory\"\n",
"\n",
"os.makedirs(source_directory, exist_ok=True)\n",
"os.makedirs(os.path.join(source_directory, \"x/y\"), exist_ok=True)\n",
"os.makedirs(os.path.join(source_directory, \"env\"), exist_ok=True)\n",
"os.makedirs(os.path.join(source_directory, \"dockerstep\"), exist_ok=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show `score.py`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile source_directory/x/y/score.py\n",
"import joblib\n",
"import json\n",
"import numpy as np\n",
"import os\n",
"\n",
"from inference_schema.schema_decorators import input_schema, output_schema\n",
"from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType\n",
"\n",
"def init():\n",
"    global model\n",
"    # AZUREML_MODEL_DIR is an environment variable created during deployment.\n",
"    # It holds the path to the directory that contains the deployed model (./azureml-models/$MODEL_NAME/$VERSION).\n",
"    # If there are multiple models, this value is the path to the directory containing all deployed models (./azureml-models).\n",
"    # Join this path with the filename of the model file.\n",
"    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'sklearn_regression_model.pkl')\n",
"    # Deserialize the model file back into a sklearn model.\n",
"    model = joblib.load(model_path)\n",
"\n",
"    global name\n",
"    # Note: the entire source directory from the inference config is added to the image.\n",
"    # Below is an example of how you can use any extra files in the image.\n",
"    with open('./source_directory/extradata.json') as json_file:\n",
"        data = json.load(json_file)\n",
"        name = data[\"people\"][0][\"name\"]\n",
"\n",
"input_sample = np.array([[10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]])\n",
"output_sample = np.array([3726.995])\n",
"\n",
"@input_schema('data', NumpyParameterType(input_sample))\n",
"@output_schema(NumpyParameterType(output_sample))\n",
"def run(data):\n",
"    try:\n",
"        result = model.predict(data)\n",
"        # You can return any JSON-serializable object.\n",
"        return \"Hello \" + name + \" here is your result = \" + str(result)\n",
"    except Exception as e:\n",
"        error = str(e)\n",
"        return error"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile source_directory/extradata.json\n",
"{\n",
" \"people\": [\n",
" {\n",
" \"website\": \"microsoft.com\", \n",
" \"from\": \"Seattle\", \n",
" \"name\": \"Mrudula\"\n",
" }\n",
" ]\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Inference Configuration\n",
"\n",
" - file_path: input parameter to the Environment constructor; manages conda and Python package dependencies.\n",
" - env.docker.base_dockerfile: any extra steps you want to inject into the Dockerfile.\n",
" - source_directory: the source path as a string. This entire folder is added to the image, so it's easy to access any files within this folder or its subfolders.\n",
" - entry_script: contains logic specific to initializing your model and running predictions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sklearn\n",
"\n",
"from azureml.core.environment import Environment\n",
"from azureml.core.model import InferenceConfig\n",
"\n",
"\n",
"myenv = Environment('myenv')\n",
"myenv.python.conda_dependencies.add_pip_package(\"inference-schema[numpy-support]\")\n",
"myenv.python.conda_dependencies.add_pip_package(\"joblib\")\n",
"myenv.python.conda_dependencies.add_pip_package(\"scikit-learn=={}\".format(sklearn.__version__))\n",
"\n",
"# explicitly set base_image to None when setting base_dockerfile\n",
"myenv.docker.base_image = None\n",
"myenv.docker.base_dockerfile = \"FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04\\nRUN echo \\\"this is test\\\"\"\n",
"myenv.inferencing_stack_version = \"latest\"\n",
"\n",
"inference_config = InferenceConfig(source_directory=source_directory,\n",
" entry_script=\"x/y/score.py\",\n",
" environment=myenv)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy Model as a Local Docker Web Service\n",
"\n",
"*Make sure you have Docker installed and running.*\n",
"\n",
"Note that the service creation can take a few minutes.\n",
"\n",
"NOTE:\n",
"\n",
"The Docker image runs as a Linux container. If you are running Docker for Windows, you need to ensure the Linux Engine is running:\n",
"\n",
" # PowerShell command to switch to Linux engine\n",
" & 'C:\\Program Files\\Docker\\Docker\\DockerCli.exe' -SwitchLinuxEngine"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"deploy service",
"aci"
]
},
"outputs": [],
"source": [
"from azureml.core.webservice import LocalWebservice\n",
"\n",
"# This is optional, if not provided Docker will choose a random unused port.\n",
"deployment_config = LocalWebservice.deploy_configuration(port=6789)\n",
"\n",
"local_service = Model.deploy(ws, \"test\", [model], inference_config, deployment_config)\n",
"\n",
"local_service.wait_for_deployment()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Local service port: {}'.format(local_service.port))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Check Status and Get Container Logs\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(local_service.get_logs())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test Web Service"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Call the web service with some input data to get a prediction."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"sample_input = json.dumps({\n",
" 'data': dataset_x[0:2].tolist()\n",
"})\n",
"\n",
"print(local_service.run(sample_input))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reload Service\n",
"\n",
"You can update your score.py file and then call `reload()` to quickly restart the service. This only reloads your execution script and dependency files; it does not rebuild the underlying Docker image. As a result, `reload()` is fast, but if you do need to rebuild the image (to add a new Conda or pip package, for instance), you will have to call `update()` instead (see below)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile source_directory/x/y/score.py\n",
"import joblib\n",
"import json\n",
"import numpy as np\n",
"import os\n",
"\n",
"from inference_schema.schema_decorators import input_schema, output_schema\n",
"from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType\n",
"\n",
"def init():\n",
"    global model\n",
"    # AZUREML_MODEL_DIR is an environment variable created during deployment.\n",
"    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION)\n",
"    # For multiple models, it points to the folder containing all deployed models (./azureml-models)\n",
"    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'sklearn_regression_model.pkl')\n",
"    # Deserialize the model file back into a sklearn model.\n",
"    model = joblib.load(model_path)\n",
"\n",
"    global name, from_location\n",
"    # Note: the entire source directory from the inference config is added to the image.\n",
"    # Below is an example of how you can use any extra files in the image.\n",
"    with open('source_directory/extradata.json') as json_file:\n",
"        data = json.load(json_file)\n",
"        name = data[\"people\"][0][\"name\"]\n",
"        from_location = data[\"people\"][0][\"from\"]\n",
"\n",
"input_sample = np.array([[10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]])\n",
"output_sample = np.array([3726.995])\n",
"\n",
"@input_schema('data', NumpyParameterType(input_sample))\n",
"@output_schema(NumpyParameterType(output_sample))\n",
"def run(data):\n",
"    try:\n",
"        result = model.predict(data)\n",
"        # You can return any JSON-serializable object.\n",
"        return \"Hello \" + name + \" from \" + from_location + \" here is your result = \" + str(result)\n",
"    except Exception as e:\n",
"        error = str(e)\n",
"        return error"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_service.reload()\n",
"print(\"--------------------------------------------------------------\")\n",
"\n",
"# After calling reload(), run() will return the updated message.\n",
"local_service.run(sample_input)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Update Service\n",
"\n",
"If you want to change your model(s), Conda dependencies, or deployment configuration, call `update()` to rebuild the Docker image.\n",
"\n",
"```python\n",
"\n",
"local_service.update(models=[SomeOtherModelObject],\n",
"                     deployment_config=deployment_config,\n",
"                     inference_config=inference_config)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Delete Service"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_service.delete()"
]
}
],
"metadata": {
"authors": [
{
"name": "keriehm"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,556 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/deployment/deploy-to-local/register-model-deploy-local.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Register model and deploy locally\n",
"\n",
"This example shows how to deploy a web service in step-by-step fashion:\n",
"\n",
" 1. Register model\n",
" 2. Deploy the image as a web service in a local Docker container.\n",
" 3. Quickly test changes to your entry script by reloading the local service.\n",
" 4. Optionally, you can also make changes to model, conda or extra_docker_file_steps and update local service"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the [configuration](../../../configuration.ipynb) Notebook first if you haven't."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check core SDK version number\n",
"import azureml.core\n",
"\n",
"print(\"SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Workspace\n",
"\n",
"Initialize a workspace object from persisted configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create trained model\n",
"\n",
"For this example, we will train a small model on scikit-learn's [diabetes dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import joblib\n",
"\n",
"from sklearn.datasets import load_diabetes\n",
"from sklearn.linear_model import Ridge\n",
"\n",
"dataset_x, dataset_y = load_diabetes(return_X_y=True)\n",
"\n",
"sk_model = Ridge().fit(dataset_x, dataset_y)\n",
"\n",
"joblib.dump(sk_model, \"sklearn_regression_model.pkl\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Register Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we are registering the serialized file `sklearn_regression_model.pkl` in the current directory as a model with the name `sklearn_regression_model` in the workspace.\n",
"\n",
"You can add tags and descriptions to your models. Using tags, you can track useful information such as the name and version of the machine learning library used to train the model, framework, category, target customer etc. Note that tags must be alphanumeric."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"register model from file"
]
},
"outputs": [],
"source": [
"from azureml.core.model import Model\n",
"\n",
"model = Model.register(model_path=\"sklearn_regression_model.pkl\",\n",
" model_name=\"sklearn_regression_model\",\n",
" tags={'area': \"diabetes\", 'type': \"regression\"},\n",
" description=\"Ridge regression model to predict diabetes\",\n",
" workspace=ws)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sklearn\n",
"\n",
"from azureml.core.environment import Environment\n",
"\n",
"environment = Environment(\"LocalDeploy\")\n",
"environment.python.conda_dependencies.add_pip_package(\"inference-schema[numpy-support]\")\n",
"environment.python.conda_dependencies.add_pip_package(\"joblib\")\n",
"environment.python.conda_dependencies.add_pip_package(\"scikit-learn=={}\".format(sklearn.__version__))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Provide the Scoring Script\n",
"\n",
"This Python script handles the model execution inside the service container. The `init()` method loads the model file, and `run(data)` is called for every input to the service."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile score.py\n",
"import joblib\n",
"import json\n",
"import numpy as np\n",
"import os\n",
"\n",
"from inference_schema.schema_decorators import input_schema, output_schema\n",
"from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType\n",
"\n",
"def init():\n",
"    global model\n",
"    # AZUREML_MODEL_DIR is an environment variable created during deployment.\n",
"    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION)\n",
"    # For multiple models, it points to the folder containing all deployed models (./azureml-models)\n",
"    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'sklearn_regression_model.pkl')\n",
"    # Deserialize the model file back into a sklearn model.\n",
"    model = joblib.load(model_path)\n",
"\n",
"input_sample = np.array([[10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]])\n",
"output_sample = np.array([3726.995])\n",
"\n",
"@input_schema('data', NumpyParameterType(input_sample))\n",
"@output_schema(NumpyParameterType(output_sample))\n",
"def run(data):\n",
"    try:\n",
"        result = model.predict(data)\n",
"        # You can return any JSON-serializable object.\n",
"        return result.tolist()\n",
"    except Exception as e:\n",
"        error = str(e)\n",
"        return error"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Inference Configuration"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.model import InferenceConfig\n",
"\n",
"inference_config = InferenceConfig(entry_script=\"score.py\",\n",
" environment=environment)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy Model as a Local Docker Web Service\n",
"\n",
"*Make sure you have Docker installed and running.*\n",
"\n",
"Note that the service creation can take a few minutes.\n",
"\n",
"NOTE:\n",
"\n",
"The Docker image runs as a Linux container. If you are running Docker for Windows, you need to ensure the Linux Engine is running:\n",
"\n",
" # PowerShell command to switch to Linux engine\n",
" & 'C:\\Program Files\\Docker\\Docker\\DockerCli.exe' -SwitchLinuxEngine"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"sample-localwebservice-deploy"
]
},
"outputs": [],
"source": [
"from azureml.core.webservice import LocalWebservice\n",
"\n",
"# This is optional, if not provided Docker will choose a random unused port.\n",
"deployment_config = LocalWebservice.deploy_configuration(port=6789)\n",
"\n",
"local_service = Model.deploy(ws, \"test\", [model], inference_config, deployment_config)\n",
"\n",
"local_service.wait_for_deployment()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Local service port: {}'.format(local_service.port))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Check Status and Get Container Logs\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(local_service.get_logs())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test Web Service"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Call the web service with some input data to get a prediction."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"sample_input = json.dumps({\n",
" 'data': dataset_x[0:2].tolist()\n",
"})\n",
"\n",
"local_service.run(sample_input)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reload Service\n",
"\n",
"You can update your score.py file and then call `reload()` to quickly restart the service. This only reloads your execution script and dependency files; it does not rebuild the underlying Docker image. As a result, `reload()` is fast, but if you do need to rebuild the image (to add a new Conda or pip package, for instance), you will have to call `update()` instead (see below)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile score.py\n",
"import joblib\n",
"import json\n",
"import numpy as np\n",
"import os\n",
"\n",
"from inference_schema.schema_decorators import input_schema, output_schema\n",
"from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType\n",
"\n",
"def init():\n",
"    global model\n",
"    # AZUREML_MODEL_DIR is an environment variable created during deployment.\n",
"    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION)\n",
"    # For multiple models, it points to the folder containing all deployed models (./azureml-models)\n",
"    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'sklearn_regression_model.pkl')\n",
"    # Deserialize the model file back into a sklearn model.\n",
"    model = joblib.load(model_path)\n",
"\n",
"input_sample = np.array([[10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]])\n",
"output_sample = np.array([3726.995])\n",
"\n",
"@input_schema('data', NumpyParameterType(input_sample))\n",
"@output_schema(NumpyParameterType(output_sample))\n",
"def run(data):\n",
"    try:\n",
"        result = model.predict(data)\n",
"        # You can return any JSON-serializable object.\n",
"        return 'Hello from the updated score.py: ' + str(result.tolist())\n",
"    except Exception as e:\n",
"        error = str(e)\n",
"        return error"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_service.reload()\n",
"print(\"--------------------------------------------------------------\")\n",
"\n",
"# After calling reload(), run() will return the updated message.\n",
"local_service.run(sample_input)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Update Service\n",
"\n",
"If you want to change your model(s), Conda dependencies or deployment configuration, call `update()` to rebuild the Docker image.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_service.update(models=[model],\n",
" inference_config=inference_config,\n",
" deployment_config=deployment_config)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy model to AKS cluster based on the LocalWebservice's configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This is a one time setup for AKS Cluster. You can reuse this cluster for multiple deployments after it has been created. If you delete the cluster or the resource group that contains it, then you would have to recreate it.\n",
"from azureml.core.compute import AksCompute, ComputeTarget\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# Choose a name for your AKS cluster\n",
"aks_name = 'my-aks-9' \n",
"\n",
"# Verify the cluster does not exist already\n",
"try:\n",
" aks_target = ComputeTarget(workspace=ws, name=aks_name)\n",
" print('Found existing cluster, use it.')\n",
"except ComputeTargetException:\n",
" # Use the default configuration (can also provide parameters to customize)\n",
" prov_config = AksCompute.provisioning_configuration()\n",
"\n",
" # Create the cluster\n",
" aks_target = ComputeTarget.create(workspace = ws, \n",
" name = aks_name, \n",
" provisioning_configuration = prov_config)\n",
"\n",
"if aks_target.get_status() != \"Succeeded\":\n",
" aks_target.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.webservice import AksWebservice\n",
"# Set the web service configuration (using default here)\n",
"aks_config = AksWebservice.deploy_configuration()\n",
"\n",
"# # Enable token auth and disable (key) auth on the webservice\n",
"# aks_config = AksWebservice.deploy_configuration(token_auth_enabled=True, auth_enabled=False)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"aks_service_name ='aks-service-1'\n",
"\n",
"aks_service = local_service.deploy_to_cloud(name=aks_service_name,\n",
" deployment_config=aks_config,\n",
" deployment_target=aks_target)\n",
"\n",
"aks_service.wait_for_deployment(show_output = True)\n",
"print(aks_service.state)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Test aks service\n",
"\n",
"sample_input = json.dumps({\n",
" 'data': dataset_x[0:2].tolist()\n",
"})\n",
"\n",
"aks_service.run(sample_input)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Delete the service if not needed.\n",
"aks_service.delete()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Delete Service"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_service.delete()"
]
}
],
"metadata": {
"authors": [
{
"name": "keriehm"
}
],
"category": "tutorial",
"compute": [
"Local"
],
"datasets": [
"None"
],
"deployment": [
"Local"
],
"exclude_from_index": false,
"framework": [
"None"
],
"friendly_name": "Register a model and deploy locally",
"index_order": 1,
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
},
"star_tag": [],
"tags": [
"None"
],
"task": "Deployment"
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,373 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/deployment/production-deploy-to-aks/production-deploy-to-aks.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Deploy models to Azure Kubernetes Service (AKS) using controlled roll out\n",
"This notebook shows how to deploy multiple AKS webservices with the same scoring endpoint and how to roll out your models in a controlled manner by configuring the percentage of scoring traffic going to each webservice. If you are using a Notebook VM, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) to install the Azure Machine Learning Python SDK and create an Azure ML Workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check for latest version\n",
"import azureml.core\n",
"print(azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize workspace\n",
"Create a [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace%28class%29?view=azure-ml-py) object from your persisted configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.workspace import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Register the model\n",
"Register a file or folder as a model by calling [Model.register()](https://docs.microsoft.com/python/api/azureml-core/azureml.core.model.model?view=azure-ml-py#register-workspace--model-path--model-name--tags-none--properties-none--description-none--datasets-none--model-framework-none--model-framework-version-none--child-paths-none-).\n",
"In addition to the content of the model file itself, your registered model will also store model metadata -- model description, tags, and framework information -- that will be useful when managing and deploying models in your workspace. Using tags, for instance, you can categorize your models and apply filters when listing models in your workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Model\n",
"\n",
"model = Model.register(workspace=ws,\n",
" model_name='sklearn_regression_model.pkl', # Name of the registered model in your workspace.\n",
" model_path='./sklearn_regression_model.pkl', # Local file to upload and register as a model.\n",
" model_framework=Model.Framework.SCIKITLEARN, # Framework used to create the model.\n",
" model_framework_version='0.19.1', # Version of scikit-learn used to create the model.\n",
" description='Ridge regression model to predict diabetes progression.',\n",
" tags={'area': 'diabetes', 'type': 'regression'})\n",
"\n",
"print('Name:', model.name)\n",
"print('Version:', model.version)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Register an environment (for all models)\n",
"\n",
"If you want control over how your model is run, or if it has special runtime requirements, you can specify your own environment and scoring method.\n",
"\n",
"Specify the model's runtime environment by creating an [Environment](https://docs.microsoft.com/python/api/azureml-core/azureml.core.environment%28class%29?view=azure-ml-py) object and providing the [CondaDependencies](https://docs.microsoft.com/python/api/azureml-core/azureml.core.conda_dependencies.condadependencies?view=azure-ml-py) needed by your model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Environment\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"environment=Environment('my-sklearn-environment')\n",
"environment.python.conda_dependencies = CondaDependencies.create(conda_packages=[\n",
" 'pip==20.2.4'],\n",
" pip_packages=[\n",
" 'azureml-defaults',\n",
" 'inference-schema[numpy-support]',\n",
" 'numpy',\n",
" 'scikit-learn==0.22.1',\n",
" 'scipy'\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When using a custom environment, you must also provide Python code for initializing and running your model. An example script is included with this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open('score.py') as f:\n",
" print(f.read())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create the InferenceConfig\n",
"Create the inference configuration to reference your environment and entry script during deployment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.model import InferenceConfig\n",
"\n",
"inference_config = InferenceConfig(entry_script='score.py', \n",
" source_directory='.',\n",
" environment=environment)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Provision the AKS Cluster\n",
"If you already have an AKS cluster attached to this workspace, skip the step below and provide the name of the cluster.\n",
"\n",
"> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import AksCompute\n",
"from azureml.core.compute import ComputeTarget\n",
"# Use the default configuration (can also provide parameters to customize)\n",
"prov_config = AksCompute.provisioning_configuration()\n",
"\n",
"aks_name = 'my-aks' \n",
"# Create the cluster\n",
"aks_target = ComputeTarget.create(workspace = ws, \n",
" name = aks_name, \n",
" provisioning_configuration = prov_config) \n",
"aks_target.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create an Endpoint and add a version (AKS service)\n",
"This creates a new endpoint and adds a version behind it. By default, the first version added is the default version. You can set the percentage of traffic a version receives behind an endpoint with `traffic_percentile`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Deploy the model and create a new endpoint\n",
"from azureml.core.webservice import AksEndpoint\n",
"# from azureml.core.compute import ComputeTarget\n",
"\n",
"#select a created compute\n",
"compute = ComputeTarget(ws, 'my-aks')\n",
"namespace_name=\"endpointnamespace\"\n",
"# define the endpoint name\n",
"endpoint_name = \"myendpoint1\"\n",
"# define the version name\n",
"version_name= \"versiona\"\n",
"\n",
"endpoint_deployment_config = AksEndpoint.deploy_configuration(tags = {'modelVersion':'firstversion', 'department':'finance'}, \n",
" description = \"my first version\", namespace = namespace_name, \n",
" version_name = version_name, traffic_percentile = 40)\n",
"\n",
"endpoint = Model.deploy(ws, endpoint_name, [model], inference_config, endpoint_deployment_config, compute)\n",
"endpoint.wait_for_deployment(True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"endpoint.get_logs()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add another version of the service to an existing endpoint\n",
"This adds another version behind an existing endpoint. You can set the percentage of traffic the new version receives with `traffic_percentile`; if it is not specified, it defaults to 0. Any traffic not explicitly assigned across all versions (50 percent in this example) goes to the default version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Adding a new version to an existing Endpoint.\n",
"version_name_add=\"versionb\" \n",
"\n",
"endpoint.create_version(version_name = version_name_add, inference_config=inference_config, models=[model], tags = {'modelVersion':'secondversion', 'department':'finance'}, \n",
" description = \"my second version\", traffic_percentile = 10)\n",
"endpoint.wait_for_deployment(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Update an existing version in an endpoint\n",
"There are two types of versions: control and treatment. An endpoint contains one or more treatment versions but only one control version. This categorization helps compare the different versions against the defined control version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"endpoint.update_version(version_name=endpoint.versions[version_name_add].name, description=\"my second version update\", traffic_percentile=40, is_default=True, is_control_version_type=True)\n",
"endpoint.wait_for_deployment(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test the web service using run method\n",
"Test the web service by passing in data. The `run()` method retrieves API keys behind the scenes to make sure that the call is authenticated."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Scoring on endpoint\n",
"import json\n",
"test_sample = json.dumps({'data': [\n",
" [1,2,3,4,5,6,7,8,9,10], \n",
" [10,9,8,7,6,5,4,3,2,1]\n",
"]})\n",
"\n",
"test_sample_encoded = bytes(test_sample, encoding='utf8')\n",
"prediction = endpoint.run(input_data=test_sample_encoded)\n",
"print(prediction)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Delete Resources"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# deleting a version in an endpoint\n",
"endpoint.delete_version(version_name=version_name)\n",
"endpoint.wait_for_deployment(True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# deleting an endpoint, this will delete all versions in the endpoint and the endpoint itself\n",
"endpoint.delete()"
]
}
],
"metadata": {
"authors": [
{
"name": "shipatel"
}
],
"category": "deployment",
"compute": [
"None"
],
"datasets": [
"Diabetes"
],
"deployment": [
"Azure Kubernetes Service"
],
"exclude_from_index": false,
"framework": [
"Scikit-learn"
],
"friendly_name": "Deploy models to AKS using controlled roll out",
"index_order": 3,
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
},
"star_tag": [
"featured"
],
"tags": [
"None"
],
"task": "Deploy a model with Azure Machine Learning"
},
"nbformat": 4,
"nbformat_minor": 2
}
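
The traffic rule described in the notebook above (explicitly assigned `traffic_percentile` values, with any remainder going to the default version) can be sketched in plain Python. This is only a model of the rule as the notebook states it, not the service's routing implementation; the function name `effective_traffic` and its arguments are hypothetical.

```python
def effective_traffic(versions, default_version):
    """Return the effective share of traffic per version.

    versions: dict mapping version name -> explicitly assigned percentage.
    default_version: name of the version that absorbs unassigned traffic.
    """
    assigned = sum(versions.values())
    if assigned > 100:
        raise ValueError("assigned traffic exceeds 100 percent")
    if default_version not in versions:
        raise KeyError(default_version)
    split = dict(versions)
    # Unassigned traffic goes to the default version (rule stated in the notebook).
    split[default_version] += 100 - assigned
    return split


# After create_version: versiona=40 (default), versionb=10, so 50 is unassigned.
print(effective_traffic({"versiona": 40, "versionb": 10}, "versiona"))
# {'versiona': 90, 'versionb': 10}

# After update_version: versionb=40 and is now the default, so 20 is unassigned.
print(effective_traffic({"versiona": 40, "versionb": 40}, "versionb"))
# {'versiona': 40, 'versionb': 60}
```

Under this reading, promoting `versionb` to default shifts the unassigned share to it without changing the explicit 40 percent assigned to `versiona`.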


@@ -1,4 +0,0 @@
name: deploy-aks-with-controlled-rollout
dependencies:
- pip:
  - azureml-sdk


@@ -1,28 +0,0 @@
import pickle
import json
import numpy
from sklearn.externals import joblib
from sklearn.linear_model import Ridge
from azureml.core.model import Model


def init():
    global model
    # note here "sklearn_regression_model.pkl" is the name of the model registered under
    # this is a different behavior than before when the code is run locally, even though the code is the same.
    model_path = Model.get_model_path('sklearn_regression_model.pkl')
    # deserialize the model file back into a sklearn model
    model = joblib.load(model_path)


# note you can pass in multiple rows for scoring
def run(raw_data):
    try:
        data = json.loads(raw_data)['data']
        data = numpy.array(data)
        result = model.predict(data)
        # you can return any data type as long as it is JSON-serializable
        return result.tolist()
    except Exception as e:
        error = str(e)
        return error
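
The `init()`/`run()` contract used by the entry script above can be exercised locally without Azure. The following is a minimal sketch under stated assumptions: `StubModel` is a hypothetical stand-in for the unpickled Ridge model, and plain lists replace the NumPy array so that the example needs only the standard library.

```python
import json


class StubModel:
    """Hypothetical stand-in for the registered model: sums each input row."""

    def predict(self, rows):
        return [float(sum(row)) for row in rows]


model = None


def init():
    # The real script loads the registered model file; this sketch assigns a stub.
    global model
    model = StubModel()


def run(raw_data):
    try:
        data = json.loads(raw_data)["data"]
        # The result must be JSON-serializable, as in the original script.
        return model.predict(data)
    except Exception as e:
        # Mirror the original behavior: return the error text instead of raising.
        return str(e)


init()
payload = json.dumps({"data": [[1, 2, 3], [4, 5, 6]]})
print(run(payload))     # [6.0, 15.0]
print(run("not json"))  # an error message string, not an exception
```

Because `run()` catches every exception and returns its message, a caller cannot tell a failed request from a prediction by type alone; checking whether the result is a string is one way to tell them apart in a local test.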

Some files were not shown because too many files have changed in this diff.