Compare commits

...

30 Commits

Author SHA1 Message Date
amlrelsa-ms
badb620261 update samples from Release-163 as a part of SDK release 2022-09-20 21:11:25 +00:00
Harneet Virk
acf46100ae Merge pull request #1817 from Azure/release_update/Release-161
update samples from Release-161 as a part of  SDK release
2022-09-16 15:54:11 -07:00
amlrelsa-ms
cf2e3804d5 update samples from Release-161 as a part of SDK release 2022-09-16 20:16:37 +00:00
Harneet Virk
b7be42357f Merge pull request #1814 from Azure/release_update/Release-160
update samples from Release-160 as a part of  SDK release
2022-09-12 18:57:44 -07:00
amlrelsa-ms
3ac82c07ae update samples from Release-160 as a part of SDK release 2022-09-13 01:24:40 +00:00
Harneet Virk
9743c0a1fa Merge pull request #1755 from Azure/users/GitHubPolicyService/11f57c70-4141-4c68-9224-aceb8eab1c48
Adding Microsoft SECURITY.MD
2022-09-06 16:52:36 -07:00
Harneet Virk
ba4dac530e Merge pull request #1808 from Azure/release_update/Release-157
update samples from Release-157 as a part of  SDK release
2022-09-06 16:33:03 -07:00
amlrelsa-ms
7f7f0040fd update samples from Release-157 as a part of SDK release 2022-09-06 23:16:24 +00:00
Harneet Virk
9ca567cd9c Merge pull request #1802 from Azure/release_update/Release-156
update samples from Release-156 as a part of  SDK release
2022-08-18 17:23:55 -07:00
amlrelsa-ms
ae7b234ba0 update samples from Release-156 as a part of SDK release 2022-08-18 23:57:09 +00:00
Harneet Virk
9788d1965f Merge pull request #1799 from Azure/release_update/Release-155
update samples from Release-155 as a part of  SDK release
2022-08-12 14:18:11 -07:00
amlrelsa-ms
387e43a423 update samples from Release-155 as a part of SDK release 2022-08-12 20:38:16 +00:00
Harneet Virk
25f407fc81 Merge pull request #1796 from Azure/release_update/Release-154
update samples from Release-154 as a part of  SDK release
2022-08-10 11:36:05 -07:00
amlrelsa-ms
dcb2c4638f update samples from Release-154 as a part of SDK release 2022-08-10 18:10:45 +00:00
Harneet Virk
7fb5dd3ef9 Merge pull request #1795 from Azure/release_update/Release-153
update samples from Release-153 as a part of  SDK release
2022-08-09 15:39:30 -07:00
amlrelsa-ms
6a38f4bec3 update samples from Release-153 as a part of SDK release 2022-08-09 21:50:34 +00:00
Harneet Virk
aed078aeab Merge pull request #1793 from Azure/release_update/Release-152
update samples from Release-152 as a part of  SDK release
2022-08-08 11:51:52 -07:00
amlrelsa-ms
f999f41ed3 update samples from Release-152 as a part of SDK release 2022-08-08 17:27:37 +00:00
Harneet Virk
07e43ee7e4 Merge pull request #1791 from Azure/release_update/Release-151
update samples from Release-151 as a part of  SDK release
2022-08-05 13:12:57 -07:00
amlrelsa-ms
aac706c3f0 update samples from Release-151 as a part of SDK release 2022-08-05 20:01:34 +00:00
Harneet Virk
4ccb278051 Merge pull request #1789 from Azure/release_update/Release-150
update samples from Release-150 as a part of  SDK release
2022-08-04 12:08:14 -07:00
amlrelsa-ms
64a733480b update samples from Release-150 as a part of SDK release 2022-08-03 16:29:31 +00:00
Harneet Virk
dd0976f678 Merge pull request #1779 from Azure/release_update/Release-149
update samples from Release-149 as a part of  SDK release
2022-07-07 08:37:35 -07:00
amlrelsa-ms
15a3ca649d update samples from Release-149 as a part of SDK release 2022-07-07 00:18:42 +00:00
Harneet Virk
3c4770cfe5 Merge pull request #1776 from Azure/release_update/Release-148
update samples from Release-148 as a part of  SDK release
2022-07-01 13:41:03 -07:00
amlrelsa-ms
8d7de05908 update samples from Release-148 as a part of SDK release 2022-07-01 20:40:11 +00:00
Harneet Virk
863faae57f Merge pull request #1772 from Azure/release_update/Release-147
Update samples from Release-147 as a part of SDK release 1.43
2022-06-27 10:32:58 -07:00
amlrelsa-ms
8d3f5adcdb update samples from Release-147 as a part of SDK release 2022-06-27 17:29:38 +00:00
Harneet Virk
cd3394e129 Merge pull request #1771 from Azure/release_update/Release-146
update samples from Release-146 as a part of  SDK release
2022-06-20 14:31:06 -07:00
microsoft-github-policy-service[bot]
e0c9376aab Microsoft mandatory file 2022-05-25 17:12:16 +00:00
77 changed files with 1883 additions and 191 deletions

41
SECURITY.md Normal file
View File

@@ -0,0 +1,41 @@
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.7 BLOCK -->
## Security
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/opensource/security/definition), please report it to us as described below.
## Reporting Security Issues
**Please do not report security vulnerabilities through public GitHub issues.**
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/opensource/security/create-report).
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/opensource/security/pgpkey).
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://aka.ms/opensource/security/msrc).
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/opensource/security/bounty) page for more details about our active programs.
## Preferred Languages
We prefer all communications to be in English.
## Policy
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/opensource/security/cvd).
<!-- END MICROSOFT SECURITY.MD BLOCK -->

View File

@@ -103,7 +103,7 @@
"source": [
"import azureml.core\n",
"\n",
"print(\"This notebook was created using version 1.42.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.45.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},

View File

@@ -6,6 +6,7 @@ dependencies:
- fairlearn>=0.6.2
- joblib
- liac-arff
- raiwidgets~=0.18.1
- raiwidgets~=0.21.0
- itsdangerous==2.0.1
- markupsafe<2.1.0
- protobuf==3.20.0

View File

@@ -6,6 +6,7 @@ dependencies:
- fairlearn>=0.6.2
- joblib
- liac-arff
- raiwidgets~=0.18.1
- raiwidgets~=0.21.0
- itsdangerous==2.0.1
- markupsafe<2.1.0
- protobuf==3.20.0

View File

@@ -13,19 +13,24 @@ dependencies:
- pytorch::pytorch=1.4.0
- conda-forge::fbprophet==0.7.1
- cudatoolkit=10.1.243
- scipy==1.5.2
- scipy==1.5.3
- notebook
- pywin32==227
- PySocks==1.7.1
- jsonschema==4.5.1
- conda-forge::pyqt==5.12.3
- jsonschema==4.15.0
- jinja2<=2.11.2
- markupsafe<2.1.0
- tqdm==4.64.0
- pip:
# Required packages for AzureML execution, history, and data preparation.
- azureml-widgets~=1.42.0
- azureml-widgets~=1.45.0
- azureml-defaults~=1.45.0
- pytorch-transformers==1.0.0
- spacy==2.2.4
- pystan==2.19.1.1
- https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.42.0/validated_win32_requirements.txt [--no-deps]
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.45.0/validated_win32_requirements.txt [--no-deps]
- arch==4.14
- wasabi==0.9.1

View File

@@ -11,23 +11,26 @@ dependencies:
- boto3==1.20.19
- botocore<=1.23.19
- matplotlib==3.2.1
- numpy==1.19.5
- numpy>=1.21.6,<=1.22.3
- cython==0.29.14
- urllib3==1.26.7
- scipy>=1.4.1,<=1.5.2
- scipy>=1.4.1,<=1.5.3
- scikit-learn==0.22.1
- py-xgboost<=1.3.3
- holidays==0.10.3
- conda-forge::fbprophet==0.7.1
- pytorch::pytorch=1.4.0
- cudatoolkit=10.1.243
- jinja2<=2.11.2
- markupsafe<2.1.0
- pip:
# Required packages for AzureML execution, history, and data preparation.
- azureml-widgets~=1.42.0
- azureml-widgets~=1.45.0
- azureml-defaults~=1.45.0
- pytorch-transformers==1.0.0
- spacy==2.2.4
- pystan==2.19.1.1
- https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.42.0/validated_linux_requirements.txt [--no-deps]
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.45.0/validated_linux_requirements.txt [--no-deps]
- arch==4.14

View File

@@ -12,23 +12,26 @@ dependencies:
- boto3==1.20.19
- botocore<=1.23.19
- matplotlib==3.2.1
- numpy==1.19.5
- numpy>=1.21.6,<=1.22.3
- cython==0.29.14
- urllib3==1.26.7
- scipy>=1.4.1,<=1.5.2
- scipy>=1.4.1,<=1.5.3
- scikit-learn==0.22.1
- py-xgboost<=1.3.3
- holidays==0.10.3
- conda-forge::fbprophet==0.7.1
- pytorch::pytorch=1.4.0
- cudatoolkit=9.0
- jinja2<=2.11.2
- markupsafe<2.1.0
- pip:
# Required packages for AzureML execution, history, and data preparation.
- azureml-widgets~=1.42.0
- azureml-widgets~=1.45.0
- azureml-defaults~=1.45.0
- pytorch-transformers==1.0.0
- spacy==2.2.4
- pystan==2.19.1.1
- https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.42.0/validated_darwin_requirements.txt [--no-deps]
- -r https://automlsdkdataresources.blob.core.windows.net/validated-requirements/1.45.0/validated_darwin_requirements.txt [--no-deps]
- arch==4.14

View File

@@ -228,8 +228,8 @@
"n_missing_samples = int(np.floor(data.shape[0] * missing_rate))\n",
"missing_samples = np.hstack(\n",
" (\n",
" np.zeros(data.shape[0] - n_missing_samples, dtype=np.bool),\n",
" np.ones(n_missing_samples, dtype=np.bool),\n",
" np.zeros(data.shape[0] - n_missing_samples, dtype=bool),\n",
" np.ones(n_missing_samples, dtype=bool),\n",
" )\n",
")\n",
"rng = np.random.RandomState(0)\n",
@@ -1074,7 +1074,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.12"
},
"nteract": {
"version": "nteract-front-end@1.0.0"

View File

@@ -207,11 +207,11 @@
"\n",
"def remove_blanks_20news(data, feature_column_name, target_column_name):\n",
"\n",
" data[feature_column_name] = (\n",
" data[feature_column_name]\n",
" .replace(r\"\\n\", \" \", regex=True)\n",
" .apply(lambda x: x.strip())\n",
" )\n",
" for index, row in data.iterrows():\n",
" data.at[index, feature_column_name] = (\n",
" row[feature_column_name].replace(\"\\n\", \" \").strip()\n",
" )\n",
"\n",
" data = data[data[feature_column_name] != \"\"]\n",
"\n",
" return data"

View File

@@ -1,6 +1,5 @@
import pandas as pd
from azureml.core import Environment
from azureml.train.estimator import Estimator
from azureml.core import Environment, ScriptRunConfig
from azureml.core.run import Run
@@ -16,16 +15,19 @@ def run_inference(
inference_env = train_run.get_environment()
est = Estimator(
est = ScriptRunConfig(
source_directory=script_folder,
entry_script="infer.py",
script_params={
"--target_column_name": target_column_name,
"--model_name": model_name,
},
inputs=[test_dataset.as_named_input("test_data")],
script="infer.py",
arguments=[
"--target_column_name",
target_column_name,
"--model_name",
model_name,
"--input-data",
test_dataset.as_named_input("data"),
],
compute_target=compute_target,
environment_definition=inference_env,
environment=inference_env,
)
run = test_experiment.submit(

View File

@@ -6,7 +6,7 @@ import numpy as np
from sklearn.externals import joblib
from azureml.automl.runtime.shared.score import scoring, constants
from azureml.core import Run
from azureml.core import Run, Dataset
from azureml.core.model import Model
@@ -21,6 +21,8 @@ parser.add_argument(
"--model_name", type=str, dest="model_name", help="Name of registered model"
)
parser.add_argument("--input-data", type=str, dest="input_data", help="Dataset")
args = parser.parse_args()
target_column_name = args.target_column_name
model_name = args.model_name
@@ -34,8 +36,8 @@ model_path = Model.get_model_path(model_name)
model = joblib.load(model_path)
run = Run.get_context()
# get input dataset by name
test_dataset = run.input_datasets["test_data"]
test_dataset = Dataset.get_by_id(run.experiment.workspace, id=args.input_data)
X_test_df = test_dataset.drop_columns(
columns=[target_column_name]

View File

@@ -120,9 +120,13 @@ except Exception:
end_time = datetime(2021, 5, 1, 0, 0)
end_time_last_slice = end_time - relativedelta(weeks=2)
train_df = get_noaa_data(end_time_last_slice, end_time)
try:
train_df = get_noaa_data(end_time_last_slice, end_time)
except Exception as ex:
print("get_noaa_data failed:", ex)
train_df = None
if train_df.size > 0:
if train_df is not None and train_df.size > 0:
print(
"Received {0} rows of new data after {1}.".format(
train_df.shape[0], end_time_last_slice

View File

@@ -0,0 +1,346 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/classification-credit-card-fraud/custom-model-training-from-autofeaturization-run.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automated Machine Learning - Codegen for AutoFeaturization \n",
"_**Autofeaturization of credit card fraudulent transactions dataset on remote compute and codegen functionality**_\n",
"\n",
"## Contents\n",
"1. [Introduction](#Introduction)\n",
"1. [Setup](#Setup)\n",
"1. [Data](#Data)\n",
"1. [Autofeaturization](#Autofeaturization)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='Introduction'></a>\n",
"## Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Autofeaturization** lets you run an AutoML experiment to only featurize the datasets. These datasets along with the transformer are stored in AML Storage and linked to the run which can later be retrieved and used to train models. \n",
"\n",
"**To run Autofeaturization, set the number of iterations to zero and featurization as auto.**\n",
"\n",
"Please refer to [Autofeaturization and custom model training](../autofeaturization-custom-model-training/custom-model-training-from-autofeaturization-run.ipynb) for more details on the same.\n",
"\n",
"[Codegen](https://github.com/Azure/automl-codegen-preview) is a feature, which when enabled, provides a user with the script of the underlying functionality and a notebook to tweak inputs or code and rerun the same.\n",
"\n",
"In this example we use the credit card fraudulent transactions dataset to showcase how you can use AutoML for autofeaturization and further how you can enable the `Codegen` feature.\n",
"\n",
"This notebook is using remote compute to complete the featurization.\n",
"\n",
"If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the [configuration](../../configuration.ipynb) notebook first if you haven't already, to establish your connection to the AzureML Workspace. \n",
"\n",
"Here you will learn how to create an autofeaturization experiment using an existing workspace with codegen feature enabled."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='Setup'></a>\n",
"## Setup\n",
"\n",
"As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"import pandas as pd\n",
"import azureml.core\n",
"from azureml.core.experiment import Experiment\n",
"from azureml.core.workspace import Workspace\n",
"from azureml.core.dataset import Dataset\n",
"from azureml.train.automl import AutoMLConfig"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This sample notebook may use features that are not available in previous versions of the Azure ML SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.45.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# choose a name for experiment\n",
"experiment_name = 'automl-autofeaturization-ccard-codegen-remote'\n",
"\n",
"experiment=Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Experiment Name'] = experiment.name\n",
"outputDf = pd.DataFrame(data = output, index = [''])\n",
"outputDf.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create or Attach existing AmlCompute\n",
"A compute target is required to execute the Automated ML run. In this tutorial, you create AmlCompute as your training compute resource.\n",
"\n",
"> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.\n",
"\n",
"#### Creation of AmlCompute takes approximately 5 minutes. \n",
"If the AmlCompute with that name is already in your workspace this code will skip the creation process.\n",
"As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# Choose a name for your CPU cluster\n",
"cpu_cluster_name = \"cpu-cluster\"\n",
"\n",
"# Verify that cluster does not exist already\n",
"try:\n",
" compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)\n",
" print('Found existing cluster, use it.')\n",
"except ComputeTargetException:\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',\n",
" max_nodes=6)\n",
" compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)\n",
"\n",
"compute_target.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='Data'></a>\n",
"## Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Data\n",
"\n",
"Load the credit card fraudulent transactions dataset from a CSV file, containing both training features and labels. The features are inputs to the model, while the training labels represent the expected output of the model. \n",
"\n",
"Here the autofeaturization run will featurize the training data passed in."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Training Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"training_data = \"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard_train.csv\"\n",
"training_dataset = Dataset.Tabular.from_delimited_files(training_data) # Tabular dataset\n",
"\n",
"label_column_name = 'Class' # output label"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='Autofeaturization'></a>\n",
"## AutoFeaturization\n",
"\n",
"Instantiate an AutoMLConfig object. This defines the settings and data used to run the autofeaturization experiment.\n",
"\n",
"|Property|Description|\n",
"|-|-|\n",
"|**task**|classification or regression or forecasting|\n",
"|**training_data**|Input training dataset, containing both features and label column.|\n",
"|**iterations**|For an autofeaturization run, iterations will be 0.|\n",
"|**featurization**|For an autofeaturization run, featurization can be 'auto' or 'custom'.|\n",
"|**label_column_name**|The name of the label column.|\n",
"|**enable_code_generation**|For enabling codegen for the run, value would be True|\n",
"\n",
"**_You can find more information about primary metrics_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#primary-metric)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_config = AutoMLConfig(task = 'classification',\n",
" debug_log = 'automl_errors.log',\n",
" iterations = 0, # autofeaturization run can be triggered by setting iterations to 0\n",
" compute_target = compute_target,\n",
" training_data = training_dataset,\n",
" label_column_name = label_column_name,\n",
" featurization = 'auto',\n",
" verbosity = logging.INFO,\n",
" enable_code_generation = True # enable codegen\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Call the `submit` method on the experiment object and pass the run configuration. Depending on the data this can run for a while. Validation errors and current status will be shown when setting `show_output=True` and the execution will be synchronous."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run = experiment.submit(automl_config, show_output = False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Widget for Monitoring Runs\n",
"\n",
"The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n",
"\n",
"**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(remote_run).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run.wait_for_completion(show_output=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Codegen Script and Notebook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Codegen script and notebook can be found under the `Outputs + logs` section from the details page of the remote run. Please check for the `autofeaturization_notebook.ipynb` under `/outputs/generated_code`. To modify the featurization code, open `script.py` and make changes. The codegen notebook can be run with the same environment configuration as the above AutoML run."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Experiment Complete!"
]
}
],
"metadata": {
"authors": [
{
"name": "bhavanatumma"
}
],
"interpreter": {
"hash": "adb464b67752e4577e3dc163235ced27038d19b7d88def00d75d1975bde5d9ab"
},
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -0,0 +1,4 @@
name: codegen-for-autofeaturization
dependencies:
- pip:
- azureml-sdk

View File

@@ -0,0 +1,735 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/classification-credit-card-fraud/custom-model-training-from-autofeaturization-run.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automated Machine Learning - AutoFeaturization (Part 1)\n",
"_**Autofeaturization of credit card fraudulent transactions dataset on remote compute**_\n",
"\n",
"## Contents\n",
"1. [Introduction](#Introduction)\n",
"1. [Setup](#Setup)\n",
"1. [Data](#Data)\n",
"1. [Autofeaturization](#Autofeaturization)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='Introduction'></a>\n",
"## Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Autofeaturization is a new feature to let you as the user run an AutoML experiment to only featurize the datasets. These datasets along with the transformer will be stored in the experiment which can later be retrieved and used to train models, either via AutoML or custom training. \n",
"\n",
"**To run Autofeaturization, pass in zero iterations and featurization as auto. This will featurize the datasets and terminate the experiment. Training will not occur.**\n",
"\n",
"*Limitations - Sparse data cannot be supported at the moment. Any dataset that has extensive categorical data might be featurized into sparse data which will not be allowed as input to AutoML. Efforts are underway to support sparse data and will be updated soon.* \n",
"\n",
"In this example we use the credit card fraudulent transactions dataset to showcase how you can use AutoML for autofeaturization. The goal is to clean and featurize the training dataset.\n",
"\n",
"This notebook is using remote compute to complete the featurization.\n",
"\n",
"If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the [configuration](../../configuration.ipynb) notebook first if you haven't already, to establish your connection to the AzureML Workspace. \n",
"\n",
"In the below steps, you will learn how to:\n",
"1. Create an autofeaturization experiment using an existing workspace.\n",
"2. View the featurized datasets and transformer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='Setup'></a>\n",
"## Setup\n",
"\n",
"As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"import pandas as pd\n",
"import azureml.core\n",
"from azureml.core.experiment import Experiment\n",
"from azureml.core.workspace import Workspace\n",
"from azureml.core.dataset import Dataset\n",
"from azureml.train.automl import AutoMLConfig"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This sample notebook may use features that are not available in previous versions of the Azure ML SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.45.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# choose a name for experiment\n",
"experiment_name = 'automl-autofeaturization-ccard-remote'\n",
"\n",
"experiment=Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Experiment Name'] = experiment.name\n",
"outputDf = pd.DataFrame(data = output, index = [''])\n",
"outputDf.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create or Attach existing AmlCompute\n",
"A compute target is required to execute the Automated ML run. In this tutorial, you create AmlCompute as your training compute resource.\n",
"\n",
"> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.\n",
"\n",
"#### Creation of AmlCompute takes approximately 5 minutes. \n",
"If the AmlCompute with that name is already in your workspace this code will skip the creation process.\n",
"As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# Choose a name for your CPU cluster\n",
"cpu_cluster_name = \"cpu-cluster\"\n",
"\n",
"# Verify that cluster does not exist already\n",
"try:\n",
" compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)\n",
" print('Found existing cluster, use it.')\n",
"except ComputeTargetException:\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',\n",
" max_nodes=6)\n",
" compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)\n",
"\n",
"compute_target.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='Data'></a>\n",
"## Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Data\n",
"\n",
"Load the credit card fraudulent transactions dataset from a CSV file, containing both training features and labels. The features are inputs to the model, while the training labels represent the expected output of the model. \n",
"\n",
"Here the autofeaturization run will featurize the training data passed in."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Training Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"training_data = \"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard_train.csv\"\n",
"training_dataset = Dataset.Tabular.from_delimited_files(training_data) # Tabular dataset\n",
"\n",
"label_column_name = 'Class' # output label"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='Autofeaturization'></a>\n",
"## AutoFeaturization\n",
"\n",
"Instantiate an AutoMLConfig object. This defines the settings and data used to run the autofeaturization experiment.\n",
"\n",
"|Property|Description|\n",
"|-|-|\n",
"|**task**|classification or regression|\n",
"|**training_data**|Input training dataset, containing both features and label column.|\n",
"|**iterations**|For an autofeaturization run, iterations will be 0.|\n",
"|**featurization**|For an autofeaturization run, featurization will be 'auto'.|\n",
"|**label_column_name**|The name of the label column.|\n",
"\n",
"**_You can find more information about primary metrics_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#primary-metric)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_config = AutoMLConfig(task = 'classification',\n",
" debug_log = 'automl_errors.log',\n",
" iterations = 0, # autofeaturization run can be triggered by setting iterations to 0\n",
" compute_target = compute_target,\n",
" training_data = training_dataset,\n",
" label_column_name = label_column_name,\n",
" featurization = 'auto',\n",
" verbosity = logging.INFO\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Call the `submit` method on the experiment object and pass the run configuration. Depending on the data this can run for a while. Validation errors and current status will be shown when setting `show_output=True` and the execution will be synchronous."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run = experiment.submit(automl_config, show_output = False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transformer and Featurized Datasets\n",
"The given datasets have been featurized and stored under `Outputs + logs` from the details page of the remote run. The structure is shown below. The featurized dataset is stored under `/outputs/featurization/data` and the transformer is saved under `/outputs/featurization/pipeline` \n",
"\n",
"Below you will learn how to refer to the data saved in your run and retrieve the same."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Featurized Data](https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/autofeaturization_img.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Widget for Monitoring Runs\n",
"\n",
"The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n",
"\n",
"**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(remote_run).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run.wait_for_completion(show_output=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automated Machine Learning - AutoFeaturization (Part 2)\n",
"_**Training using a custom model with the featurized data from Autofeaturization run of credit card fraudulent transactions dataset**_\n",
"\n",
"## Contents\n",
"1. [Introduction](#Introduction)\n",
"1. [Data Setup](#DataSetup)\n",
"1. [Autofeaturization Data](#AutofeaturizationData)\n",
"1. [Train](#Train)\n",
"1. [Results](#Results)\n",
"1. [Test](#Test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='Introduction'></a>\n",
"## Introduction\n",
"\n",
"Here we use the featurized dataset saved in the above run to showcase how you can perform custom training by using the transformer from an autofeaturization run to transform validation / test datasets. \n",
"\n",
"The goal is to use autofeaturized run data and transformer to transform and run a custom training experiment independently\n",
"\n",
"In the below steps, you will learn how to:\n",
"1. Read transformer from a completed autofeaturization run and transform data\n",
"2. Pull featurized data from a completed autofeaturization run\n",
"3. Run a custom training experiment with the above data\n",
"4. Check results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='DataSetup'></a>\n",
"## Data Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will load the featurized training data and also load the transformer from the above autofeaturized run. This transformer can then be used to transform the test data to check the accuracy of the custom model after training."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Test Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"load test dataset from CSV and split into X and y columns to featurize with the transformer going forward."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_data = \"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard_test.csv\"\n",
"\n",
"test_dataset = pd.read_csv(test_data)\n",
"label_column_name = 'Class'\n",
"\n",
"X_test_data = test_dataset[test_dataset.columns.difference([label_column_name])]\n",
"y_test_data = test_dataset[label_column_name].values\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load data_transformer from the above remote run artifact"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### (Method 1)\n",
"\n",
"Method 1 allows you to read the transformer from the remote storage."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import mlflow\n",
"mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())\n",
"\n",
"# Set uri to fetch data transformer from remote parent run.\n",
"artifact_path = \"/outputs/featurization/pipeline/\"\n",
"uri = \"runs:/\" + remote_run.id + artifact_path\n",
"\n",
"print(uri)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### (Method 2)\n",
"\n",
"Method 2 downloads the transformer to the local directory and then can be used to transform the data. Uncomment to use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' import pathlib\n",
"\n",
"# Download the transformer to the local directory\n",
"transformers_file_path = \"/outputs/featurization/pipeline/\"\n",
"local_path = \"./transformer\"\n",
"remote_run.download_files(prefix=transformers_file_path, output_directory=local_path, batch_size=500)\n",
"\n",
"path = pathlib.Path(\"transformer\") \n",
"path = str(path.absolute()) + transformers_file_path\n",
"str_uri = \"file:///\" + path\n",
"\n",
"print(str_uri) '''"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transform Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** Not all datasets produce a y_transformer. The dataset used in the current notebook requires a transformer as the y column data is categorical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.automl.core.shared.constants import Transformers\n",
"\n",
"transformers = mlflow.sklearn.load_model(uri) # Using method 1\n",
"data_transformers = transformers.get_transformers()\n",
"x_transformer = data_transformers[Transformers.X_TRANSFORMER]\n",
"y_transformer = data_transformers[Transformers.Y_TRANSFORMER]\n",
"\n",
"X_test = x_transformer.transform(X_test_data)\n",
"y_test = y_transformer.transform(y_test_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following cell to see the featurization summary of X and y transformers. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_data_summary = x_transformer.get_featurization_summary(is_user_friendly=False)\n",
"\n",
"summary_df = pd.DataFrame.from_records(X_data_summary)\n",
"summary_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Datastore\n",
"\n",
"The below data store holds the featurized datasets, hence we load and access the data. Check the path and file names according to the saved structure in your experiment `Outputs + logs` as seen in <i>Autofeaturization Part 1</i>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.datastore import Datastore\n",
"\n",
"ds = Datastore.get(ws, \"workspaceartifactstore\")\n",
"experiment_loc = \"ExperimentRun/dcid.\" + remote_run.id\n",
"\n",
"remote_data_path = \"/outputs/featurization/data/\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='AutofeaturizationData'></a>\n",
"## Autofeaturization Data\n",
"\n",
"We will load the training data from the previously completed Autofeaturization experiment. The resulting featurized dataframe can be passed into the custom model for training. Here we are saving the file to local from the experiment storage and reading the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_data_file_path = \"full_training_dataset.df.parquet\"\n",
"local_data_path = \"./data/\" + train_data_file_path\n",
"\n",
"remote_run.download_file(remote_data_path + train_data_file_path, local_data_path)\n",
"\n",
"full_training_data = pd.read_parquet(local_data_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another way to load the data is to go to the above autofeaturization experiment and check for the featurized dataset ids under `Output datasets`. Uncomment and replace them accordingly below to use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# train_data = Dataset.get_by_id(ws, 'cb4418ee-bac4-45ac-b055-600653bdf83a') # replace the featurized full_training_dataset id\n",
"# full_training_data = train_data.to_pandas_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are dropping the y column and weights column from the featurized training dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y_COLUMN = \"automl_y\"\n",
"SW_COLUMN = \"automl_weights\"\n",
"\n",
"X_train = full_training_data[full_training_data.columns.difference([Y_COLUMN, SW_COLUMN])]\n",
"y_train = full_training_data[Y_COLUMN].values\n",
"sample_weight = full_training_data[SW_COLUMN].values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='Train'></a>\n",
"## Train"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we are passing our training data to the lightgbm classifier, any custom model can be used with your data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import lightgbm as lgb\n",
"\n",
"model = lgb.LGBMClassifier(learning_rate=0.08,max_depth=-5,random_state=42)\n",
"model.fit(X_train, y_train, sample_weight=sample_weight, eval_set=[(X_test, y_test),(X_train, y_train)],\n",
" verbose=20,eval_metric='logloss')\n",
"\n",
"print('Training accuracy {:.4f}'.format(model.score(X_train, y_train)))\n",
"print('Testing accuracy {:.4f}'.format(model.score(X_test, y_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='Results'></a>\n",
"## Analyze results\n",
"\n",
"### Retrieve the Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='Test'></a>\n",
"## Test the fitted model\n",
"\n",
"Now that the model is trained, split the data in the same way the data was split for training (The difference here is the data is being split locally) and then run the test data through the trained model to get the predicted values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_pred = model.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculate metrics for the prediction\n",
"\n",
"Now visualize the data on a scatter plot to show what our truth (actual) values are compared to the predicted values \n",
"from the trained model that was returned."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"from matplotlib import pyplot as plt\n",
"import numpy as np\n",
"import itertools\n",
"\n",
"cf =confusion_matrix(y_test,y_pred)\n",
"plt.imshow(cf,cmap=plt.cm.Blues,interpolation='nearest')\n",
"plt.colorbar()\n",
"plt.title('Confusion Matrix')\n",
"plt.xlabel('Predicted')\n",
"plt.ylabel('Actual')\n",
"class_labels = ['False','True']\n",
"tick_marks = np.arange(len(class_labels))\n",
"plt.xticks(tick_marks,class_labels)\n",
"plt.yticks([-0.5,0,1,1.5],['','False','True',''])\n",
"# plotting text value inside cells\n",
"thresh = cf.max() / 2.\n",
"for i,j in itertools.product(range(cf.shape[0]),range(cf.shape[1])):\n",
" plt.text(j,i,format(cf[i,j],'d'),horizontalalignment='center',color='white' if cf[i,j] >thresh else 'black')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Experiment Complete!"
]
}
],
"metadata": {
"authors": [
{
"name": "bhavanatumma"
}
],
"interpreter": {
"hash": "adb464b67752e4577e3dc163235ced27038d19b7d88def00d75d1975bde5d9ab"
},
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -0,0 +1,4 @@
name: custom-model-training-from-autofeaturization-run
dependencies:
- pip:
- azureml-sdk

View File

@@ -3,20 +3,21 @@ dependencies:
# The python interpreter version.
# Currently Azure ML only supports 3.6.0 and later.
- pip<=20.2.4
- python>=3.6.0,<3.9
- python>=3.6.0,<3.10
- cython==0.29.14
- urllib3==1.26.7
- PyJWT < 2.0.0
- numpy==1.18.5
- numpy==1.21.6
- pywin32==227
- cryptography<37.0.0
- pip:
# Required packages for AzureML execution, history, and data preparation.
- azure-mgmt-core==1.3.0
- azure-core==1.21.1
- azure-core==1.24.1
- azure-identity==1.7.0
- azureml-defaults
- azureml-sdk
- azureml-widgets
- azureml-mlflow
- pandas
- mlflow

View File

@@ -7,18 +7,19 @@ dependencies:
# Currently Azure ML only supports 3.6.0 and later.
- pip<=20.2.4
- nomkl
- python>=3.6.0,<3.9
- python>=3.6.0,<3.10
- urllib3==1.26.7
- PyJWT < 2.0.0
- numpy==1.19.5
- numpy>=1.21.6,<=1.22.3
- cryptography<37.0.0
- pip:
# Required packages for AzureML execution, history, and data preparation.
- azure-mgmt-core==1.3.0
- azure-core==1.21.1
- azure-core==1.24.1
- azure-identity==1.7.0
- azureml-defaults
- azureml-sdk
- azureml-widgets
- azureml-mlflow
- pandas
- mlflow

View File

@@ -92,7 +92,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.42.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.45.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},

View File

@@ -91,7 +91,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.42.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.45.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},

View File

@@ -324,7 +324,8 @@
"| **experiment_timeout_hours** | Maximum amount of time in hours that the experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |\n",
"| **label_column_name** | The name of the label column. |\n",
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
"| **n_cross_validations** | Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"| **n_cross_validations** | Number of cross validation splits. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value.\n",
"| **time_column_name** | The name of your time column. |\n",
"| **time_series_id_column_names** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |\n",
"| **track_child_runs** | Flag to disable tracking of child runs. Only best run is tracked if the flag is set to False (this includes the model and metrics of the run). |\n",
@@ -353,7 +354,8 @@
" \"iterations\": 15,\n",
" \"experiment_timeout_hours\": 0.25, # This also needs to be changed based on the dataset. For larger data set this number needs to be bigger.\n",
" \"label_column_name\": TARGET_COLNAME,\n",
" \"n_cross_validations\": 3,\n",
" \"n_cross_validations\": \"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" \"cv_step_size\": \"auto\",\n",
" \"time_column_name\": TIME_COLNAME,\n",
" \"forecast_horizon\": 6,\n",
" \"time_series_id_column_names\": partition_column_names,\n",
@@ -718,7 +720,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.5"
},
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,

View File

@@ -283,7 +283,8 @@
"| **experiment_timeout_hours** | Maximum amount of time in hours that the experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |\n",
"| **label_column_name** | The name of the label column. |\n",
"| **max_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
"| **n_cross_validations** | Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"| **n_cross_validations** | Number of cross validation splits. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value.\n",
"| **time_column_name** | The name of your time column. |\n",
"| **grain_column_names** | The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp. |"
]
@@ -301,7 +302,8 @@
" \"iterations\": 15,\n",
" \"experiment_timeout_hours\": 1, # This also needs to be changed based on the dataset. For larger data set this number needs to be bigger.\n",
" \"label_column_name\": LABEL_COLUMN_NAME,\n",
" \"n_cross_validations\": 3,\n",
" \"n_cross_validations\": \"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" \"cv_step_size\": \"auto\",\n",
" \"time_column_name\": TIME_COLUMN_NAME,\n",
" \"max_horizon\": FORECAST_HORIZON,\n",
" \"track_child_runs\": False,\n",
@@ -524,7 +526,7 @@
"metadata": {},
"outputs": [],
"source": [
"model_list = Model.list(ws, tags={\"experiment\": \"automl-backtesting\"})\n",
"model_list = Model.list(ws, tags=[[\"experiment\", \"automl-backtesting\"]])\n",
"model_data = {\"name\": [], \"last_training_date\": []}\n",
"for model in model_list:\n",
" if (\n",
@@ -712,7 +714,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.5"
},
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,

View File

@@ -72,6 +72,8 @@ def get_backtest_pipeline(
run_config.docker.use_docker = True
run_config.environment = env
utilities.set_environment_variables_for_run(run_config)
split_data = PipelineData(name="split_data_output", datastore=None).as_dataset()
split_step = PythonScriptStep(
name="split_data_for_backtest",
@@ -114,6 +116,7 @@ def get_backtest_pipeline(
run_invocation_timeout=3600,
node_count=node_count,
)
utilities.set_environment_variables_for_run(back_test_config)
forecasts = PipelineData(name="forecasts", datastore=None)
if model_name:
parallel_step_name = "{}-backtest".format(model_name.replace("_", "-"))
@@ -149,12 +152,7 @@ def get_backtest_pipeline(
inputs=[forecasts.as_mount()],
outputs=[data_results],
source_directory=PROJECT_FOLDER,
arguments=[
"--forecasts",
forecasts,
"--output-dir",
data_results,
],
arguments=["--forecasts", forecasts, "--output-dir", data_results],
runconfig=run_config,
compute_target=compute_target,
allow_reuse=False,

View File

@@ -265,7 +265,8 @@
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
"|**country_or_region_for_holidays**|The country/region used to generate holiday features. These should be ISO 3166 two-letter country/region codes (i.e. 'US', 'GB').|\n",
"|**target_lags**|The target_lags specifies how far back we will construct the lags of the target variable.|\n",
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n",
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value."
]
},
{
@@ -285,7 +286,7 @@
"|**training_data**|Input dataset, containing both features and label column.|\n",
"|**label_column_name**|The name of the label column.|\n",
"|**compute_target**|The remote compute for training.|\n",
"|**n_cross_validations**|Number of cross validation splits.|\n",
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value.\n",
"|**enable_early_stopping**|If early stopping is on, training will stop when the primary metric is no longer improving.|\n",
"|**forecasting_parameters**|A class that holds all the forecasting related parameters.|\n",
"\n",
@@ -350,6 +351,7 @@
" country_or_region_for_holidays=\"US\", # set country_or_region will trigger holiday featurizer\n",
" target_lags=\"auto\", # use heuristic based lag setting\n",
" freq=\"D\", # Set the forecast frequency to be daily\n",
" cv_step_size=\"auto\",\n",
")\n",
"\n",
"automl_config = AutoMLConfig(\n",
@@ -362,7 +364,7 @@
" label_column_name=target_column_name,\n",
" compute_target=compute_target,\n",
" enable_early_stopping=True,\n",
" n_cross_validations=3,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" max_concurrent_iterations=4,\n",
" max_cores_per_iteration=-1,\n",
" verbosity=logging.INFO,\n",
@@ -709,7 +711,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
"version": "3.8.5"
},
"mimetype": "text/x-python",
"name": "python",
@@ -719,7 +721,12 @@
"Forecasting"
],
"task": "Forecasting",
"version": 3
"version": 3,
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,
"nbformat_minor": 4

View File

@@ -308,7 +308,8 @@
"|-|-|\n",
"|**time_column_name**|The name of your time column.|\n",
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n",
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value."
]
},
{
@@ -328,7 +329,7 @@
"|**training_data**|The training data to be used within the experiment.|\n",
"|**label_column_name**|The name of the label column.|\n",
"|**compute_target**|The remote compute for training.|\n",
"|**n_cross_validations**|Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way.|\n",
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value.\n",
"|**enable_early_stopping**|Flag to enble early termination if the score is not improving in the short term.|\n",
"|**forecasting_parameters**|A class holds all the forecasting related parameters.|\n"
]
@@ -352,6 +353,7 @@
" time_column_name=time_column_name,\n",
" forecast_horizon=forecast_horizon,\n",
" freq=\"H\", # Set the forecast frequency to be hourly\n",
" cv_step_size=\"auto\",\n",
")\n",
"\n",
"automl_config = AutoMLConfig(\n",
@@ -363,7 +365,7 @@
" label_column_name=target_column_name,\n",
" compute_target=compute_target,\n",
" enable_early_stopping=True,\n",
" n_cross_validations=3,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" verbosity=logging.INFO,\n",
" forecasting_parameters=forecasting_parameters,\n",
")"
@@ -609,6 +611,7 @@
" forecast_horizon=forecast_horizon,\n",
" target_lags=12,\n",
" target_rolling_window_size=4,\n",
" cv_step_size=\"auto\",\n",
")\n",
"\n",
"automl_config = AutoMLConfig(\n",
@@ -628,7 +631,7 @@
" label_column_name=target_column_name,\n",
" compute_target=compute_target,\n",
" enable_early_stopping=True,\n",
" n_cross_validations=3,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" verbosity=logging.INFO,\n",
" forecasting_parameters=advanced_forecasting_parameters,\n",
")"
@@ -778,7 +781,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.5"
},
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,

View File

@@ -335,7 +335,8 @@
" forecast_horizon=forecast_horizon,\n",
" time_series_id_column_names=[TIME_SERIES_ID_COLUMN_NAME],\n",
" target_lags=lags,\n",
" freq=\"H\", # Set the forecast frequency to be hourly\n",
" freq=\"H\", # Set the forecast frequency to be hourly,\n",
" cv_step_size=\"auto\",\n",
")"
]
},
@@ -365,7 +366,7 @@
" enable_early_stopping=True,\n",
" training_data=train_data,\n",
" compute_target=compute_target,\n",
" n_cross_validations=3,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" verbosity=logging.INFO,\n",
" max_concurrent_iterations=4,\n",
" max_cores_per_iteration=-1,\n",
@@ -647,13 +648,11 @@
" & (fulldata[time_column_name] <= forecast_origin + horizon)\n",
" ]\n",
"\n",
" y_past = X_past.pop(target_column_name).values.astype(np.float)\n",
" y_future = X_future.pop(target_column_name).values.astype(np.float)\n",
" y_past = X_past.pop(target_column_name).values.astype(float)\n",
" y_future = X_future.pop(target_column_name).values.astype(float)\n",
"\n",
" # Now take y_future and turn it into question marks\n",
" y_query = y_future.copy().astype(\n",
" np.float\n",
" ) # because sometimes life hands you an int\n",
" y_query = y_future.copy().astype(float) # because sometimes life hands you an int\n",
" y_query.fill(np.NaN)\n",
"\n",
" print(\"X_past is \" + str(X_past.shape) + \" - shaped\")\n",
@@ -881,13 +880,18 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.8.5"
},
"tags": [
"Forecasting",
"Confidence Intervals"
],
"task": "Forecasting"
"task": "Forecasting",
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,
"nbformat_minor": 2

View File

@@ -95,7 +95,7 @@ def do_rolling_forecast_with_lookback(
# Extract test data from an expanding window up-to the horizon
expand_wind = X[time_column_name] < horizon_time
X_test_expand = X[expand_wind]
y_query_expand = np.zeros(len(X_test_expand)).astype(np.float)
y_query_expand = np.zeros(len(X_test_expand)).astype(float)
y_query_expand.fill(np.NaN)
if origin_time != X[time_column_name].min():
@@ -176,7 +176,7 @@ def do_rolling_forecast(fitted_model, X_test, y_test, max_horizon, freq="D"):
# Extract test data from an expanding window up-to the horizon
expand_wind = X_test[time_column_name] < horizon_time
X_test_expand = X_test[expand_wind]
y_query_expand = np.zeros(len(X_test_expand)).astype(np.float)
y_query_expand = np.zeros(len(X_test_expand)).astype(float)
y_query_expand.fill(np.NaN)
if origin_time != X_test[time_column_name].min():

View File

@@ -263,7 +263,8 @@
"| **experiment_timeout_hours** | Maximum amount of time in hours that the experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |\n",
"| **label_column_name** | The name of the label column. |\n",
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
"| **n_cross_validations** | Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value.\n",
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value.\n",
"| **enable_early_stopping** | Flag to enable early termination if the score is not improving in the short term. |\n",
"| **time_column_name** | The name of your time column. |\n",
"| **hierarchy_column_names** | The names of columns that define the hierarchical structure of the data from highest level to most granular. |\n",
@@ -311,10 +312,11 @@
" \"track_child_runs\": False,\n",
" \"pipeline_fetch_max_batch_size\": 15,\n",
" \"model_explainability\": model_explainability,\n",
" \"n_cross_validations\": \"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" \"cv_step_size\": \"auto\",\n",
" # The following settings are specific to this sample and should be adjusted according to your own needs.\n",
" \"iteration_timeout_minutes\": 10,\n",
" \"iterations\": 10,\n",
" \"n_cross_validations\": 2,\n",
"}\n",
"\n",
"hts_parameters = HTSTrainParameters(\n",

View File

@@ -0,0 +1,122 @@
---
page_type: sample
languages:
- python
products:
- azure-machine-learning
description: Tutorial showing how to solve a complex machine learning time series forecasting problems at scale by using Azure Automated ML and Many Models solution accelerator.
---
![Many Models Solution Accelerator Banner](images/mmsa.png)
# Many Models Solution Accelerator
<!--
Guidelines on README format: https://review.docs.microsoft.com/help/onboard/admin/samples/concepts/readme-template?branch=master
Guidance on onboarding samples to docs.microsoft.com/samples: https://review.docs.microsoft.com/help/onboard/admin/samples/process/onboarding?branch=master
Taxonomies for products and languages: https://review.docs.microsoft.com/new-hope/information-architecture/metadata/taxonomies?branch=master
-->
In the real world, many problems can be too complex to be solved by a single machine learning model. Whether that be predicting sales for each individual store, building a predictive maintanence model for hundreds of oil wells, or tailoring an experience to individual users, building a model for each instance can lead to improved results on many machine learning problems.
This Pattern is very common across a wide variety of industries and applicable to many real world use cases. Below are some examples we have seen where this pattern is being used.
- Energy and utility companies building predictive maintenancemodelsforthousands of oil wells, hundreds of wind turbines or hundreds of smart meters
- Retail organizations building workforce optimization models for thousands of stores, campaign promotion propensity models, Price optimization models for hundreds of thousands of products they sell
- Restaurant chains buildingdemand forecasting models across thousands ofrestaurants
- Banks and financial institutes building models for cash replenishmentfor ATM Machine and for several ATMsor building personalized models for individuals
- Enterprises building revenue forecasting modelsat each division level
- Document management companies building text analytics and legal document search models per each state
Azure Machine Learning (AML) makes it easy to train, operate, and manage hundreds or even thousands of models. This repo will walk you through the end to end process of creating a many models solution from training to scoring to monitoring.
## Prerequisites
To use this solution accelerator, all you need is access to an [Azure subscription](https://azure.microsoft.com/free/) and an [Azure Machine Learning Workspace](https://docs.microsoft.com/azure/machine-learning/how-to-manage-workspace) that you'll create below.
While it's not required, a basic understanding of Azure Machine Learning will be helpful for understanding the solution. The following resources can help introduce you to AML:
1. [Azure Machine Learning Overview](https://azure.microsoft.com/services/machine-learning/)
2. [Azure Machine Learning Tutorials](https://docs.microsoft.com/azure/machine-learning/tutorial-1st-experiment-sdk-setup)
3. [Azure Machine Learning Sample Notebooks on Github](https://github.com/Azure/azureml-examples)
## Getting started
### 1. Deploy Resources
Start by deploying the resources to Azure. The button below will deploy Azure Machine Learning and its related resources:
<a href="https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fmicrosoft%2Fsolution-accelerator-many-models%2Fmaster%2Fazuredeploy.json" target="_blank">
<img src="http://azuredeploy.net/deploybutton.png"/>
</a>
### 2. Configure Development Environment
Next you'll need to configure your [development environment](https://docs.microsoft.com/azure/machine-learning/how-to-configure-environment) for Azure Machine Learning. We recommend using a [Compute Instance](https://docs.microsoft.com/azure/machine-learning/how-to-configure-environment#compute-instance) as it's the fastest way to get up and running.
### 3. Run Notebooks
Once your development environment is set up, run through the Jupyter Notebooks sequentially following the steps outlined. By the end, you'll know how to train, score, and make predictions using the many models pattern on Azure Machine Learning.
![Sequence of Notebooks](./images/mmsa-overview.png)
## Contents
In this repo, you'll train and score a forecasting model for each orange juice brand and for each store at a (simulated) grocery chain. By the end, you'll have forecasted sales by using up to 11,973 models to predict sales for the next few weeks.
The data used in this sample is simulated based on the [Dominick's Orange Juice Dataset](http://www.cs.unitn.it/~taufer/QMMA/L10-OJ-Data.html#(1)), sales data from a Chicago area grocery store.
<img src="images/Flow_map.png" width="1000">
### Using Automated ML to train the models:
The [`auto-ml-forecasting-many-models.ipynb`](./auto-ml-forecasting-many-models.ipynb) noteboook is a guided solution accelerator that demonstrates steps from data preparation, to model training, and forecasting on train models as well as operationalizing the solution.
## How-to-videos
Watch these how-to-videos for a step by step walk-through of the many model solution accelerator to learn how to setup your models using Automated ML.
### Automated ML
[![Watch the video](https://media.giphy.com/media/dWUKfameudyNGRnp1t/giphy.gif)](https://channel9.msdn.com/Shows/Docs-AI/Building-Large-Scale-Machine-Learning-Forecasting-Models-using-Azure-Machine-Learnings-Automated-ML)
## Key concepts
### ParallelRunStep
[ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.parallel_run_step.parallelrunstep?view=azure-ml-py) enables the parallel training of models and is commonly used for batch inferencing. This [document](https://docs.microsoft.com/azure/machine-learning/how-to-use-parallel-run-step) walks through some of the key concepts around ParallelRunStep.
### Pipelines
[Pipelines](https://docs.microsoft.com/azure/machine-learning/concept-ml-pipelines) allow you to create workflows in your machine learning projects. These workflows have a number of benefits including speed, simplicity, repeatability, and modularity.
### Automated Machine Learning
[Automated Machine Learning](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml) also referred to as automated ML or AutoML, is the process of automating the time consuming, iterative tasks of machine learning model development. It allows data scientists, analysts, and developers to build ML models with high scale, efficiency, and productivity all while sustaining model quality.
### Other Concepts
In additional to ParallelRunStep, Pipelines and Automated Machine Learning, you'll also be working with the following concepts including [workspace](https://docs.microsoft.com/azure/machine-learning/concept-workspace), [datasets](https://docs.microsoft.com/azure/machine-learning/concept-data#datasets), [compute targets](https://docs.microsoft.com/azure/machine-learning/concept-compute-target#train), [python script steps](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.python_script_step.pythonscriptstep?view=azure-ml-py), and [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/).
## Contributing
This project welcomes contributions and suggestions. To learn more visit the [contributing](../../../CONTRIBUTING.md) section.
Most contributions require you to agree to a Contributor License Agreement (CLA)
declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

View File

@@ -391,7 +391,8 @@
"| **experiment_timeout_hours** | Maximum amount of time in hours that the experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |\n",
"| **label_column_name** | The name of the label column. |\n",
"| **forecast_horizon** | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly). Periods are inferred from your data. |\n",
"| **n_cross_validations** | Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"| **n_cross_validations** | Number of cross validation splits. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value. Rolling Origin Validation is used to split time-series in a temporally consistent way. |\n",
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value.\n",
"| **enable_early_stopping** | Flag to enable early termination if the score is not improving in the short term. |\n",
"| **time_column_name** | The name of your time column. |\n",
"| **enable_engineered_explanations** | Engineered feature explanations will be downloaded if enable_engineered_explanations flag is set to True. By default it is set to False to save storage space. |\n",
@@ -423,7 +424,8 @@
" \"iterations\": 15,\n",
" \"experiment_timeout_hours\": 0.25,\n",
" \"label_column_name\": \"Quantity\",\n",
" \"n_cross_validations\": 3,\n",
" \"n_cross_validations\": \"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" \"cv_step_size\": \"auto\",\n",
" \"time_column_name\": \"WeekStarting\",\n",
" \"drop_column_names\": \"Revenue\",\n",
" \"forecast_horizon\": 6,\n",
@@ -849,7 +851,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.8.5"
},
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,

Binary file not shown.

After

Width:  |  Height:  |  Size: 32 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 306 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 2.6 MiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 106 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 158 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 80 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 68 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 631 KiB

View File

@@ -0,0 +1,39 @@
from pathlib import Path
from azureml.core import Run
import argparse
import os
def main(args):
output = Path(args.output)
output.mkdir(parents=True, exist_ok=True)
run_context = Run.get_context()
input_path = run_context.input_datasets["train_10_models"]
for file_name in os.listdir(input_path):
input_file = os.path.join(input_path, file_name)
with open(input_file, "r") as f:
content = f.read()
# Apply any data pre-processing techniques here
output_file = os.path.join(output, file_name)
with open(output_file, "w") as f:
f.write(content)
def my_parse_args():
parser = argparse.ArgumentParser("Test")
parser.add_argument("--input", type=str)
parser.add_argument("--output", type=str)
args = parser.parse_args()
return args
if __name__ == "__main__":
args = my_parse_args()
main(args)

View File

@@ -0,0 +1,31 @@
from pathlib import Path
from azureml.core import Run
import argparse
def main(args):
output = Path(args.output)
output.mkdir(parents=True, exist_ok=True)
run_context = Run.get_context()
dataset = run_context.input_datasets["train_10_models"]
df = dataset.to_pandas_dataframe()
# Apply any data pre-processing techniques here
df.to_parquet(output / "data_prepared_result.parquet", compression=None)
def my_parse_args():
parser = argparse.ArgumentParser("Test")
parser.add_argument("--input", type=str)
parser.add_argument("--output", type=str)
args = parser.parse_args()
return args
if __name__ == "__main__":
args = my_parse_args()
main(args)

View File

@@ -0,0 +1,3 @@
dependencies:
- pip:
- azureml-contrib-automl-pipeline-steps

View File

@@ -368,7 +368,8 @@
"|**time_column_name**|The name of your time column.|\n",
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
"|**time_series_id_column_names**|The column names used to uniquely identify the time series in data that has multiple rows with the same timestamp. If the time series identifiers are not defined, the data set is assumed to be one time series.|\n",
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n",
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value."
]
},
{
@@ -390,7 +391,7 @@
"In the first case, AutoML loops over all time-series in your dataset and trains one model (e.g. AutoArima or Prophet, as the case may be) for each series. This can result in long runtimes to train these models if there are a lot of series in the data. One way to mitigate this problem is to fit models for different series in parallel if you have multiple compute cores available. To enable this behavior, set the `max_cores_per_iteration` parameter in your AutoMLConfig as shown in the example in the next cell. \n",
"\n",
"\n",
"Finally, a note about the cross-validation (CV) procedure for time-series data. AutoML uses out-of-sample error estimates to select a best pipeline/model, so it is important that the CV fold splitting is done correctly. Time-series can violate the basic statistical assumptions of the canonical K-Fold CV strategy, so AutoML implements a [rolling origin validation](https://robjhyndman.com/hyndsight/tscv/) procedure to create CV folds for time-series data. To use this procedure, you just need to specify the desired number of CV folds in the AutoMLConfig object. It is also possible to bypass CV and use your own validation set by setting the *validation_data* parameter of AutoMLConfig.\n",
"Finally, a note about the cross-validation (CV) procedure for time-series data. AutoML uses out-of-sample error estimates to select a best pipeline/model, so it is important that the CV fold splitting is done correctly. Time-series can violate the basic statistical assumptions of the canonical K-Fold CV strategy, so AutoML implements a [rolling origin validation](https://robjhyndman.com/hyndsight/tscv/) procedure to create CV folds for time-series data. To use this procedure, you could specify the desired number of CV folds and the number of periods between two consecutive folds in the AutoMLConfig object, or AutoMl could set them automatically if you don't specify them. It is also possible to bypass CV and use your own validation set by setting the *validation_data* parameter of AutoMLConfig.\n",
"\n",
"Here is a summary of AutoMLConfig parameters used for training the OJ model:\n",
"\n",
@@ -403,7 +404,7 @@
"|**training_data**|Input dataset, containing both features and label column.|\n",
"|**label_column_name**|The name of the label column.|\n",
"|**compute_target**|The remote compute for training.|\n",
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection|\n",
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection. The default value is \"auto\", in which case AutoMl determines the number of cross-validations automatically, if a validation set is not provided. Or users could specify an integer value.\n",
"|**enable_voting_ensemble**|Allow AutoML to create a Voting ensemble of the best performing models|\n",
"|**enable_stack_ensemble**|Allow AutoML to create a Stack ensemble of the best performing models|\n",
"|**debug_log**|Log file path for writing debugging information|\n",
@@ -424,6 +425,7 @@
" forecast_horizon=n_test_periods,\n",
" time_series_id_column_names=time_series_id_column_names,\n",
" freq=\"W-THU\", # Set the forecast frequency to be weekly (start on each Thursday)\n",
" cv_step_size=\"auto\",\n",
")\n",
"\n",
"automl_config = AutoMLConfig(\n",
@@ -436,7 +438,7 @@
" compute_target=compute_target,\n",
" enable_early_stopping=True,\n",
" featurization=featurization_config,\n",
" n_cross_validations=3,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" verbosity=logging.INFO,\n",
" max_cores_per_iteration=-1,\n",
" forecasting_parameters=forecasting_parameters,\n",
@@ -833,12 +835,17 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.5"
},
"tags": [
"None"
],
"task": "Forecasting"
"task": "Forecasting",
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,
"nbformat_minor": 4

View File

@@ -237,11 +237,11 @@
"\n",
"datastore = ws.get_default_datastore()\n",
"train_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
" train, target=(datastore, \"dataset/\"), name=\"dominicks_OJ_train\"\n",
" train, target=(datastore, \"dataset/\"), name=\"dominicks_OJ_train_pipeline\"\n",
")\n",
"\n",
"test_dataset = TabularDatasetFactory.register_pandas_dataframe(\n",
" test, target=(datastore, \"dataset/\"), name=\"dominicks_OJ_test\"\n",
" test, target=(datastore, \"dataset/\"), name=\"dominicks_OJ_test_pipeline\"\n",
")"
]
},
@@ -292,7 +292,8 @@
"|**time_column_name**|The name of your time column.|\n",
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
"|**time_series_id_column_names**|The column names used to uniquely identify the time series in data that has multiple rows with the same timestamp. If the time series identifiers are not defined, the data set is assumed to be one time series.|\n",
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n",
"|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is \"auto\", in which case AutoMl determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value."
]
},
{
@@ -307,7 +308,8 @@
" time_column_name=time_column_name,\n",
" forecast_horizon=n_test_periods,\n",
" time_series_id_column_names=time_series_id_column_names,\n",
" freq=\"W-THU\", # Set the forecast frequency to be weekly (start on each Thursday)\n",
" freq=\"W-THU\", # Set the forecast frequency to be weekly (start on each Thursday),\n",
" cv_step_size=\"auto\",\n",
")\n",
"\n",
"automl_config = AutoMLConfig(\n",
@@ -319,7 +321,7 @@
" label_column_name=target_column_name,\n",
" compute_target=compute_target,\n",
" enable_early_stopping=True,\n",
" n_cross_validations=5,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" verbosity=logging.INFO,\n",
" max_cores_per_iteration=-1,\n",
" forecasting_parameters=forecasting_parameters,\n",
@@ -811,12 +813,17 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.5"
},
"tags": [
"None"
],
"task": "Forecasting"
"task": "Forecasting",
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,
"nbformat_minor": 4

View File

@@ -23,11 +23,7 @@ except ImportError:
def infer_forecasting_dataset_tcn(
X_test,
y_test,
model,
output_path,
output_dataset_name="results",
X_test, y_test, model, output_path, output_dataset_name="results"
):
y_pred, df_all = model.forecast(X_test, y_test)
@@ -71,10 +67,7 @@ def get_model(model_path, model_file_name):
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--model_name",
type=str,
dest="model_name",
help="Model to be loaded",
"--model_name", type=str, dest="model_name", help="Model to be loaded"
)
parser.add_argument(
@@ -108,18 +101,13 @@ def get_args():
return args
def get_data(
run,
fitted_model,
target_column_name,
test_dataset_name,
):
def get_data(run, fitted_model, target_column_name, test_dataset_name):
# get input dataset by name
test_dataset = Dataset.get_by_name(run.experiment.workspace, test_dataset_name)
test_df = test_dataset.to_pandas_dataframe()
if target_column_name in test_df:
y_test = test_df.pop(target_column_name)
y_test = test_df.pop(target_column_name).values
else:
y_test = np.full(test_df.shape[0], np.nan)
@@ -159,10 +147,7 @@ if __name__ == "__main__":
fitted_model = get_model(model_path, model_file_name)
X_test_df, y_test = get_data(
run,
fitted_model,
target_column_name,
test_dataset_name,
run, fitted_model, target_column_name, test_dataset_name
)
infer_forecasting_dataset_tcn(

View File

@@ -358,7 +358,8 @@
" enable_early_stopping=True,\n",
" training_data=train_dataset,\n",
" label_column_name=TARGET_COLNAME,\n",
" n_cross_validations=5,\n",
" n_cross_validations=\"auto\", # Feel free to set to a small integer (>=2) if runtime is an issue.\n",
" cv_step_size=\"auto\",\n",
" verbosity=logging.INFO,\n",
" max_cores_per_iteration=-1,\n",
" compute_target=compute_target,\n",
@@ -585,7 +586,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.5"
},
"vscode": {
"interpreter": {
"hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca"
}
}
},
"nbformat": 4,

View File

@@ -69,17 +69,19 @@
"# ONNX Model Zoo and save it in the same folder as this tutorial\n",
"\n",
"import urllib.request\n",
"import os\n",
"\n",
"onnx_model_url = \"https://github.com/onnx/models/blob/main/vision/body_analysis/emotion_ferplus/model/emotion-ferplus-7.tar.gz?raw=true\"\n",
"\n",
"urllib.request.urlretrieve(onnx_model_url, filename=\"emotion-ferplus-7.tar.gz\")\n",
"os.mkdir(\"emotion_ferplus\")\n",
"\n",
"# the ! magic command tells our jupyter notebook kernel to run the following line of \n",
"# code from the command line instead of the notebook kernel\n",
"\n",
"# We use tar and xvcf to unzip the files we just retrieved from the ONNX model zoo\n",
"\n",
"!tar xvzf emotion-ferplus-7.tar.gz"
"!tar xvzf emotion-ferplus-7.tar.gz -C emotion_ferplus"
]
},
{
@@ -130,7 +132,7 @@
"metadata": {},
"outputs": [],
"source": [
"model_dir = \"emotion_ferplus\" # replace this with the location of your model files\n",
"model_dir = \"emotion_ferplus/model\" # replace this with the location of your model files\n",
"\n",
"# leave as is if it's in the same folder as this notebook"
]
@@ -496,13 +498,12 @@
"\n",
"# to use parsers to read in our model/data\n",
"import json\n",
"import os\n",
"\n",
"test_inputs = []\n",
"test_outputs = []\n",
"\n",
"# read in 3 testing images from .pb files\n",
"test_data_size = 3\n",
"# read in 1 testing images from .pb files\n",
"test_data_size = 1\n",
"\n",
"for num in np.arange(test_data_size):\n",
" input_test_data = os.path.join(model_dir, 'test_data_set_{0}'.format(num), 'input_0.pb')\n",
@@ -533,7 +534,7 @@
},
"source": [
"### Show some sample images\n",
"We use `matplotlib` to plot 3 test images from the dataset."
"We use `matplotlib` to plot 1 test images from the dataset."
]
},
{
@@ -547,7 +548,7 @@
"outputs": [],
"source": [
"plt.figure(figsize = (20, 20))\n",
"for test_image in np.arange(3):\n",
"for test_image in np.arange(test_data_size):\n",
" test_inputs[test_image].reshape(1, 64, 64)\n",
" plt.subplot(1, 8, test_image+1)\n",
" plt.axhline('')\n",

View File

@@ -69,10 +69,12 @@
"# ONNX Model Zoo and save it in the same folder as this tutorial\n",
"\n",
"import urllib.request\n",
"import os\n",
"\n",
"onnx_model_url = \"https://github.com/onnx/models/blob/main/vision/classification/mnist/model/mnist-7.tar.gz?raw=true\"\n",
"\n",
"urllib.request.urlretrieve(onnx_model_url, filename=\"mnist-7.tar.gz\")"
"urllib.request.urlretrieve(onnx_model_url, filename=\"mnist-7.tar.gz\")\n",
"os.mkdir(\"mnist\")"
]
},
{
@@ -86,7 +88,7 @@
"\n",
"# We use tar and xvcf to unzip the files we just retrieved from the ONNX model zoo\n",
"\n",
"!tar xvzf mnist-7.tar.gz"
"!tar xvzf mnist-7.tar.gz -C mnist"
]
},
{
@@ -137,7 +139,7 @@
"metadata": {},
"outputs": [],
"source": [
"model_dir = \"mnist\" # replace this with the location of your model files\n",
"model_dir = \"mnist/model\" # replace this with the location of your model files\n",
"\n",
"# leave as is if it's in the same folder as this notebook"
]
@@ -447,13 +449,12 @@
"\n",
"# to use parsers to read in our model/data\n",
"import json\n",
"import os\n",
"\n",
"test_inputs = []\n",
"test_outputs = []\n",
"\n",
"# read in 3 testing images from .pb files\n",
"test_data_size = 3\n",
"# read in 1 testing images from .pb files\n",
"test_data_size = 1\n",
"\n",
"for i in np.arange(test_data_size):\n",
" input_test_data = os.path.join(model_dir, 'test_data_set_{0}'.format(i), 'input_0.pb')\n",
@@ -486,7 +487,7 @@
},
"source": [
"### Show some sample images\n",
"We use `matplotlib` to plot 3 test images from the dataset."
"We use `matplotlib` to plot 1 test images from the dataset."
]
},
{
@@ -500,7 +501,7 @@
"outputs": [],
"source": [
"plt.figure(figsize = (16, 6))\n",
"for test_image in np.arange(3):\n",
"for test_image in np.arange(test_data_size):\n",
" plt.subplot(1, 15, test_image+1)\n",
" plt.axhline('')\n",
" plt.axvline('')\n",

View File

@@ -2,6 +2,8 @@
# Licensed under the MIT license.
from azureml.core.run import Run
from azureml.interpret import ExplanationClient
from interpret_community.adapter import ExplanationAdapter
import joblib
import os
import shap
@@ -11,9 +13,11 @@ OUTPUT_DIR = './outputs/'
os.makedirs(OUTPUT_DIR, exist_ok=True)
run = Run.get_context()
client = ExplanationClient.from_run(run)
# get a dataset on income prediction
X, y = shap.datasets.adult()
features = X.columns.values
# train an XGBoost model (but any other tree model type should work)
model = xgboost.XGBClassifier()
@@ -26,6 +30,12 @@ shap_values = explainer(X_shap)
print("computed shap values:")
print(shap_values)
# Use the explanation adapter to convert the importances into an interpret-community
# style explanation which can be uploaded to AzureML or visualized in the
# ExplanationDashboard widget
adapter = ExplanationAdapter(features, classification=True)
global_explanation = adapter.create_global(shap_values.values, X_shap, expected_values=shap_values.base_values)
# write X_shap out as a pickle file for later visualization
x_shap_pkl = 'x_shap.pkl'
with open(x_shap_pkl, 'wb') as file:
@@ -42,3 +52,8 @@ with open(model_file_name, 'wb') as file:
run.upload_file('xgboost_model.pkl', os.path.join('./outputs/', model_file_name))
original_model = run.register_model(model_name='xgboost_with_gpu_tree_explainer',
model_path='xgboost_model.pkl')
# Uploading model explanation data for storage or visualization in webUX
# The explanation can then be downloaded on any compute
comment = 'Global explanation on classification model trained on adult census income dataset'
client.upload_model_explanation(global_explanation, comment=comment, model_id=original_model.id)

View File

@@ -106,7 +106,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.42.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.45.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
@@ -225,36 +225,72 @@
"\n",
"from azureml.core import Environment\n",
"\n",
"environment_name = \"shap-gpu-tree\"\n",
"\n",
"environment_name = \"shapgpu\"\n",
"env = Environment(environment_name)\n",
"\n",
"env.docker.enabled = True\n",
"env.docker.base_image = None\n",
"env.docker.base_dockerfile = \"\"\"\n",
"FROM rapidsai/rapidsai:cuda10.0-devel-ubuntu18.04\n",
"\n",
"\n",
"# Note: this is to pin the pandas and xgboost versions to be same as notebook.\n",
"# In production scenario user would choose their dependencies\n",
"import pkg_resources\n",
"available_packages = pkg_resources.working_set\n",
"pandas_ver = None\n",
"numpy_ver = None\n",
"for dist in list(available_packages):\n",
" if dist.key == 'pandas':\n",
" pandas_ver = dist.version\n",
"pandas_dep = 'pandas'\n",
"numpy_dep = 'numpy'\n",
"if pandas_ver:\n",
" pandas_dep = 'pandas=={}'.format(pandas_ver)\n",
"if numpy_ver:\n",
" numpy_dep = 'numpy=={}'.format(numpy_ver)\n",
"\n",
"# Note: we build shap at commit 690245 for Tesla K80 GPUs\n",
"env.docker.base_dockerfile = f\"\"\"\n",
"FROM nvidia/cuda:10.2-devel-ubuntu18.04\n",
"ENV PATH=\"/root/miniconda3/bin:${{PATH}}\"\n",
"ARG PATH=\"/root/miniconda3/bin:${{PATH}}\"\n",
"RUN apt-get update && \\\n",
"apt-get install -y fuse && \\\n",
"apt-get install -y build-essential && \\\n",
"apt-get install -y python3-dev && \\\n",
"source activate rapids && \\\n",
"apt-get install -y wget && \\\n",
"apt-get install -y git && \\\n",
"rm -rf /var/lib/apt/lists/* && \\\n",
"wget \\\n",
"https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \\\n",
"mkdir /root/.conda && \\\n",
"bash Miniconda3-latest-Linux-x86_64.sh -b && \\\n",
"rm -f Miniconda3-latest-Linux-x86_64.sh && \\\n",
"conda init bash && \\\n",
". ~/.bashrc && \\\n",
"conda create -n shapgpu python=3.8 && \\\n",
"conda activate shapgpu && \\\n",
"apt-get install -y g++ && \\\n",
"printenv && \\\n",
"echo \"which nvcc: \" && \\\n",
"which nvcc && \\\n",
"pip install azureml-defaults && \\\n",
"pip install azureml-telemetry && \\\n",
"pip install azureml-interpret && \\\n",
"pip install {pandas_dep} && \\\n",
"cd /usr/local/src && \\\n",
"git clone https://github.com/slundberg/shap && \\\n",
"git clone https://github.com/slundberg/shap.git --single-branch && \\\n",
"cd shap && \\\n",
"git reset --hard 690245c6ab043edf40cfce3d8438a62e29ab599f && \\\n",
"mkdir build && \\\n",
"python setup.py install --user && \\\n",
"pip uninstall -y xgboost && \\\n",
"rm /conda/envs/rapids/lib/libxgboost.so && \\\n",
"pip install xgboost==1.4.2\n",
"conda install py-xgboost==1.3.3 && \\\n",
"pip uninstall -y numpy && \\\n",
"conda install {numpy_dep} \\\n",
"\"\"\"\n",
"\n",
"env.python.user_managed_dependencies = True\n",
"env.python.interpreter_path = '/root/miniconda3/envs/shapgpu/bin/python'\n",
"\n",
"from azureml.core import Run\n",
"from azureml.core import ScriptRunConfig\n",
@@ -266,6 +302,176 @@
"run = experiment.submit(config=src)\n",
"run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"# Shows output of the run on stdout.\n",
"run.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run.get_metrics()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download \n",
"1. Download model explanation data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.interpret import ExplanationClient\n",
"\n",
"# Get model explanation data\n",
"client = ExplanationClient.from_run(run)\n",
"global_explanation = client.download_model_explanation()\n",
"local_importance_values = global_explanation.local_importance_values\n",
"expected_values = global_explanation.expected_values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get the top k (e.g., 4) most important features with their importance values\n",
"global_explanation_topk = client.download_model_explanation(top_k=4)\n",
"global_importance_values = global_explanation_topk.get_ranked_global_values()\n",
"global_importance_names = global_explanation_topk.get_ranked_global_names()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('global importance values: {}'.format(global_importance_values))\n",
"print('global importance names: {}'.format(global_importance_names))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Download model file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve model for visualization and deployment\n",
"from azureml.core.model import Model\n",
"import joblib\n",
"original_model = Model(ws, 'xgboost_with_gpu_tree_explainer')\n",
"model_path = original_model.download(exist_ok=True)\n",
"original_model = joblib.load(model_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. Download test dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve x_test for visualization\n",
"x_test_path = './x_shap_adult_census.pkl'\n",
"run.download_file('x_shap_adult_census.pkl', output_file_path=x_test_path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x_test = joblib.load('x_shap_adult_census.pkl')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize\n",
"Load the visualization dashboard"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from raiwidgets import ExplanationDashboard"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from interpret_community.common.model_wrapper import wrap_model\n",
"from interpret_community.dataset.dataset_wrapper import DatasetWrapper\n",
"# note we need to wrap the XGBoost model to output predictions and probabilities in the scikit-learn format\n",
"class WrappedXGBoostModel(object):\n",
" \"\"\"A class for wrapping an XGBoost model to output integer predicted classes.\"\"\"\n",
"\n",
" def __init__(self, model):\n",
" self.model = model\n",
"\n",
" def predict(self, dataset):\n",
" return self.model.predict(dataset).astype(int)\n",
"\n",
" def predict_proba(self, dataset):\n",
" return self.model.predict_proba(dataset)\n",
"\n",
"wrapped_model = WrappedXGBoostModel(wrap_model(original_model, DatasetWrapper(x_test), model_task='classification'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ExplanationDashboard(global_explanation, wrapped_model, dataset=x_test)"
]
}
],
"metadata": {

View File

@@ -1,5 +1,18 @@
name: train-explain-model-gpu-tree-explainer
dependencies:
- py-xgboost==1.3.3
- pip:
- azureml-sdk
- azureml-interpret
- flask
- flask-cors
- gevent>=1.3.6
- ipython
- matplotlib
- ipywidgets
- raiwidgets~=0.21.0
- itsdangerous==2.0.1
- markupsafe<2.1.0
- scipy>=1.5.3
- protobuf==3.20.0
- jinja2==3.0.3

View File

@@ -269,23 +269,29 @@
"available_packages = pkg_resources.working_set\n",
"sklearn_ver = None\n",
"pandas_ver = None\n",
"joblib_ver = None\n",
"for dist in list(available_packages):\n",
" if dist.key == 'scikit-learn':\n",
" sklearn_ver = dist.version\n",
" elif dist.key == 'pandas':\n",
" pandas_ver = dist.version\n",
" elif dist.key == 'joblib':\n",
" joblib_ver = dist.version\n",
"sklearn_dep = 'scikit-learn'\n",
"pandas_dep = 'pandas'\n",
"joblib_dep = 'joblib'\n",
"if sklearn_ver:\n",
" sklearn_dep = 'scikit-learn=={}'.format(sklearn_ver)\n",
"if pandas_ver:\n",
" pandas_dep = 'pandas=={}'.format(pandas_ver)\n",
"if joblib_ver:\n",
" joblib_dep = 'joblib=={}'.format(joblib_ver)\n",
"# Specify CondaDependencies obj\n",
"# The CondaDependencies specifies the conda and pip packages that are installed in the environment\n",
"# the submitted job is run in. Note the remote environment(s) needs to be similar to the local\n",
"# environment, otherwise if a model is trained or deployed in a different environment this can\n",
"# cause errors. Please take extra care when specifying your dependencies in a production environment.\n",
"azureml_pip_packages.extend([sklearn_dep, pandas_dep])\n",
"azureml_pip_packages.extend([sklearn_dep, pandas_dep, joblib_dep])\n",
"run_config.environment.python.conda_dependencies = CondaDependencies.create(pip_packages=azureml_pip_packages, python_version=python_version)\n",
"\n",
"from azureml.core import ScriptRunConfig\n",

View File

@@ -6,11 +6,13 @@ dependencies:
- flask
- flask-cors
- gevent>=1.3.6
- jinja2
- ipython
- matplotlib
- azureml-dataset-runtime
- ipywidgets
- raiwidgets~=0.18.1
- raiwidgets~=0.21.0
- itsdangerous==2.0.1
- markupsafe<2.1.0
- scipy>=1.5.3
- protobuf==3.20.0
- jinja2==3.0.3

View File

@@ -6,11 +6,13 @@ dependencies:
- flask
- flask-cors
- gevent>=1.3.6
- jinja2
- ipython
- matplotlib
- ipywidgets
- raiwidgets~=0.18.1
- raiwidgets~=0.21.0
- packaging>=20.9
- itsdangerous==2.0.1
- markupsafe<2.1.0
- scipy>=1.5.3
- protobuf==3.20.0
- jinja2==3.0.3

View File

@@ -18,7 +18,9 @@ def init():
original_model_path = Model.get_model_path('local_deploy_model')
scoring_explainer_path = Model.get_model_path('IBM_attrition_explainer')
# Load the original model into the environment
original_model = joblib.load(original_model_path)
# Load the scoring explainer into the environment
scoring_explainer = joblib.load(scoring_explainer_path)
@@ -29,5 +31,15 @@ def run(raw_data):
predictions = original_model.predict(data)
# Retrieve model explanations
local_importance_values = scoring_explainer.explain(data)
# Retrieve the feature names, which we may want to return to the user.
# Note: you can also get the raw_features and engineered_features
# by calling scoring_explainer.raw_features and
# scoring_explainer.engineered_features but you may need to pass
# the raw or engineered feature names in the ScoringExplainer
# constructor, depending on if you are using feature maps or
# transformations on the original explainer.
features = scoring_explainer.features
# You can return any data type as long as it is JSON-serializable
return {'predictions': predictions.tolist(), 'local_importance_values': local_importance_values}
return {'predictions': predictions.tolist(),
'local_importance_values': local_importance_values,
'features': features}

View File

@@ -340,17 +340,29 @@
"available_packages = pkg_resources.working_set\n",
"sklearn_ver = None\n",
"pandas_ver = None\n",
"numpy_ver = None\n",
"numba_ver = None\n",
"for dist in available_packages:\n",
" if dist.key == 'scikit-learn':\n",
" sklearn_ver = dist.version\n",
" elif dist.key == 'numpy':\n",
" numpy_ver = dist.version\n",
" elif dist.key == 'numba':\n",
" numba_ver = dist.version\n",
" elif dist.key == 'pandas':\n",
" pandas_ver = dist.version\n",
"sklearn_dep = 'scikit-learn'\n",
"pandas_dep = 'pandas'\n",
"numpy_dep = 'numpy'\n",
"numba_dep = 'numba'\n",
"if sklearn_ver:\n",
" sklearn_dep = 'scikit-learn=={}'.format(sklearn_ver)\n",
"if pandas_ver:\n",
" pandas_dep = 'pandas=={}'.format(pandas_ver)\n",
"if numpy_ver:\n",
" numpy_dep = 'numpy=={}'.format(numpy_ver)\n",
"if numba_ver:\n",
" numba_dep = 'numba=={}'.format(numba_ver)\n",
"# Specify CondaDependencies obj\n",
"# The CondaDependencies specifies the conda and pip packages that are installed in the environment\n",
"# the submitted job is run in. Note the remote environment(s) needs to be similar to the local\n",
@@ -358,8 +370,8 @@
"# cause errors. Please take extra care when specifying your dependencies in a production environment.\n",
"myenv = CondaDependencies.create(\n",
" python_version=python_version,\n",
" conda_packages=['pip==20.2.4'],\n",
" pip_packages=['pyyaml', sklearn_dep, pandas_dep] + azureml_pip_packages)\n",
" conda_packages=['pip==20.2.4', numpy_dep],\n",
" pip_packages=['pyyaml', sklearn_dep, pandas_dep, numba_dep] + azureml_pip_packages)\n",
"\n",
"with open(\"myenv.yml\",\"w\") as f:\n",
" f.write(myenv.serialize_to_string())\n",

View File

@@ -6,11 +6,13 @@ dependencies:
- flask
- flask-cors
- gevent>=1.3.6
- jinja2
- ipython
- matplotlib
- ipywidgets
- raiwidgets~=0.18.1
- raiwidgets~=0.21.0
- packaging>=20.9
- itsdangerous==2.0.1
- markupsafe<2.1.0
- scipy>=1.5.3
- protobuf==3.20.0
- jinja2==3.0.3

View File

@@ -277,23 +277,29 @@
"available_packages = pkg_resources.working_set\n",
"sklearn_ver = None\n",
"pandas_ver = None\n",
"joblib_ver = None\n",
"for dist in available_packages:\n",
" if dist.key == 'scikit-learn':\n",
" sklearn_ver = dist.version\n",
" elif dist.key == 'pandas':\n",
" pandas_ver = dist.version\n",
" elif dist.key == 'joblib':\n",
" joblib_ver = dist.version\n",
"sklearn_dep = 'scikit-learn'\n",
"pandas_dep = 'pandas'\n",
"joblib_dep = 'joblib'\n",
"if sklearn_ver:\n",
" sklearn_dep = 'scikit-learn=={}'.format(sklearn_ver)\n",
"if pandas_ver:\n",
" pandas_dep = 'pandas=={}'.format(pandas_ver)\n",
"if joblib_ver:\n",
" joblib_dep = 'joblib=={}'.format(joblib_ver)\n",
"# Specify CondaDependencies obj\n",
"# The CondaDependencies specifies the conda and pip packages that are installed in the environment\n",
"# the submitted job is run in. Note the remote environment(s) needs to be similar to the local\n",
"# environment, otherwise if a model is trained or deployed in a different environment this can\n",
"# cause errors. Please take extra care when specifying your dependencies in a production environment.\n",
"azureml_pip_packages.extend(['pyyaml', sklearn_dep, pandas_dep])\n",
"azureml_pip_packages.extend(['pyyaml', sklearn_dep, pandas_dep, joblib_dep])\n",
"run_config.environment.python.conda_dependencies = CondaDependencies.create(\n",
" python_version=python_version,\n",
" pip_packages=azureml_pip_packages)\n",
@@ -440,23 +446,29 @@
"available_packages = pkg_resources.working_set\n",
"sklearn_ver = None\n",
"pandas_ver = None\n",
"joblib_ver = None\n",
"for dist in available_packages:\n",
" if dist.key == 'scikit-learn':\n",
" sklearn_ver = dist.version\n",
" elif dist.key == 'pandas':\n",
" pandas_ver = dist.version\n",
" elif dist.key == 'joblib':\n",
" joblib_ver = dist.version\n",
"sklearn_dep = 'scikit-learn'\n",
"pandas_dep = 'pandas'\n",
"joblib_dep = 'joblib'\n",
"if sklearn_ver:\n",
" sklearn_dep = 'scikit-learn=={}'.format(sklearn_ver)\n",
"if pandas_ver:\n",
" pandas_dep = 'pandas=={}'.format(pandas_ver)\n",
"if joblib_ver:\n",
" joblib_dep = 'joblib=={}'.format(joblib_ver)\n",
"# Specify CondaDependencies obj\n",
"# The CondaDependencies specifies the conda and pip packages that are installed in the environment\n",
"# the submitted job is run in. Note the remote environment(s) needs to be similar to the local\n",
"# environment, otherwise if a model is trained or deployed in a different environment this can\n",
"# cause errors. Please take extra care when specifying your dependencies in a production environment.\n",
"azureml_pip_packages.extend(['pyyaml', sklearn_dep, pandas_dep])\n",
"azureml_pip_packages.extend(['pyyaml', sklearn_dep, pandas_dep, joblib_dep])\n",
"myenv = CondaDependencies.create(python_version=python_version, pip_packages=azureml_pip_packages)\n",
"\n",
"with open(\"myenv.yml\",\"w\") as f:\n",

View File

@@ -6,12 +6,14 @@ dependencies:
- flask
- flask-cors
- gevent>=1.3.6
- jinja2
- ipython
- matplotlib
- azureml-dataset-runtime
- azureml-core
- ipywidgets
- raiwidgets~=0.18.1
- raiwidgets~=0.21.0
- itsdangerous==2.0.1
- markupsafe<2.1.0
- scipy>=1.5.3
- protobuf==3.20.0
- jinja2==3.0.3

View File

@@ -3,3 +3,4 @@ dependencies:
- pip:
- azureml-sdk
- azureml-widgets
- protobuf==3.20.0

View File

@@ -1,3 +1,4 @@
# DisableDockerDetector "Disabled to unblock PRs until the owner can fix the file. Not used in any prod deployments - only as a documentation for the customers"
FROM rocker/tidyverse:4.0.0-ubuntu18.04
# Install python

View File

@@ -4,7 +4,7 @@ import os
import numpy as np
from utils import download_mnist
from datautils import download_mnist
import chainer
from chainer import backend

View File

@@ -2,7 +2,7 @@ import numpy as np
import os
import json
from utils import download_mnist
from datautils import download_mnist
from chainer import serializers, using_config, Variable, datasets
import chainer.functions as F

View File

@@ -210,7 +210,7 @@
"\n",
"shutil.copy('chainer_mnist.py', project_folder)\n",
"shutil.copy('chainer_score.py', project_folder)\n",
"shutil.copy('utils.py', project_folder)"
"shutil.copy('datautils.py', project_folder)"
]
},
{
@@ -554,7 +554,7 @@
"cd = CondaDependencies.create()\n",
"cd.add_conda_package('numpy')\n",
"cd.add_pip_package('chainer==5.1.0')\n",
"cd.add_pip_package(\"azureml-defaults\")\n",
"cd.add_pip_package(\"azureml-defaults==1.43.0\")\n",
"cd.add_pip_package(\"azureml-opendatasets\")\n",
"cd.save_to_file(base_directory='./', conda_file_path='myenv.yml')\n",
"\n",

View File

@@ -437,7 +437,8 @@
" - azureml-defaults\n",
" - tensorflow-gpu==2.0.0\n",
" - keras<=2.3.1\n",
" - matplotlib"
" - matplotlib\n",
" - protobuf==3.20.1"
]
},
{
@@ -989,6 +990,7 @@
"cd.add_conda_package('h5py<=2.10.0')\n",
"cd.add_conda_package('keras<=2.3.1')\n",
"cd.add_pip_package(\"azureml-defaults\")\n",
"cd.add_pip_package(\"protobuf==3.20.1\")\n",
"cd.save_to_file(base_directory='./', conda_file_path='myenv.yml')\n",
"\n",
"print(cd.serialize_to_string())"

View File

@@ -264,7 +264,7 @@
"- python=3.6.2\n",
"- pip=21.3.1\n",
"- pip:\n",
" - azureml-defaults\n",
" - azureml-defaults==1.43.0\n",
" - torch==1.6.0\n",
" - torchvision==0.7.0\n",
" - future==0.17.1\n",
@@ -283,7 +283,7 @@
"\n",
"# Specify a GPU base image\n",
"pytorch_env.docker.enabled = True\n",
"pytorch_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'"
"pytorch_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu18.04'"
]
},
{
@@ -539,7 +539,7 @@
"metadata": {},
"source": [
"## Deploy model as web service\n",
"Once you have your trained model, you can deploy the model on Azure. In this tutorial, we will deploy the model as a web service in [Azure Container Instances](https://docs.microsoft.com/en-us/azure/container-instances/) (ACI). For more information on deploying models using Azure ML, refer [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-deploy-and-where)."
"Once you have your trained model, you can deploy the model on Azure. In this tutorial, we will deploy the model as a web service in [Azure Container Instances](https://docs.microsoft.com/en-us/azure/container-instances/) (ACI). For more information on deploying models using Azure ML, refer [here](https://docs.microsoft.com/en-us/azure/machine-learning/v1/how-to-deploy-and-where)."
]
},
{

View File

@@ -943,6 +943,7 @@
"cd.add_conda_package('numpy')\n",
"cd.add_pip_package('tensorflow==2.2.0')\n",
"cd.add_pip_package(\"azureml-defaults\")\n",
"cd.add_pip_package(\"protobuf==3.20.1\")\n",
"cd.save_to_file(base_directory='./', conda_file_path='myenv.yml')\n",
"\n",
"print(cd.serialize_to_string())"

View File

@@ -8,8 +8,8 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
rm -rf /var/lib/apt/lists/* && \
rm -rf /usr/share/man/*
RUN conda install -y conda=4.12.0 python=3.7 && conda clean -ay
RUN pip install ray-on-aml==0.1.6 & \
RUN conda install -y conda=4.13.0 python=3.7 && conda clean -ay
RUN pip install ray-on-aml==0.2.1 & \
pip install --no-cache-dir \
azureml-defaults \
azureml-dataset-runtime[fuse,pandas] \

View File

@@ -1,3 +1,4 @@
# DisableDockerDetector "Disabled to unblock PRs until the owner can fix the file. Not used in any prod deployments - only as a documentation for the customers"
FROM akdmsft/particle-cpu
RUN conda install -c anaconda python=3.7

View File

@@ -8,8 +8,9 @@ dependencies:
- matplotlib
- azureml-dataset-runtime
- ipywidgets
- raiwidgets~=0.18.1
- raiwidgets~=0.21.0
- liac-arff
- packaging>=20.9
- itsdangerous==2.0.1
- markupsafe<2.1.0
- protobuf==3.20.0

View File

@@ -43,6 +43,7 @@
" 1. Logging numeric metrics\n",
" 1. Logging vectors\n",
" 1. Logging tables\n",
" 1. Logging when additional Metric Names are required\n",
" 1. Uploading files\n",
"1. [Analyzing results](#Analyzing-results)\n",
" 1. Tagging a run\n",
@@ -100,7 +101,7 @@
"\n",
"# Check core SDK version number\n",
"\n",
"print(\"This notebook was created using SDK version 1.42.0, you are currently running version\", azureml.core.VERSION)"
"print(\"This notebook was created using SDK version 1.45.0, you are currently running version\", azureml.core.VERSION)"
]
},
{
@@ -367,7 +368,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Logging for when more Metric Names are required\n",
"### Logging when additional Metric Names are required\n",
"\n",
"Limits on logging are internally enforced to ensure a smooth experience, however these can sometimes be limiting, particularly in terms of the limit on metric names.\n",
"\n",

View File

@@ -1,10 +1,11 @@
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.layers.normalization import BatchNormalization
from keras.utils import to_categorical
from keras.callbacks import Callback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import Callback
import numpy as np
import pandas as pd
@@ -64,8 +65,8 @@ model.add(Dense(128, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adam(),
model.compile(loss=categorical_crossentropy,
optimizer=Adam(),
metrics=['accuracy'])
# start an Azure ML run

View File

@@ -270,16 +270,19 @@
"%%writefile conda_dependencies.yml\n",
"\n",
"dependencies:\n",
"- python=3.6.2\n",
"- python=3.8\n",
"- pip==20.2.4\n",
"- pip:\n",
" - azureml-core\n",
" - azureml-dataset-runtime\n",
" - keras==2.4.3\n",
" - tensorflow==2.4.3\n",
" - keras==2.6\n",
" - tensorflow-gpu==2.6\n",
" - numpy\n",
" - scikit-learn\n",
" - pandas\n",
" - matplotlib"
" - matplotlib\n",
" - protobuf==3.20.1\n",
" - typing-extensions==4.3.0"
]
},
{

View File

@@ -106,6 +106,8 @@ Machine Learning notebook samples and encourage efficient retrieval of topics an
| [upload-fairness-dashboard](https://github.com/Azure/MachineLearningNotebooks/blob/master//contrib/fairness/upload-fairness-dashboard.ipynb) | | | | | | |
| [azure-ml-with-nvidia-rapids](https://github.com/Azure/MachineLearningNotebooks/blob/master//contrib/RAPIDS/azure-ml-with-nvidia-rapids.ipynb) | | | | | | |
| [auto-ml-continuous-retraining](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/automated-machine-learning/continuous-retraining/auto-ml-continuous-retraining.ipynb) | | | | | | |
| [codegen-for-autofeaturization](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/automated-machine-learning/experimental/autofeaturization-codegen/codegen-for-autofeaturization.ipynb) | | | | | | |
| [custom-model-training-from-autofeaturization-run](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/automated-machine-learning/experimental/autofeaturization-custom-model-training/custom-model-training-from-autofeaturization-run.ipynb) | | | | | | |
| [auto-ml-regression-model-proxy](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/automated-machine-learning/experimental/regression-model-proxy/auto-ml-regression-model-proxy.ipynb) | | | | | | |
| [auto-ml-forecasting-backtest-many-models](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/automated-machine-learning/forecasting-backtest-many-models/auto-ml-forecasting-backtest-many-models.ipynb) | | | | | | |
| [auto-ml-forecasting-energy-demand](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/automated-machine-learning/forecasting-energy-demand/auto-ml-forecasting-energy-demand.ipynb) | | | | | | |

View File

@@ -102,7 +102,7 @@
"source": [
"import azureml.core\n",
"\n",
"print(\"This notebook was created using version 1.42.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.45.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},

View File

@@ -211,7 +211,7 @@
"metadata": {},
"source": [
"## View Experiment\n",
"In the left-hand menu in Azure Machine Learning Studio, select __Experiments__ and then select your experiment (azure-ml-in10-mins-tutorial). An experiment is a grouping of many runs from a specified script or piece of code. Information for the run is stored under that experiment. If the name doesn't exist when you submit an experiment, if you select your run you will see various tabs containing metrics, logs, explanations, etc.\n",
"In the left-hand menu in Azure Machine Learning Studio, select __Jobs__ and then select your experiment (azure-ml-in10-mins-tutorial). An experiment is a grouping of many runs from a specified script or piece of code. Information for the run is stored under that experiment. If the name doesn't exist when you submit an experiment, if you select your run you will see various tabs containing metrics, logs, explanations, etc.\n",
"\n",
"## Version control your models with the model registry\n",
"\n",

View File

@@ -151,8 +151,7 @@
"# use a curated environment that has already been built for you\n",
"\n",
"env = Environment.get(workspace=ws, \n",
" name=\"AzureML-Scikit-learn0.24-Cuda11-OpenMpi4.1.0-py36\", \n",
" version=1)"
" name=\"AzureML-sklearn-0.24-ubuntu18.04-py37-cpu\")"
]
},
{
@@ -222,7 +221,7 @@
"source": [
"### Submit the job\n",
"\n",
"Run the experiment by submitting the ScriptRunConfig object. After this there are many options for monitoring your run. Once submitted, you can either navigate to the experiment \"get-started-with-jobsubmission-tutorial\" in the left menu item __Experiments__ to monitor the run, or you can monitor the run inline as the `run.wait_for_completion(show_output=True)` will stream the logs of the run. You will see that the environment is built for you to ensure reproducibility - this adds a couple of minutes to the run time. On subsequent runs, the environment is re-used making the runtime shorter."
"Run the experiment by submitting the ScriptRunConfig object. After this there are many options for monitoring your run. Once submitted, you can either navigate to the experiment \"get-started-with-jobsubmission-tutorial\" in the left menu item __Jobs__ to monitor the run, or you can monitor the run inline as the `run.wait_for_completion(show_output=True)` will stream the logs of the run. You will see that the environment is built for you to ensure reproducibility - this adds a couple of minutes to the run time. On subsequent runs, the environment is re-used making the runtime shorter."
]
},
{