mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-19 17:17:04 -05:00
948 lines
34 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
|
|
"\n",
|
|
"Licensed under the MIT License."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Automated Machine Learning\n",
|
|
"_**Classification with Deployment using a Bank Marketing Dataset**_\n",
|
|
"\n",
|
|
"## Contents\n",
|
|
"1. [Introduction](#Introduction)\n",
|
|
"1. [Setup](#Setup)\n",
|
|
"1. [Train](#Train)\n",
|
|
"1. [Results](#Results)\n",
|
|
"1. [Deploy](#Deploy)\n",
|
|
"1. [Test](#Test)\n",
|
|
"1. [Acknowledgements](#Acknowledgements)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Introduction\n",
|
|
"\n",
|
|
"In this example we use the UCI Bank Marketing dataset to showcase how you can use AutoML for a classification problem and deploy it to an Azure Container Instance (ACI). The classification goal is to predict if the client will subscribe to a term deposit with the bank.\n",
|
|
"\n",
|
|
"If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the [configuration](../../../configuration.ipynb) notebook first if you haven't already to establish your connection to the AzureML Workspace. \n",
|
|
"\n",
|
|
"Please find the ONNX related documentations [here](https://github.com/onnx/onnx).\n",
|
|
"\n",
|
|
"In this notebook you will learn how to:\n",
|
|
"1. Create an experiment using an existing workspace.\n",
|
|
"2. Configure AutoML using `AutoMLConfig`.\n",
|
|
"3. Train the model using local compute with ONNX compatible config on.\n",
|
|
"4. Explore the results, featurization transparency options and save the ONNX model\n",
|
|
"5. Inference with the ONNX model.\n",
|
|
"6. Register the model.\n",
|
|
"7. Create a container image.\n",
|
|
"8. Create an Azure Container Instance (ACI) service.\n",
|
|
"9. Test the ACI service.\n",
|
|
"\n",
|
|
"In addition this notebook showcases the following features\n",
|
|
"- **Blocking** certain pipelines\n",
|
|
"- Specifying **target metrics** to indicate stopping criteria\n",
|
|
"- Handling **missing data** in the input"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Setup\n",
|
|
"\n",
|
|
"As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import logging\n",
|
|
"\n",
|
|
"from matplotlib import pyplot as plt\n",
|
|
"import pandas as pd\n",
|
|
"import os\n",
|
|
"\n",
|
|
"import azureml.core\n",
|
|
"from azureml.core.experiment import Experiment\n",
|
|
"from azureml.core.workspace import Workspace\n",
|
|
"from azureml.automl.core.featurization import FeaturizationConfig\n",
|
|
"from azureml.core.dataset import Dataset\n",
|
|
"from azureml.train.automl import AutoMLConfig\n",
|
|
"from azureml.interpret import ExplanationClient"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"This sample notebook may use features that are not available in previous versions of the Azure ML SDK."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(\"This notebook was created using version 1.21.0 of the Azure ML SDK\")\n",
|
|
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Accessing the Azure ML workspace requires authentication with Azure.\n",
|
|
"\n",
|
|
"The default authentication is interactive authentication using the default tenant. Executing the `ws = Workspace.from_config()` line in the cell below will prompt for authentication the first time that it is run.\n",
|
|
"\n",
|
|
"If you have multiple Azure tenants, you can specify the tenant by replacing the `ws = Workspace.from_config()` line in the cell below with the following:\n",
|
|
"\n",
|
|
"```\n",
|
|
"from azureml.core.authentication import InteractiveLoginAuthentication\n",
|
|
"auth = InteractiveLoginAuthentication(tenant_id = 'mytenantid')\n",
|
|
"ws = Workspace.from_config(auth = auth)\n",
|
|
"```\n",
|
|
"\n",
|
|
"If you need to run in an environment where interactive login is not possible, you can use Service Principal authentication by replacing the `ws = Workspace.from_config()` line in the cell below with the following:\n",
|
|
"\n",
|
|
"```\n",
|
|
"from azureml.core.authentication import ServicePrincipalAuthentication\n",
|
|
"auth = auth = ServicePrincipalAuthentication('mytenantid', 'myappid', 'mypassword')\n",
|
|
"ws = Workspace.from_config(auth = auth)\n",
|
|
"```\n",
|
|
"For more details, see [aka.ms/aml-notebook-auth](http://aka.ms/aml-notebook-auth)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"ws = Workspace.from_config()\n",
|
|
"\n",
|
|
"# choose a name for experiment\n",
|
|
"experiment_name = 'automl-classification-bmarketing-all'\n",
|
|
"\n",
|
|
"experiment=Experiment(ws, experiment_name)\n",
|
|
"\n",
|
|
"output = {}\n",
|
|
"output['Subscription ID'] = ws.subscription_id\n",
|
|
"output['Workspace'] = ws.name\n",
|
|
"output['Resource Group'] = ws.resource_group\n",
|
|
"output['Location'] = ws.location\n",
|
|
"output['Experiment Name'] = experiment.name\n",
|
|
"pd.set_option('display.max_colwidth', -1)\n",
|
|
"outputDf = pd.DataFrame(data = output, index = [''])\n",
|
|
"outputDf.T"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Create or Attach existing AmlCompute\n",
|
|
"You will need to create a compute target for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.\n",
|
|
"#### Creation of AmlCompute takes approximately 5 minutes. \n",
|
|
"If the AmlCompute with that name is already in your workspace this code will skip the creation process.\n",
|
|
"As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
|
|
"from azureml.core.compute_target import ComputeTargetException\n",
|
|
"\n",
|
|
"# Choose a name for your CPU cluster\n",
|
|
"cpu_cluster_name = \"cpu-cluster-4\"\n",
|
|
"\n",
|
|
"# Verify that cluster does not exist already\n",
|
|
"try:\n",
|
|
" compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)\n",
|
|
" print('Found existing cluster, use it.')\n",
|
|
"except ComputeTargetException:\n",
|
|
" compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',\n",
|
|
" max_nodes=6)\n",
|
|
" compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)\n",
|
|
"\n",
|
|
"compute_target.wait_for_completion(show_output=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Load Data\n",
|
|
"\n",
|
|
"Leverage azure compute to load the bank marketing dataset as a Tabular Dataset into the dataset variable. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Training Data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"data = pd.read_csv(\"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv\")\n",
|
|
"data.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Add missing values in 75% of the lines.\n",
|
|
"import numpy as np\n",
|
|
"\n",
|
|
"missing_rate = 0.75\n",
|
|
"n_missing_samples = int(np.floor(data.shape[0] * missing_rate))\n",
|
|
"missing_samples = np.hstack((np.zeros(data.shape[0] - n_missing_samples, dtype=np.bool), np.ones(n_missing_samples, dtype=np.bool)))\n",
|
|
"rng = np.random.RandomState(0)\n",
|
|
"rng.shuffle(missing_samples)\n",
|
|
"missing_features = rng.randint(0, data.shape[1], n_missing_samples)\n",
|
|
"data.values[np.where(missing_samples)[0], missing_features] = np.nan"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"if not os.path.isdir('data'):\n",
|
|
" os.mkdir('data')\n",
|
|
" \n",
|
|
"# Save the train data to a csv to be uploaded to the datastore\n",
|
|
"pd.DataFrame(data).to_csv(\"data/train_data.csv\", index=False)\n",
|
|
"\n",
|
|
"ds = ws.get_default_datastore()\n",
|
|
"ds.upload(src_dir='./data', target_path='bankmarketing', overwrite=True, show_progress=True)\n",
|
|
"\n",
|
|
" \n",
|
|
"\n",
|
|
"# Upload the training data as a tabular dataset for access during training on remote compute\n",
|
|
"train_data = Dataset.Tabular.from_delimited_files(path=ds.path('bankmarketing/train_data.csv'))\n",
|
|
"label = \"y\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Validation Data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"validation_data = \"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_validate.csv\"\n",
|
|
"validation_dataset = Dataset.Tabular.from_delimited_files(validation_data)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Test Data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"test_data = \"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_test.csv\"\n",
|
|
"test_dataset = Dataset.Tabular.from_delimited_files(test_data)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Train\n",
|
|
"\n",
|
|
"Instantiate a AutoMLConfig object. This defines the settings and data used to run the experiment.\n",
|
|
"\n",
|
|
"|Property|Description|\n",
|
|
"|-|-|\n",
|
|
"|**task**|classification or regression or forecasting|\n",
|
|
"|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|\n",
|
|
"|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|\n",
|
|
"|**blocked_models** | *List* of *strings* indicating machine learning algorithms for AutoML to avoid in this run. <br><br> Allowed values for **Classification**<br><i>LogisticRegression</i><br><i>SGD</i><br><i>MultinomialNaiveBayes</i><br><i>BernoulliNaiveBayes</i><br><i>SVM</i><br><i>LinearSVM</i><br><i>KNN</i><br><i>DecisionTree</i><br><i>RandomForest</i><br><i>ExtremeRandomTrees</i><br><i>LightGBM</i><br><i>GradientBoosting</i><br><i>TensorFlowDNN</i><br><i>TensorFlowLinearClassifier</i><br><br>Allowed values for **Regression**<br><i>ElasticNet</i><br><i>GradientBoosting</i><br><i>DecisionTree</i><br><i>KNN</i><br><i>LassoLars</i><br><i>SGD</i><br><i>RandomForest</i><br><i>ExtremeRandomTrees</i><br><i>LightGBM</i><br><i>TensorFlowLinearRegressor</i><br><i>TensorFlowDNN</i><br><br>Allowed values for **Forecasting**<br><i>ElasticNet</i><br><i>GradientBoosting</i><br><i>DecisionTree</i><br><i>KNN</i><br><i>LassoLars</i><br><i>SGD</i><br><i>RandomForest</i><br><i>ExtremeRandomTrees</i><br><i>LightGBM</i><br><i>TensorFlowLinearRegressor</i><br><i>TensorFlowDNN</i><br><i>Arima</i><br><i>Prophet</i>|\n",
|
|
"|**allowed_models** | *List* of *strings* indicating machine learning algorithms for AutoML to use in this run. Same values listed above for **blocked_models** allowed for **allowed_models**.|\n",
|
|
"|**experiment_exit_score**| Value indicating the target for *primary_metric*. <br>Once the target is surpassed the run terminates.|\n",
|
|
"|**experiment_timeout_hours**| Maximum amount of time in hours that all iterations combined can take before the experiment terminates.|\n",
|
|
"|**enable_early_stopping**| Flag to enble early termination if the score is not improving in the short term.|\n",
|
|
"|**featurization**| 'auto' / 'off' Indicator for whether featurization step should be done automatically or not. Note: If the input data is sparse, featurization cannot be turned on.|\n",
|
|
"|**n_cross_validations**|Number of cross validation splits.|\n",
|
|
"|**training_data**|Input dataset, containing both features and label column.|\n",
|
|
"|**label_column_name**|The name of the label column.|\n",
|
|
"\n",
|
|
"**_You can find more information about primary metrics_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#primary-metric)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"automl_settings = {\n",
|
|
" \"experiment_timeout_hours\" : 0.3,\n",
|
|
" \"enable_early_stopping\" : True,\n",
|
|
" \"iteration_timeout_minutes\": 5,\n",
|
|
" \"max_concurrent_iterations\": 4,\n",
|
|
" \"max_cores_per_iteration\": -1,\n",
|
|
" #\"n_cross_validations\": 2,\n",
|
|
" \"primary_metric\": 'AUC_weighted',\n",
|
|
" \"featurization\": 'auto',\n",
|
|
" \"verbosity\": logging.INFO,\n",
|
|
"}\n",
|
|
"\n",
|
|
"automl_config = AutoMLConfig(task = 'classification',\n",
|
|
" debug_log = 'automl_errors.log',\n",
|
|
" compute_target=compute_target,\n",
|
|
" experiment_exit_score = 0.9984,\n",
|
|
" blocked_models = ['KNN','LinearSVM'],\n",
|
|
" enable_onnx_compatible_models=True,\n",
|
|
" training_data = train_data,\n",
|
|
" label_column_name = label,\n",
|
|
" validation_data = validation_dataset,\n",
|
|
" **automl_settings\n",
|
|
" )"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while. Validation errors and current status will be shown when setting `show_output=True` and the execution will be synchronous."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"remote_run = experiment.submit(automl_config, show_output = False)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"remote_run"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Run the following cell to access previous runs. Uncomment the cell below and update the run_id."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#from azureml.train.automl.run import AutoMLRun\n",
|
|
"#remote_run = AutoMLRun(experiment=experiment, run_id='<run_ID_goes_here')\n",
|
|
"#remote_run"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Wait for the remote run to complete\n",
|
|
"remote_run.wait_for_completion()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"best_run_customized, fitted_model_customized = remote_run.get_output()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Transparency\n",
|
|
"\n",
|
|
"View updated featurization summary"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"custom_featurizer = fitted_model_customized.named_steps['datatransformer']\n",
|
|
"df = custom_featurizer.get_featurization_summary()\n",
|
|
"pd.DataFrame(data=df)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Set `is_user_friendly=False` to get a more detailed summary for the transforms being applied."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = custom_featurizer.get_featurization_summary(is_user_friendly=False)\n",
|
|
"pd.DataFrame(data=df)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = custom_featurizer.get_stats_feature_type_summary()\n",
|
|
"pd.DataFrame(data=df)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Results"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.widgets import RunDetails\n",
|
|
"RunDetails(remote_run).show() "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Retrieve the Best Model's explanation\n",
|
|
"Retrieve the explanation from the best_run which includes explanations for engineered features and raw features. Make sure that the run for generating explanations for the best model is completed."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Wait for the best model explanation run to complete\n",
|
|
"from azureml.core.run import Run\n",
|
|
"model_explainability_run_id = remote_run.id + \"_\" + \"ModelExplain\"\n",
|
|
"print(model_explainability_run_id)\n",
|
|
"model_explainability_run = Run(experiment=experiment, run_id=model_explainability_run_id)\n",
|
|
"model_explainability_run.wait_for_completion()\n",
|
|
"\n",
|
|
"# Get the best run object\n",
|
|
"best_run, fitted_model = remote_run.get_output()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Download engineered feature importance from artifact store\n",
|
|
"You can use ExplanationClient to download the engineered feature explanations from the artifact store of the best_run."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"client = ExplanationClient.from_run(best_run)\n",
|
|
"engineered_explanations = client.download_model_explanation(raw=False)\n",
|
|
"exp_data = engineered_explanations.get_feature_importance_dict()\n",
|
|
"exp_data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Download raw feature importance from artifact store\n",
|
|
"You can use ExplanationClient to download the raw feature explanations from the artifact store of the best_run."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"client = ExplanationClient.from_run(best_run)\n",
|
|
"engineered_explanations = client.download_model_explanation(raw=True)\n",
|
|
"exp_data = engineered_explanations.get_feature_importance_dict()\n",
|
|
"exp_data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Retrieve the Best ONNX Model\n",
|
|
"\n",
|
|
"Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The Model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.\n",
|
|
"\n",
|
|
"Set the parameter return_onnx_model=True to retrieve the best ONNX model, instead of the Python model."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"best_run, onnx_mdl = remote_run.get_output(return_onnx_model=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Save the best ONNX model"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.automl.runtime.onnx_convert import OnnxConverter\n",
|
|
"onnx_fl_path = \"./best_model.onnx\"\n",
|
|
"OnnxConverter.save_onnx_model(onnx_mdl, onnx_fl_path)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Predict with the ONNX model, using onnxruntime package"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import sys\n",
|
|
"import json\n",
|
|
"from azureml.automl.core.onnx_convert import OnnxConvertConstants\n",
|
|
"from azureml.train.automl import constants\n",
|
|
"\n",
|
|
"if sys.version_info < OnnxConvertConstants.OnnxIncompatiblePythonVersion:\n",
|
|
" python_version_compatible = True\n",
|
|
"else:\n",
|
|
" python_version_compatible = False\n",
|
|
"\n",
|
|
"import onnxruntime\n",
|
|
"from azureml.automl.runtime.onnx_convert import OnnxInferenceHelper\n",
|
|
"\n",
|
|
"def get_onnx_res(run):\n",
|
|
" res_path = 'onnx_resource.json'\n",
|
|
" run.download_file(name=constants.MODEL_RESOURCE_PATH_ONNX, output_file_path=res_path)\n",
|
|
" with open(res_path) as f:\n",
|
|
" onnx_res = json.load(f)\n",
|
|
" return onnx_res\n",
|
|
"\n",
|
|
"if python_version_compatible:\n",
|
|
" test_df = test_dataset.to_pandas_dataframe()\n",
|
|
" mdl_bytes = onnx_mdl.SerializeToString()\n",
|
|
" onnx_res = get_onnx_res(best_run)\n",
|
|
"\n",
|
|
" onnxrt_helper = OnnxInferenceHelper(mdl_bytes, onnx_res)\n",
|
|
" pred_onnx, pred_prob_onnx = onnxrt_helper.predict(test_df)\n",
|
|
"\n",
|
|
" print(pred_onnx)\n",
|
|
" print(pred_prob_onnx)\n",
|
|
"else:\n",
|
|
" print('Please use Python version 3.6 or 3.7 to run the inference helper.')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Deploy\n",
|
|
"\n",
|
|
"### Retrieve the Best Model\n",
|
|
"\n",
|
|
"Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Widget for Monitoring Runs\n",
|
|
"\n",
|
|
"The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n",
|
|
"\n",
|
|
"**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"best_run, fitted_model = remote_run.get_output()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"model_name = best_run.properties['model_name']\n",
|
|
"\n",
|
|
"script_file_name = 'inference/score.py'\n",
|
|
"\n",
|
|
"best_run.download_file('outputs/scoring_file_v_1_0_0.py', 'inference/score.py')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Register the Fitted Model for Deployment\n",
|
|
"If neither `metric` nor `iteration` are specified in the `register_model` call, the iteration with the best primary metric is registered."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"description = 'AutoML Model trained on bank marketing data to predict if a client will subscribe to a term deposit'\n",
|
|
"tags = None\n",
|
|
"model = remote_run.register_model(model_name = model_name, description = description, tags = tags)\n",
|
|
"\n",
|
|
"print(remote_run.model_id) # This will be written to the script file later in the notebook."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Deploy the model as a Web Service on Azure Container Instance"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.model import InferenceConfig\n",
|
|
"from azureml.core.webservice import AciWebservice\n",
|
|
"from azureml.core.webservice import Webservice\n",
|
|
"from azureml.core.model import Model\n",
|
|
"from azureml.core.environment import Environment\n",
|
|
"\n",
|
|
"inference_config = InferenceConfig(entry_script=script_file_name)\n",
|
|
"\n",
|
|
"aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1, \n",
|
|
" memory_gb = 1, \n",
|
|
" tags = {'area': \"bmData\", 'type': \"automl_classification\"}, \n",
|
|
" description = 'sample service for Automl Classification')\n",
|
|
"\n",
|
|
"aci_service_name = 'automl-sample-bankmarketing-all'\n",
|
|
"print(aci_service_name)\n",
|
|
"aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)\n",
|
|
"aci_service.wait_for_deployment(True)\n",
|
|
"print(aci_service.state)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Get Logs from a Deployed Web Service\n",
|
|
"\n",
|
|
"Gets logs from a deployed web service."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#aci_service.get_logs()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Test\n",
|
|
"\n",
|
|
"Now that the model is trained, run the test data through the trained model to get the predicted values. This calls the ACI web service to do the prediction.\n",
|
|
"\n",
|
|
"Note that the JSON passed to the ACI web service is an array of rows of data. Each row should either be an array of values in the same order that was used for training or a dictionary where the keys are the same as the column names used for training. The example below uses dictionary rows."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Load the bank marketing datasets.\n",
|
|
"from numpy import array"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"X_test = test_dataset.drop_columns(columns=['y'])\n",
|
|
"y_test = test_dataset.keep_columns(columns=['y'], validate=True)\n",
|
|
"test_dataset.take(5).to_pandas_dataframe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"X_test = X_test.to_pandas_dataframe()\n",
|
|
"y_test = y_test.to_pandas_dataframe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import json\n",
|
|
"import requests\n",
|
|
"\n",
|
|
"X_test_json = X_test.to_json(orient='records')\n",
|
|
"data = \"{\\\"data\\\": \" + X_test_json +\"}\"\n",
|
|
"headers = {'Content-Type': 'application/json'}\n",
|
|
"\n",
|
|
"resp = requests.post(aci_service.scoring_uri, data, headers=headers)\n",
|
|
"\n",
|
|
"y_pred = json.loads(json.loads(resp.text))['result']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"actual = array(y_test)\n",
|
|
"actual = actual[:,0]\n",
|
|
"print(len(y_pred), \" \", len(actual))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Calculate metrics for the prediction\n",
|
|
"\n",
|
|
"Now visualize the data as a confusion matrix that compared the predicted values against the actual values.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%matplotlib notebook\n",
|
|
"from sklearn.metrics import confusion_matrix\n",
|
|
"import numpy as np\n",
|
|
"import itertools\n",
|
|
"\n",
|
|
"cf =confusion_matrix(actual,y_pred)\n",
|
|
"plt.imshow(cf,cmap=plt.cm.Blues,interpolation='nearest')\n",
|
|
"plt.colorbar()\n",
|
|
"plt.title('Confusion Matrix')\n",
|
|
"plt.xlabel('Predicted')\n",
|
|
"plt.ylabel('Actual')\n",
|
|
"class_labels = ['no','yes']\n",
|
|
"tick_marks = np.arange(len(class_labels))\n",
|
|
"plt.xticks(tick_marks,class_labels)\n",
|
|
"plt.yticks([-0.5,0,1,1.5],['','no','yes',''])\n",
|
|
"# plotting text value inside cells\n",
|
|
"thresh = cf.max() / 2.\n",
|
|
"for i,j in itertools.product(range(cf.shape[0]),range(cf.shape[1])):\n",
|
|
" plt.text(j,i,format(cf[i,j],'d'),horizontalalignment='center',color='white' if cf[i,j] >thresh else 'black')\n",
|
|
"plt.show()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Delete a Web Service\n",
|
|
"\n",
|
|
"Deletes the specified web service."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"aci_service.delete()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Acknowledgements"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"This Bank Marketing dataset is made available under the Creative Commons (CCO: Public Domain) License: https://creativecommons.org/publicdomain/zero/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: https://creativecommons.org/publicdomain/zero/1.0/ and is available at: https://www.kaggle.com/janiobachmann/bank-marketing-dataset .\n",
|
|
"\n",
|
|
"_**Acknowledgements**_\n",
|
|
"This data set is originally available within the UCI Machine Learning Database: https://archive.ics.uci.edu/ml/datasets/bank+marketing\n",
|
|
"\n",
|
|
"[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "ratanase"
|
|
}
|
|
],
|
|
"category": "tutorial",
|
|
"compute": [
|
|
"AML"
|
|
],
|
|
"datasets": [
|
|
"Bankmarketing"
|
|
],
|
|
"deployment": [
|
|
"ACI"
|
|
],
|
|
"exclude_from_index": false,
|
|
"framework": [
|
|
"None"
|
|
],
|
|
"friendly_name": "Automated ML run with basic edition features.",
|
|
"index_order": 5,
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.7"
|
|
},
|
|
"tags": [
|
|
"featurization",
|
|
"explainability",
|
|
"remote_run",
|
|
"AutomatedML"
|
|
],
|
|
"task": "Classification"
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}