mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-19 17:17:04 -05:00
526 lines
17 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
|
|
"Licensed under the MIT License."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Azure Machine Learning Pipeline with AutoMLStep\n",
|
|
"This notebook demonstrates the use of AutoMLStep in Azure Machine Learning Pipeline."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Introduction\n",
|
|
"In this example, we show how to use an AzureML Dataset to load data for AutoML in an AML Pipeline.\n",
|
|
"\n",
|
|
"If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook. Also see the [Automated ML setup-using-a-local-conda-environment](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning#setup-using-a-local-conda-environment) section to set up the environment.\n",
|
|
"\n",
|
|
"In this notebook you will learn how to:\n",
|
|
"1. Create an `Experiment` in an existing `Workspace`.\n",
|
|
"2. Create or attach an existing `AmlCompute` to a workspace.\n",
|
|
"3. Define data loading in a `TabularDataset`.\n",
|
|
"4. Configure AutoML using `AutoMLConfig`.\n",
|
|
"5. Use `AutoMLStep`.\n",
|
|
"6. Train the model using `AmlCompute`.\n",
|
|
"7. Explore the results.\n",
|
|
"8. Test the best fitted model."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Azure Machine Learning and Pipeline SDK-specific imports"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import logging\n",
|
|
"import os\n",
|
|
"import csv\n",
|
|
"\n",
|
|
"from matplotlib import pyplot as plt\n",
|
|
"import numpy as np\n",
|
|
"import pandas as pd\n",
|
|
"from sklearn import datasets\n",
|
|
"import pkg_resources\n",
|
|
"\n",
|
|
"import azureml.core\n",
|
|
"from azureml.core.experiment import Experiment\n",
|
|
"from azureml.core.workspace import Workspace\n",
|
|
"from azureml.train.automl import AutoMLConfig\n",
|
|
"from azureml.core.dataset import Dataset\n",
|
|
"\n",
|
|
"from azureml.pipeline.steps import AutoMLStep\n",
|
|
"\n",
|
|
"# Check core SDK version number\n",
|
|
"print(\"SDK version:\", azureml.core.VERSION)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Initialize Workspace\n",
|
|
"Initialize a workspace object from persisted configuration. Make sure the config file is present at .\\config.json"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"ws = Workspace.from_config()\n",
|
|
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Create an Azure ML experiment\n",
|
|
"Let's create an experiment named \"automlstep-sample\" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.\n",
|
|
"\n",
|
|
"The best practice is to use a separate folder for each step's scripts and their dependent files, and to specify that folder as the `source_directory` for the step. This helps reduce the size of the snapshot created for the step (only the specific folder is snapshotted). Since a change to any file in the `source_directory` triggers a re-upload of the snapshot, this also lets the step be reused when nothing in its `source_directory` has changed."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Choose a name for the run history container in the workspace.\n",
|
|
"experiment_name = 'automlstep-sample'\n",
|
|
"project_folder = './project'\n",
|
|
"\n",
|
|
"experiment = Experiment(ws, experiment_name)\n",
|
|
"experiment"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Create or Attach an AmlCompute cluster\n",
|
|
"You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you create an `AmlCompute` cluster, or reuse an existing one with the same name, as your training compute resource."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.compute import AmlCompute\n",
|
|
"from azureml.core.compute import ComputeTarget\n",
|
|
"from azureml.core.compute_target import ComputeTargetException\n",
|
|
"\n",
|
|
"# Choose a name for your CPU cluster\n",
|
|
"amlcompute_cluster_name = \"cpu-cluster\"\n",
|
|
"\n",
|
|
"# Verify that cluster does not exist already\n",
|
|
"try:\n",
|
|
" compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)\n",
|
|
" print('Found existing cluster, use it.')\n",
|
|
"except ComputeTargetException:\n",
|
|
" compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',# for GPU, use \"STANDARD_NC6\"\n",
|
|
" #vm_priority = 'lowpriority', # optional\n",
|
|
" max_nodes=4)\n",
|
|
" compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)\n",
|
|
"\n",
|
|
"compute_target.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 10)\n",
|
|
"# For a more detailed view of current AmlCompute status, use get_status()."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Try to load the dataset from the Workspace. Otherwise, create it from the file\n",
|
|
"found = False\n",
|
|
"key = \"Crime-Dataset\"\n",
|
|
"description_text = \"Crime Dataset (used in the aml-pipelines-with-automated-machine-learning-step.ipynb notebook)\"\n",
|
|
"\n",
|
|
"if key in ws.datasets.keys(): \n",
|
|
" found = True\n",
|
|
" dataset = ws.datasets[key] \n",
|
|
"\n",
|
|
"if not found:\n",
|
|
" # Create AML Dataset and register it into Workspace\n",
|
|
" # The data referenced here is a 1MB simple random sample of the Chicago Crime data.\n",
|
|
" example_data = 'https://dprepdata.blob.core.windows.net/demo/crime0-random.csv'\n",
|
|
" dataset = Dataset.Tabular.from_delimited_files(example_data)\n",
|
|
" dataset = dataset.drop_columns(['FBI Code'])\n",
|
|
" \n",
|
|
" #Register Dataset in Workspace\n",
|
|
" dataset = dataset.register(workspace=ws,\n",
|
|
" name=key,\n",
|
|
" description=description_text)\n",
|
|
"\n",
|
|
"\n",
|
|
"df = dataset.to_pandas_dataframe()\n",
|
|
"df.describe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Review the Dataset Result\n",
|
|
"\n",
|
|
"You can peek at the result of a TabularDataset over any range using `skip(i)` and `take(j).to_pandas_dataframe()`. Doing so evaluates only `j` records for all the steps in the TabularDataset, which makes it fast even against large datasets.\n",
|
|
"\n",
|
|
"`TabularDataset` objects are composed of a list of transformation steps (optional)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dataset.take(5).to_pandas_dataframe()"
|
|
]
|
|
},
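{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the `skip(i)` and `take(j)` pattern described above, the cell below reads a small window from the middle of the dataset instead of its first rows. It assumes the `dataset` object created in the Data section; the offset `10` and the window size `5` are arbitrary values chosen only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Skip the first 10 records, then materialize only the next 5 as a pandas DataFrame.\n",
"# Assumes `dataset` is the TabularDataset created in the Data section above.\n",
"dataset.skip(10).take(5).to_pandas_dataframe()"
]
},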
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Train\n",
|
|
"This creates a general AutoML settings object."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"automl_settings = {\n",
|
|
" \"experiment_timeout_minutes\": 20,\n",
|
|
" \"max_concurrent_iterations\": 4,\n",
|
|
" \"primary_metric\" : 'AUC_weighted'\n",
|
|
"}\n",
|
|
"automl_config = AutoMLConfig(compute_target=compute_target,\n",
|
|
" task = \"classification\",\n",
|
|
" training_data=dataset,\n",
|
|
" label_column_name=\"Primary Type\", \n",
|
|
" path = project_folder,\n",
|
|
" enable_early_stopping= True,\n",
|
|
" featurization= 'auto',\n",
|
|
" debug_log = \"automl_errors.log\",\n",
|
|
" **automl_settings\n",
|
|
" )"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Create Pipeline and AutoMLStep\n",
|
|
"\n",
|
|
"You can define outputs for the AutoMLStep using TrainingOutput."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.pipeline.core import PipelineData, TrainingOutput\n",
|
|
"\n",
|
|
"ds = ws.get_default_datastore()\n",
|
|
"metrics_output_name = 'metrics_output'\n",
|
|
"best_model_output_name = 'best_model_output'\n",
|
|
"\n",
|
|
"metrics_data = PipelineData(name='metrics_data',\n",
|
|
" datastore=ds,\n",
|
|
" pipeline_output_name=metrics_output_name,\n",
|
|
" training_output=TrainingOutput(type='Metrics'))\n",
|
|
"model_data = PipelineData(name='model_data',\n",
|
|
" datastore=ds,\n",
|
|
" pipeline_output_name=best_model_output_name,\n",
|
|
" training_output=TrainingOutput(type='Model'))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Create an AutoMLStep."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": [
|
|
"automlstep-remarks-sample1"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"automl_step = AutoMLStep(\n",
|
|
" name='automl_module',\n",
|
|
" automl_config=automl_config,\n",
|
|
" outputs=[metrics_data, model_data],\n",
|
|
" allow_reuse=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": [
|
|
"automlstep-remarks-sample2"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.pipeline.core import Pipeline\n",
|
|
"pipeline = Pipeline(\n",
|
|
" description=\"pipeline_with_automlstep\",\n",
|
|
" workspace=ws, \n",
|
|
" steps=[automl_step])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"pipeline_run = experiment.submit(pipeline)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.widgets import RunDetails\n",
|
|
"RunDetails(pipeline_run).show()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"pipeline_run.wait_for_completion()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Examine Results\n",
|
|
"\n",
|
|
"### Retrieve the metrics of all child runs\n",
|
|
"The outputs of the run above can be used as inputs to other steps in a pipeline. In this tutorial, we examine the outputs by retrieving the output data and running some tests."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)\n",
|
|
"num_file_downloaded = metrics_output.download('.', show_progress=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import json\n",
|
|
"with open(metrics_output._path_on_datastore) as f:\n",
|
|
" metrics_output_result = f.read()\n",
|
|
" \n",
|
|
"deserialized_metrics_output = json.loads(metrics_output_result)\n",
|
|
"df = pd.DataFrame(deserialized_metrics_output)\n",
|
|
"df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Retrieve the Best Model"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Retrieve best model from Pipeline Run\n",
|
|
"best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)\n",
|
|
"num_file_downloaded = best_model_output.download('.', show_progress=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import pickle\n",
|
|
"\n",
|
|
"with open(best_model_output._path_on_datastore, \"rb\" ) as f:\n",
|
|
" best_model = pickle.load(f)\n",
|
|
"best_model"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"best_model.steps"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Test the Model\n",
|
|
"#### Load Test Data\n",
|
|
"The test data should go through the same preparation steps as the training data. Otherwise, it might fail at the preprocessing step."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dataset_test = Dataset.Tabular.from_delimited_files(path='https://dprepdata.blob.core.windows.net/demo/crime0-test.csv')\n",
|
|
"df_test = dataset_test.to_pandas_dataframe()\n",
|
|
"df_test = df_test[pd.notnull(df_test['Primary Type'])]\n",
|
|
"\n",
|
|
"y_test = df_test['Primary Type']\n",
|
|
"X_test = df_test.drop(['Primary Type', 'FBI Code'], axis=1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Testing Our Best Fitted Model\n",
|
|
"\n",
|
|
"We will use a confusion matrix to see how our model performs."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from sklearn.metrics import confusion_matrix\n",
|
|
"ypred = best_model.predict(X_test)\n",
|
|
"cm = confusion_matrix(y_test, ypred)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Visualize the confusion matrix\n",
|
|
"pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "anshirga"
|
|
}
|
|
],
|
|
"category": "tutorial",
|
|
"compute": [
|
|
"AML Compute"
|
|
],
|
|
"datasets": [
|
|
"Custom"
|
|
],
|
|
"deployment": [
|
|
"None"
|
|
],
|
|
"exclude_from_index": false,
|
|
"framework": [
|
|
"Automated Machine Learning"
|
|
],
|
|
"friendly_name": "How to use AutoMLStep with AML Pipelines",
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.7"
|
|
},
|
|
"order_index": 11,
|
|
"star_tag": [
|
|
"featured"
|
|
],
|
|
"tags": [
|
|
"None"
|
|
],
|
|
"task": "Demonstrates the use of AutoMLStep"
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
} |