mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-20 17:45:10 -05:00
523 lines
17 KiB
Plaintext
523 lines
17 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
|
|
"Licensed under the MIT License."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Azure Machine Learning Pipeline with AutoMLStep\n",
|
|
"This notebook demonstrates the use of AutoMLStep in Azure Machine Learning Pipeline."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Introduction\n",
|
|
"In this example we showcase how you can use AzureML Dataset to load data for AutoML via AML Pipeline. \n",
|
|
"\n",
|
|
"If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook.\n",
|
|
"\n",
|
|
"In this notebook you will learn how to:\n",
|
|
"1. Create an `Experiment` in an existing `Workspace`.\n",
|
|
"2. Create or Attach existing AmlCompute to a workspace.\n",
|
|
"3. Define data loading in a `TabularDataset`.\n",
|
|
"4. Configure AutoML using `AutoMLConfig`.\n",
|
|
"5. Use AutoMLStep\n",
|
|
"6. Train the model using AmlCompute\n",
|
|
"7. Explore the results.\n",
|
|
"8. Test the best fitted model."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Azure Machine Learning and Pipeline SDK-specific imports"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import logging\n",
|
|
"import os\n",
|
|
"import csv\n",
|
|
"\n",
|
|
"from matplotlib import pyplot as plt\n",
|
|
"import numpy as np\n",
|
|
"import pandas as pd\n",
|
|
"from sklearn import datasets\n",
|
|
"import pkg_resources\n",
|
|
"\n",
|
|
"import azureml.core\n",
|
|
"from azureml.core.experiment import Experiment\n",
|
|
"from azureml.core.workspace import Workspace\n",
|
|
"from azureml.train.automl import AutoMLConfig\n",
|
|
"from azureml.core.compute import AmlCompute\n",
|
|
"from azureml.core.compute import ComputeTarget\n",
|
|
"from azureml.core.dataset import Dataset\n",
|
|
"from azureml.core.runconfig import RunConfiguration\n",
|
|
"from azureml.core.conda_dependencies import CondaDependencies\n",
|
|
"\n",
|
|
"from azureml.train.automl import AutoMLStep\n",
|
|
"\n",
|
|
"# Check core SDK version number\n",
|
|
"print(\"SDK version:\", azureml.core.VERSION)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Initialize Workspace\n",
|
|
"Initialize a workspace object from persisted configuration. Make sure the config file is present at .\\config.json"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"ws = Workspace.from_config()\n",
|
|
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Create an Azure ML experiment\n",
|
|
"Let's create an experiment named \"automl-classification\" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.\n",
|
|
"\n",
|
|
"The best practice is to use separate folders for scripts and its dependent files for each step and specify that folder as the `source_directory` for the step. This helps reduce the size of the snapshot created for the step (only the specific folder is snapshotted). Since changes in any files in the `source_directory` would trigger a re-upload of the snapshot, this helps keep the reuse of the step when there are no changes in the `source_directory` of the step."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Choose a name for the run history container in the workspace.\n",
|
|
"experiment_name = 'automlstep-classification'\n",
|
|
"project_folder = './project'\n",
|
|
"\n",
|
|
"experiment = Experiment(ws, experiment_name)\n",
|
|
"experiment"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Create or Attach an AmlCompute cluster\n",
|
|
"You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you get the default `AmlCompute` as your training compute resource."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Choose a name for your cluster.\n",
|
|
"amlcompute_cluster_name = \"cpu-cluster\"\n",
|
|
"\n",
|
|
"found = False\n",
|
|
"# Check if this compute target already exists in the workspace.\n",
|
|
"cts = ws.compute_targets\n",
|
|
"if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':\n",
|
|
" found = True\n",
|
|
" print('Found existing compute target.')\n",
|
|
" compute_target = cts[amlcompute_cluster_name]\n",
|
|
" \n",
|
|
"if not found:\n",
|
|
" print('Creating a new compute target...')\n",
|
|
" provisioning_config = AmlCompute.provisioning_configuration(vm_size = \"STANDARD_D2_V2\", # for GPU, use \"STANDARD_NC6\"\n",
|
|
" #vm_priority = 'lowpriority', # optional\n",
|
|
" max_nodes = 4)\n",
|
|
"\n",
|
|
" # Create the cluster.\n",
|
|
" compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)\n",
|
|
" \n",
|
|
" # Can poll for a minimum number of nodes and for a specific timeout.\n",
|
|
" # If no min_node_count is provided, it will use the scale settings for the cluster.\n",
|
|
" compute_target.wait_for_completion(show_output = True, min_node_count = 1, timeout_in_minutes = 10)\n",
|
|
" \n",
|
|
" # For a more detailed view of current AmlCompute status, use get_status()."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# create a new RunConfig object\n",
|
|
"conda_run_config = RunConfiguration(framework=\"python\")\n",
|
|
"\n",
|
|
"conda_run_config.environment.docker.enabled = True\n",
|
|
"conda_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE\n",
|
|
"\n",
|
|
"cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], \n",
|
|
" conda_packages=['numpy', 'py-xgboost<=0.80'])\n",
|
|
"conda_run_config.environment.python.conda_dependencies = cd\n",
|
|
"\n",
|
|
"print('run config is ready')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# The data referenced here was a 1MB simple random sample of the Chicago Crime data into a local temporary directory.\n",
|
|
"example_data = 'https://dprepdata.blob.core.windows.net/demo/crime0-random.csv'\n",
|
|
"dataset = Dataset.Tabular.from_delimited_files(example_data)\n",
|
|
"dataset.to_pandas_dataframe().describe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dataset.take(5).to_pandas_dataframe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Review the Dataset Result\n",
|
|
"\n",
|
|
"You can peek the result of a TabularDataset at any range using `skip(i)` and `take(j).to_pandas_dataframe()`. Doing so evaluates only `j` records for all the steps in the TabularDataset, which makes it fast even against large datasets.\n",
|
|
"\n",
|
|
"`TabularDataset` objects are composed of a list of transformation steps (optional)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"X = dataset.drop_columns(columns=['Primary Type', 'FBI Code'])\n",
|
|
"y = dataset.keep_columns(columns=['Primary Type'], validate=True)\n",
|
|
"print('X and y are ready!')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Train\n",
|
|
"This creates a general AutoML settings object."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"automl_settings = {\n",
|
|
" \"iteration_timeout_minutes\" : 5,\n",
|
|
" \"iterations\" : 2,\n",
|
|
" \"primary_metric\" : 'AUC_weighted',\n",
|
|
" \"preprocess\" : True,\n",
|
|
" \"verbosity\" : logging.INFO\n",
|
|
"}\n",
|
|
"automl_config = AutoMLConfig(task = 'classification',\n",
|
|
" debug_log = 'automl_errors.log',\n",
|
|
" path = project_folder,\n",
|
|
" compute_target=compute_target,\n",
|
|
" run_configuration=conda_run_config,\n",
|
|
" X = X,\n",
|
|
" y = y,\n",
|
|
" **automl_settings\n",
|
|
" )"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"You can define outputs for the AutoMLStep using TrainingOutput."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.pipeline.core import PipelineData, TrainingOutput\n",
|
|
"\n",
|
|
"ds = ws.get_default_datastore()\n",
|
|
"metrics_output_name = 'metrics_output'\n",
|
|
"best_model_output_name = 'best_model_output'\n",
|
|
"\n",
|
|
"metirics_data = PipelineData(name='metrics_data',\n",
|
|
" datastore=ds,\n",
|
|
" pipeline_output_name=metrics_output_name,\n",
|
|
" training_output=TrainingOutput(type='Metrics'))\n",
|
|
"model_data = PipelineData(name='model_data',\n",
|
|
" datastore=ds,\n",
|
|
" pipeline_output_name=best_model_output_name,\n",
|
|
" training_output=TrainingOutput(type='Model'))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Create an AutoMLStep."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"automl_step = AutoMLStep(\n",
|
|
" name='automl_module',\n",
|
|
" automl_config=automl_config,\n",
|
|
" outputs=[metirics_data, model_data],\n",
|
|
" allow_reuse=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.pipeline.core import Pipeline\n",
|
|
"pipeline = Pipeline(\n",
|
|
" description=\"pipeline_with_automlstep\",\n",
|
|
" workspace=ws, \n",
|
|
" steps=[automl_step])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"pipeline_run = experiment.submit(pipeline)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.widgets import RunDetails\n",
|
|
"RunDetails(pipeline_run).show()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"pipeline_run.wait_for_completion()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Examine Results\n",
|
|
"\n",
|
|
"### Retrieve the metrics of all child runs\n",
|
|
"Outputs of above run can be used as inputs of other steps in pipeline. In this tutorial, we will examine the outputs by retrieve output data and running some tests."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)\n",
|
|
"num_file_downloaded = metrics_output.download('.', show_progress=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import json\n",
|
|
"with open(metrics_output._path_on_datastore) as f: \n",
|
|
" metrics_output_result = f.read()\n",
|
|
" \n",
|
|
"deserialized_metrics_output = json.loads(metrics_output_result)\n",
|
|
"df = pd.DataFrame(deserialized_metrics_output)\n",
|
|
"df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Retrieve the Best Model"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)\n",
|
|
"num_file_downloaded = best_model_output.download('.', show_progress=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import pickle\n",
|
|
"\n",
|
|
"with open(best_model_output._path_on_datastore, \"rb\" ) as f:\n",
|
|
" best_model = pickle.load(f)\n",
|
|
"best_model"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Test the Model\n",
|
|
"#### Load Test Data\n",
|
|
"For the test data, it should have the same preparation step as the train data. Otherwise it might get failed at the preprocessing step."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dataset = Dataset.Tabular.from_delimited_files(path='https://dprepdata.blob.core.windows.net/demo/crime0-test.csv')\n",
|
|
"df_test = dataset_test.to_pandas_dataframe()\n",
|
|
"df_test = df_test[pd.notnull(df['Primary Type'])]\n",
|
|
"\n",
|
|
"y_test = df_test[['Primary Type']]\n",
|
|
"X_test = df_test.drop(['Primary Type', 'FBI Code'], axis=1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Testing Our Best Fitted Model\n",
|
|
"\n",
|
|
"We will use confusion matrix to see how our model works."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from pandas_ml import ConfusionMatrix\n",
|
|
"\n",
|
|
"ypred = best_model.predict(X_test)\n",
|
|
"\n",
|
|
"cm = ConfusionMatrix(y_test['Primary Type'], ypred)\n",
|
|
"\n",
|
|
"print(cm)\n",
|
|
"\n",
|
|
"cm.plot()"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "sanpil"
|
|
}
|
|
],
|
|
"category": "tutorial",
|
|
"compute": [
|
|
"AML Compute"
|
|
],
|
|
"datasets": [
|
|
"Custom"
|
|
],
|
|
"deployment": [
|
|
"None"
|
|
],
|
|
"exclude_from_index": false,
|
|
"framework": [
|
|
"Automated Machine Learning"
|
|
],
|
|
"friendly_name": "How to use AutoMLStep with AML Pipelines",
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.7"
|
|
},
|
|
"order_index": 11,
|
|
"star_tag": [
|
|
"featured"
|
|
],
|
|
"tags": [
|
|
"None"
|
|
],
|
|
"task": "Demonstrates the use of AutoMLStep"
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
} |