mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-20 09:37:04 -05:00
607 lines
22 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
|
|
"\n",
|
|
"Licensed under the MIT License."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## AutoML Installation\n",
|
|
"\n",
|
|
"**For Databricks non ML runtime 7.1(scala 2.21, spark 3.0.0) and up, Install AML sdk by running the following command in the first cell of the notebook.**\n",
|
|
"\n",
|
|
"%pip install -r https://aka.ms/automl_linux_requirements.txt\n",
|
|
"\n",
|
|
"**For Databricks non ML runtime 7.0 and lower, Install AML sdk using init script as shown in [readme](readme.md) before running this notebook.**"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# AutoML : Classification with Local Compute on Azure DataBricks with deployment to ACI\n",
|
|
"\n",
|
|
"In this example we use the scikit-learn's to showcase how you can use AutoML for a simple classification problem.\n",
|
|
"\n",
|
|
"In this notebook you will learn how to:\n",
|
|
"1. Create Azure Machine Learning Workspace object and initialize your notebook directory to easily reload this object from a configuration file.\n",
|
|
"2. Create an `Experiment` in an existing `Workspace`.\n",
|
|
"3. Configure AutoML using `AutoMLConfig`.\n",
|
|
"4. Train the model using AzureDataBricks.\n",
|
|
"5. Explore the results.\n",
|
|
"6. Register the model.\n",
|
|
"7. Deploy the model.\n",
|
|
"8. Test the best fitted model.\n",
|
|
"\n",
|
|
"Prerequisites:\n",
|
|
"Before running this notebook, please follow the readme for installing necessary libraries to your cluster."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Register Machine Learning Services Resource Provider\n",
|
|
"Microsoft.MachineLearningServices only needs to be registed once in the subscription. To register it:\n",
|
|
"Start the Azure portal.\n",
|
|
"Select your All services and then Subscription.\n",
|
|
"Select the subscription that you want to use.\n",
|
|
"Click on Resource providers\n",
|
|
"Click the Register link next to Microsoft.MachineLearningServices"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Check the Azure ML Core SDK Version to Validate Your Installation"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.core\n",
|
|
"\n",
|
|
"print(\"SDK Version:\", azureml.core.VERSION)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Initialize an Azure ML Workspace\n",
|
|
"### What is an Azure ML Workspace and Why Do I Need One?\n",
|
|
"\n",
|
|
"An Azure ML workspace is an Azure resource that organizes and coordinates the actions of many other Azure resources to assist in executing and sharing machine learning workflows. In particular, an Azure ML workspace coordinates storage, databases, and compute resources providing added functionality for machine learning experimentation, operationalization, and the monitoring of operationalized models.\n",
|
|
"\n",
|
|
"\n",
|
|
"### What do I Need?\n",
|
|
"\n",
|
|
"To create or access an Azure ML workspace, you will need to import the Azure ML library and specify following information:\n",
|
|
"* A name for your workspace. You can choose one.\n",
|
|
"* Your subscription id. Use the `id` value from the `az account show` command output above.\n",
|
|
"* The resource group name. The resource group organizes Azure resources and provides a default region for the resources in the group. The resource group will be created if it doesn't exist. Resource groups can be created and viewed in the [Azure portal](https://portal.azure.com)\n",
|
|
"* Supported regions include `eastus2`, `eastus`,`westcentralus`, `southeastasia`, `westeurope`, `australiaeast`, `westus2`, `southcentralus`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"subscription_id = \"<Your SubscriptionId>\" #you should be owner or contributor\n",
|
|
"resource_group = \"<Resource group - new or existing>\" #you should be owner or contributor\n",
|
|
"workspace_name = \"<workspace to be created>\" #your workspace name\n",
|
|
"workspace_region = \"<azureregion>\" #your region"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Creating a Workspace\n",
|
|
"If you already have access to an Azure ML workspace you want to use, you can skip this cell. Otherwise, this cell will create an Azure ML workspace for you in the specified subscription, provided you have the correct permissions for the given `subscription_id`.\n",
|
|
"\n",
|
|
"This will fail when:\n",
|
|
"1. The workspace already exists.\n",
|
|
"2. You do not have permission to create a workspace in the resource group.\n",
|
|
"3. You are not a subscription owner or contributor and no Azure ML workspaces have ever been created in this subscription.\n",
|
|
"\n",
|
|
"If workspace creation fails for any reason other than already existing, please work with your IT administrator to provide you with the appropriate permissions or to provision the required resources.\n",
|
|
"\n",
|
|
"**Note:** Creation of a new workspace can take several minutes."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Import the Workspace class and check the Azure ML SDK version.\n",
|
|
"from azureml.core import Workspace\n",
|
|
"\n",
|
|
"ws = Workspace.create(name = workspace_name,\n",
|
|
" subscription_id = subscription_id,\n",
|
|
" resource_group = resource_group, \n",
|
|
" location = workspace_region, \n",
|
|
" exist_ok=True)\n",
|
|
"ws.get_details()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Configuring Your Local Environment\n",
|
|
"You can validate that you have access to the specified workspace and write a configuration file to the default configuration location, `./aml_config/config.json`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core import Workspace\n",
|
|
"\n",
|
|
"ws = Workspace(workspace_name = workspace_name,\n",
|
|
" subscription_id = subscription_id,\n",
|
|
" resource_group = resource_group)\n",
|
|
"\n",
|
|
"# Persist the subscription id, resource group name, and workspace name in aml_config/config.json.\n",
|
|
"ws.write_config()\n",
|
|
"write_config(path=\"/databricks/driver/aml_config/\",file_name=<alias_conf.cfg>)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Create an Experiment\n",
|
|
"\n",
|
|
"As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import logging\n",
|
|
"import os\n",
|
|
"import random\n",
|
|
"import time\n",
|
|
"import json\n",
|
|
"\n",
|
|
"from matplotlib import pyplot as plt\n",
|
|
"from matplotlib.pyplot import imshow\n",
|
|
"import numpy as np\n",
|
|
"import pandas as pd\n",
|
|
"\n",
|
|
"import azureml.core\n",
|
|
"from azureml.core.experiment import Experiment\n",
|
|
"from azureml.core.workspace import Workspace\n",
|
|
"from azureml.train.automl import AutoMLConfig\n",
|
|
"from azureml.train.automl.run import AutoMLRun"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Choose a name for the experiment and specify the project folder.\n",
|
|
"experiment_name = 'automl-local-classification'\n",
|
|
"\n",
|
|
"experiment = Experiment(ws, experiment_name)\n",
|
|
"\n",
|
|
"output = {}\n",
|
|
"output['SDK version'] = azureml.core.VERSION\n",
|
|
"output['Subscription ID'] = ws.subscription_id\n",
|
|
"output['Workspace Name'] = ws.name\n",
|
|
"output['Resource Group'] = ws.resource_group\n",
|
|
"output['Location'] = ws.location\n",
|
|
"output['Experiment Name'] = experiment.name\n",
|
|
"pd.set_option('display.max_colwidth', -1)\n",
|
|
"pd.DataFrame(data = output, index = ['']).T"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Load Training Data Using Dataset"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Automated ML takes a `TabularDataset` as input.\n",
|
|
"\n",
|
|
"You are free to use the data preparation libraries/tools of your choice to do the require preparation and once you are done, you can write it to a datastore and create a TabularDataset from it."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# The data referenced here was a 1MB simple random sample of the Chicago Crime data into a local temporary directory.\n",
|
|
"from azureml.core.dataset import Dataset\n",
|
|
"\n",
|
|
"example_data = 'https://dprepdata.blob.core.windows.net/demo/crime0-random.csv'\n",
|
|
"dataset = Dataset.Tabular.from_delimited_files(example_data)\n",
|
|
"dataset.take(5).to_pandas_dataframe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Review the TabularDataset\n",
|
|
"You can peek the result of a TabularDataset at any range using `skip(i)` and `take(j).to_pandas_dataframe()`. Doing so evaluates only j records for all the steps in the TabularDataset, which makes it fast even against large datasets."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"training_data = dataset.drop_columns(columns=['FBI Code'])\n",
|
|
"label = 'Primary Type'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Configure AutoML\n",
|
|
"\n",
|
|
"Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.\n",
|
|
"\n",
|
|
"|Property|Description|\n",
|
|
"|-|-|\n",
|
|
"|**task**|classification or regression|\n",
|
|
"|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|\n",
|
|
"|**primary_metric**|This is the metric that you want to optimize. Regression supports the following primary metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|\n",
|
|
"|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|\n",
|
|
"|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n",
|
|
"|**spark_context**|Spark Context object. for Databricks, use spark_context=sc|\n",
|
|
"|**max_concurrent_iterations**|Maximum number of iterations to execute in parallel. This should be <= number of worker nodes in your Azure Databricks cluster.|\n",
|
|
"|**n_cross_validations**|Number of cross validation splits.|\n",
|
|
"|**training_data**|Input dataset, containing both features and label column.|\n",
|
|
"|**label_column_name**|The name of the label column.|"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"automl_config = AutoMLConfig(task = 'classification',\n",
|
|
" debug_log = 'automl_errors.log',\n",
|
|
" primary_metric = 'AUC_weighted',\n",
|
|
" iteration_timeout_minutes = 10,\n",
|
|
" iterations = 5,\n",
|
|
" n_cross_validations = 10,\n",
|
|
" max_concurrent_iterations = 2, #change it based on number of worker nodes\n",
|
|
" verbosity = logging.INFO,\n",
|
|
" spark_context=sc, #databricks/spark related\n",
|
|
" training_data=training_data,\n",
|
|
" label_column_name=label)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Train the Models\n",
|
|
"\n",
|
|
"Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"local_run = experiment.submit(automl_config, show_output = True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Explore the Results"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Portal URL for Monitoring Runs\n",
|
|
"\n",
|
|
"The following will provide a link to the web interface to explore individual run details and status. In the future we might support output displayed in the notebook."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"displayHTML(\"<a href={} target='_blank'>Azure Portal: {}</a>\".format(local_run.get_portal_url(), local_run.id))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Retrieve All Child Runs after the experiment is completed (in portal)\n",
|
|
"You can also use SDK methods to fetch all the child runs and see individual metrics that we log."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"children = list(local_run.get_children())\n",
|
|
"metricslist = {}\n",
|
|
"for run in children:\n",
|
|
" properties = run.get_properties()\n",
|
|
" #print(properties)\n",
|
|
" metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)} \n",
|
|
" metricslist[int(properties['iteration'])] = metrics\n",
|
|
"\n",
|
|
"rundata = pd.DataFrame(metricslist).sort_index(1)\n",
|
|
"rundata"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Deploy\n",
|
|
"\n",
|
|
"### Retrieve the Best Model\n",
|
|
"\n",
|
|
"Below we select the best pipeline from our iterations. The `get_output` method on `automl_classifier` returns the best run and the fitted model for the last invocation. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"best_run, fitted_model = local_run.get_output()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Download the conda environment file\n",
|
|
"From the *best_run* download the conda environment file that was used to train the AutoML model."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.automl.core.shared import constants\n",
|
|
"conda_env_file_name = 'conda_env.yml'\n",
|
|
"best_run.download_file(name=\"outputs/conda_env_v_1_0_0.yml\", output_file_path=conda_env_file_name)\n",
|
|
"with open(conda_env_file_name, \"r\") as conda_file:\n",
|
|
" conda_file_contents = conda_file.read()\n",
|
|
" print(conda_file_contents)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Download the model scoring file\n",
|
|
"From the *best_run* download the scoring file to get the predictions from the AutoML model."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.automl.core.shared import constants\n",
|
|
"script_file_name = 'scoring_file.py'\n",
|
|
"best_run.download_file(name=\"outputs/scoring_file_v_1_0_0.py\", output_file_path=script_file_name)\n",
|
|
"with open(script_file_name, \"r\") as scoring_file:\n",
|
|
" scoring_file_contents = scoring_file.read()\n",
|
|
" print(scoring_file_contents)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Register the Fitted Model for Deployment\n",
|
|
"If neither metric nor iteration are specified in the register_model call, the iteration with the best primary metric is registered."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"description = 'AutoML Model'\n",
|
|
"tags = None\n",
|
|
"model = local_run.register_model(description = description, tags = tags)\n",
|
|
"local_run.model_id # This will be written to the scoring script file later in the notebook."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Deploy the model as a Web Service on Azure Container Instance\n",
|
|
"\n",
|
|
"Create the configuration needed for deploying the model as a web service service."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.model import InferenceConfig\n",
|
|
"from azureml.core.webservice import AciWebservice\n",
|
|
"from azureml.core.environment import Environment\n",
|
|
"\n",
|
|
"myenv = Environment.from_conda_specification(name=\"myenv\", file_path=conda_env_file_name)\n",
|
|
"inference_config = InferenceConfig(entry_script=script_file_name, environment=myenv)\n",
|
|
"\n",
|
|
"aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1, \n",
|
|
" memory_gb = 1, \n",
|
|
" tags = {'area': \"digits\", 'type': \"automl_classification\"}, \n",
|
|
" description = 'sample service for Automl Classification')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.webservice import Webservice\n",
|
|
"from azureml.core.model import Model\n",
|
|
"\n",
|
|
"aci_service_name = 'automl-databricks-local'\n",
|
|
"print(aci_service_name)\n",
|
|
"aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)\n",
|
|
"aci_service.wait_for_deployment(True)\n",
|
|
"print(aci_service.state)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Test the Best Fitted Model\n",
|
|
"\n",
|
|
"#### Load Test Data - you can split the dataset beforehand & pass Train dataset to AutoML and use Test dataset to evaluate the best model."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dataset_test = Dataset.Tabular.from_delimited_files(path='https://dprepdata.blob.core.windows.net/demo/crime0-test.csv')\n",
|
|
"\n",
|
|
"df_test = dataset_test.to_pandas_dataframe()\n",
|
|
"df_test = df_test[pd.notnull(df_test['Primary Type'])]\n",
|
|
"\n",
|
|
"y_test = df_test[['Primary Type']]\n",
|
|
"X_test = df_test.drop(['Primary Type', 'FBI Code'], axis=1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Testing Our Best Fitted Model\n",
|
|
"We will try to predict digits and see how our model works. This is just an example to show you."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"fitted_model.predict(X_test)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"### Delete the service"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"aci_service.delete()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
""
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "savitam"
|
|
},
|
|
{
|
|
"name": "sasum"
|
|
}
|
|
],
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.8"
|
|
},
|
|
"name": "auto-ml-classification-local-adb",
|
|
"notebookId": 3772036807853791
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 1
|
|
}