{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automated Machine Learning\n",
"_**Blacklisting Models, Early Termination, and Handling Missing Data**_\n",
"\n",
"## Contents\n",
"1. [Introduction](#Introduction)\n",
"1. [Setup](#Setup)\n",
"1. [Data](#Data)\n",
"1. [Train](#Train)\n",
"1. [Results](#Results)\n",
"1. [Test](#Test)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"In this example we use the scikit-learn's [digit dataset](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) to showcase how you can use AutoML for handling missing values in data. We also provide a stopping metric indicating a target for the primary metrics so that AutoML can terminate the run without necessarly going through all the iterations. Finally, if you want to avoid a certain pipeline, we allow you to specify a blacklist of algorithms that AutoML will ignore for this run.\n",
|
|
"\n",
|
|
"Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n",
|
|
"\n",
|
|
"In this notebook you will learn how to:\n",
|
|
"1. Create an `Experiment` in an existing `Workspace`.\n",
|
|
"2. Configure AutoML using `AutoMLConfig`.\n",
|
|
"3. Train the model.\n",
|
|
"4. Explore the results.\n",
|
|
"5. Viewing the engineered names for featurized data and featurization summary for all raw features.\n",
|
|
"6. Test the best fitted model.\n",
|
|
"\n",
|
|
"In addition this notebook showcases the following features\n",
|
|
"- **Blacklisting** certain pipelines\n",
|
|
"- Specifying **target metrics** to indicate stopping criteria\n",
|
|
"- Handling **missing data** in the input"
|
|
]
|
|
},
|
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"\n",
"from matplotlib import pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn import datasets\n",
"\n",
"import azureml.core\n",
"from azureml.core.experiment import Experiment\n",
"from azureml.core.workspace import Workspace\n",
"from azureml.train.automl import AutoMLConfig"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# Choose a name for the experiment.\n",
"experiment_name = 'automl-local-missing-data'\n",
"project_folder = './sample_projects/automl-local-missing-data'\n",
"\n",
"experiment = Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output['SDK version'] = azureml.core.VERSION\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Project Directory'] = project_folder\n",
"output['Experiment Name'] = experiment.name\n",
"pd.set_option('display.max_colwidth', -1)\n",
"outputDf = pd.DataFrame(data = output, index = [''])\n",
"outputDf.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"digits = datasets.load_digits()\n",
"X_train = digits.data[10:,:]\n",
"y_train = digits.target[10:]\n",
"\n",
"# Add a missing value to 75% of the rows, one randomly chosen feature per affected row.\n",
"missing_rate = 0.75\n",
"n_missing_samples = int(np.floor(X_train.shape[0] * missing_rate))\n",
"missing_samples = np.hstack((np.zeros(X_train.shape[0] - n_missing_samples, dtype=bool), np.ones(n_missing_samples, dtype=bool)))\n",
"rng = np.random.RandomState(0)\n",
"rng.shuffle(missing_samples)\n",
"missing_features = rng.randint(0, X_train.shape[1], n_missing_samples)\n",
"X_train[np.where(missing_samples)[0], missing_features] = np.nan"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(data = X_train)\n",
"df['Label'] = pd.Series(y_train, index=df.index)\n",
"df.head()"
]
},
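{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (an illustrative addition, not part of the original flow), we can count how many rows now contain at least one missing value; it should match the 75% rate chosen above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative check: fraction of rows containing at least one NaN.\n",
"rows_with_nan = np.isnan(X_train).any(axis=1).sum()\n",
"print('%d of %d rows (%.0f%%) contain a missing value' % (rows_with_nan, X_train.shape[0], 100.0 * rows_with_nan / X_train.shape[0]))"
]
},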
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train\n",
"\n",
"Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment. This includes setting `experiment_exit_score`, which should cause the run to complete before the `iterations` count is reached.\n",
"\n",
"|Property|Description|\n",
"|-|-|\n",
"|**task**|classification or regression|\n",
"|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|\n",
"|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|\n",
"|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n",
"|**preprocess**|Setting this to *True* enables AutoML to perform preprocessing on the input to handle *missing data*, and to perform some common *feature extraction*.|\n",
"|**experiment_exit_score**|*double* value indicating the target for *primary_metric*. <br>Once the target is surpassed, the run terminates.|\n",
"|**blacklist_models**|*List* of *strings* indicating machine learning algorithms for AutoML to avoid in this run.<br><br> Allowed values for **Classification**<br><i>LogisticRegression</i><br><i>SGD</i><br><i>MultinomialNaiveBayes</i><br><i>BernoulliNaiveBayes</i><br><i>SVM</i><br><i>LinearSVM</i><br><i>KNN</i><br><i>DecisionTree</i><br><i>RandomForest</i><br><i>ExtremeRandomTrees</i><br><i>LightGBM</i><br><i>GradientBoosting</i><br><i>TensorFlowDNN</i><br><i>TensorFlowLinearClassifier</i><br><br>Allowed values for **Regression**<br><i>ElasticNet</i><br><i>GradientBoosting</i><br><i>DecisionTree</i><br><i>KNN</i><br><i>LassoLars</i><br><i>SGD</i><br><i>RandomForest</i><br><i>ExtremeRandomTrees</i><br><i>LightGBM</i><br><i>TensorFlowLinearRegressor</i><br><i>TensorFlowDNN</i>|\n",
"|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n",
"|**y**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|\n",
"|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_config = AutoMLConfig(task = 'classification',\n",
"                             debug_log = 'automl_errors.log',\n",
"                             primary_metric = 'AUC_weighted',\n",
"                             iteration_timeout_minutes = 60,\n",
"                             iterations = 20,\n",
"                             preprocess = True,\n",
"                             experiment_exit_score = 0.9984,\n",
"                             blacklist_models = ['KNN','LinearSVM'],\n",
"                             verbosity = logging.INFO,\n",
"                             X = X_train,\n",
"                             y = y_train,\n",
"                             path = project_folder)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.\n",
"In this example, we specify `show_output = True` to print currently running iterations to the console."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_run = experiment.submit(automl_config, show_output = True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Widget for Monitoring Runs\n",
"\n",
"The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n",
"\n",
"**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(local_run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Retrieve All Child Runs\n",
"You can also use SDK methods to fetch all the child runs and see the individual metrics that we log."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"children = list(local_run.get_children())\n",
"metricslist = {}\n",
"for run in children:\n",
"    properties = run.get_properties()\n",
"    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
"    metricslist[int(properties['iteration'])] = metrics\n",
"\n",
"rundata = pd.DataFrame(metricslist).sort_index(axis=1)\n",
"rundata"
]
},
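{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sketch (an illustrative addition, not part of the original notebook), we can plot the primary metric across iterations; if `experiment_exit_score` was reached, the series simply stops early. This assumes the primary metric `AUC_weighted` appears in the metrics table built above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative plot: primary metric per iteration (guarded, since the metric\n",
"# name is an assumption about what each child run logged).\n",
"if 'AUC_weighted' in rundata.index:\n",
"    rundata.loc['AUC_weighted'].plot(marker = 'o')\n",
"    plt.xlabel('iteration')\n",
"    plt.ylabel('AUC_weighted')\n",
"    plt.show()"
]
},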
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve the Best Model\n",
"\n",
"Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run, fitted_model = local_run.get_output()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Best Model Based on Any Other Metric\n",
"Show the run and the model that has the best (highest) `accuracy` value:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# lookup_metric = \"accuracy\"\n",
"# best_run, fitted_model = local_run.get_output(metric = lookup_metric)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Model from a Specific Iteration\n",
"Show the run and the model from the third iteration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# iteration = 3\n",
"# best_run, fitted_model = local_run.get_output(iteration = iteration)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### View the engineered names for featurized data\n",
"Below we display the engineered feature names that the preprocessing (featurization) step generated from the raw input data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fitted_model.named_steps['datatransformer'].get_engineered_feature_names()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### View the featurization summary\n",
"Below we display the featurization that was performed on the different raw features in the user data. For each raw feature, the following information is displayed:\n",
"- Raw feature name\n",
"- Number of engineered features formed out of this raw feature\n",
"- Type detected\n",
"- Whether the feature was dropped\n",
"- List of feature transformations applied to the raw feature"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fitted_model.named_steps['datatransformer'].get_featurization_summary()"
]
},
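{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sketch (an illustrative addition, not part of the original flow), the summary returned above is a list of dictionaries, one per raw feature, so it can be rendered as a pandas DataFrame for easier reading:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: tabulate the featurization summary.\n",
"# Assumes get_featurization_summary() returns a list of dicts, one per raw feature.\n",
"summary = fitted_model.named_steps['datatransformer'].get_featurization_summary()\n",
"pd.DataFrame(summary)"
]
},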
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"digits = datasets.load_digits()\n",
"X_test = digits.data[:10, :]\n",
"y_test = digits.target[:10]\n",
"images = digits.images[:10]\n",
"\n",
"# Randomly select digits and test.\n",
"for index in np.random.choice(len(y_test), 2, replace = False):\n",
"    print(index)\n",
"    predicted = fitted_model.predict(X_test[index:index + 1])[0]\n",
"    label = y_test[index]\n",
"    title = \"Label value = %d Predicted value = %d \" % (label, predicted)\n",
"    fig = plt.figure(1, figsize=(3,3))\n",
"    ax1 = fig.add_axes((0,0,.8,.8))\n",
"    ax1.set_title(title)\n",
"    plt.imshow(images[index], cmap = plt.cm.gray_r, interpolation = 'nearest')\n",
"    plt.show()\n"
]
},
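{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional follow-up (an illustrative addition, not part of the original notebook), we can also score the whole ten-sample holdout at once. This assumes only that the fitted pipeline exposes the standard scikit-learn `predict` API, which the cell above already relies on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative check: overall accuracy on the small holdout set.\n",
"from sklearn.metrics import accuracy_score\n",
"print('Holdout accuracy: %.3f' % accuracy_score(y_test, fitted_model.predict(X_test)))"
]
}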
],
"metadata": {
"authors": [
{
"name": "savitam"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}