{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright (c) Microsoft Corporation. All rights reserved.\n", "\n", "Licensed under the MIT License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/sparse-data-train-test-split/auto-ml-sparse-data-train-test-split.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Automated Machine Learning\n", "_**Train Test Split and Handling Sparse Data**_\n", "\n", "## Contents\n", "1. [Introduction](#Introduction)\n", "1. [Setup](#Setup)\n", "1. [Data](#Data)\n", "1. [Train](#Train)\n", "1. [Results](#Results)\n", "1. [Test](#Test)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "In this example we use the scikit-learn's [20newsgroup](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html) to showcase how you can use AutoML for handling sparse data and how to specify custom cross validations splits.\n", "Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n", "\n", "In this notebook you will learn how to:\n", "1. Create an `Experiment` in an existing `Workspace`.\n", "2. Configure AutoML using `AutoMLConfig`.\n", "4. Train the model.\n", "5. Explore the results.\n", "6. Test the best fitted model.\n", "\n", "In addition this notebook showcases the following features\n", "- Explicit train test splits \n", "- Handling **sparse data** in the input" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "import pandas as pd\n", "\n", "import azureml.core\n", "from azureml.core.experiment import Experiment\n", "from azureml.core.workspace import Workspace\n", "from azureml.train.automl import AutoMLConfig" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ws = Workspace.from_config()\n", "\n", "# Choose a name for the experiment.\n", "experiment_name = 'sparse-data-train-test-split'\n", "\n", "experiment = Experiment(ws, experiment_name)\n", "\n", "output = {}\n", "output['SDK version'] = azureml.core.VERSION\n", "output['Subscription ID'] = ws.subscription_id\n", "output['Workspace'] = ws.name\n", "output['Resource Group'] = ws.resource_group\n", "output['Location'] = ws.location\n", "output['Experiment Name'] = experiment.name\n", "pd.set_option('display.max_colwidth', -1)\n", "outputDf = pd.DataFrame(data = output, index = [''])\n", "outputDf.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "from sklearn.feature_extraction.text import HashingVectorizer\n", "from sklearn.model_selection import train_test_split\n", "\n", "# Strip headers, footers and quotes so models cannot exploit them as shortcuts.\n", "remove = ('headers', 'footers', 'quotes')\n", "categories = [\n", "    'alt.atheism',\n", "    'talk.religion.misc',\n", "    'comp.graphics',\n", "    'sci.space',\n", "]\n", "data_train = fetch_20newsgroups(subset = 'train', categories = categories,\n", "                                shuffle = True, random_state = 42,\n", "                                remove = remove)\n", "\n", "# Hold out a third of the training documents as an explicit validation set.\n", "X_train, X_valid, y_train, y_valid = train_test_split(data_train.data, data_train.target, test_size = 0.33, random_state = 42)\n", "\n", "# Convert the text into a sparse matrix of hashed token frequencies.\n", "vectorizer = HashingVectorizer(stop_words = 'english', alternate_sign = False,\n", "                               n_features = 2**16)\n", "X_train = vectorizer.transform(X_train)\n", "X_valid = vectorizer.transform(X_valid)\n", "\n", "summary_df = pd.DataFrame(index = ['No of Samples', 'No of Features'])\n", "summary_df['Train Set'] = [X_train.shape[0], X_train.shape[1]]\n", "summary_df['Validation Set'] = [X_valid.shape[0], X_valid.shape[1]]\n", "summary_df" ] },
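{ "cell_type": "markdown", "metadata": {}, "source": [ "`HashingVectorizer.transform` returns a SciPy sparse matrix. The cell below is a minimal sanity check added for illustration (it is not required for the run): it confirms the type and density of the training matrix and that all four categories survive the random split." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy import sparse\n", "\n", "# The vectorizer output should be a SciPy sparse matrix.\n", "print(type(X_train), sparse.issparse(X_train))\n", "\n", "# Fraction of non-zero entries; a small value confirms the data is sparse.\n", "print('Density of X_train: {0:.6f}'.format(X_train.nnz / float(X_train.shape[0] * X_train.shape[1])))\n", "\n", "# Class counts in the custom train and validation sets.\n", "print('Train class counts:     ', np.bincount(y_train))\n", "print('Validation class counts:', np.bincount(y_valid))" ] },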
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Train\n", "\n", "Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.\n", "\n", "|Property|Description|\n", "|-|-|\n", "|**task**|classification or regression|\n", "|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics:<br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|\n", "|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|\n", "|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n", "|**preprocess**|Setting this to *True* enables AutoML to perform preprocessing on the input to handle *missing data*, and to perform some common *feature extraction*.<br>**Note:** *preprocess* cannot be set to *True* if the input data is sparse.|\n", "|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n", "|**y**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|\n", "|**X_valid**|(sparse) array-like, shape = [n_samples, n_features] for the custom validation set.|\n", "|**y_valid**|(sparse) array-like, shape = [n_samples, ], Multi-class targets for the custom validation set.|" ] },
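{ "cell_type": "markdown", "metadata": {}, "source": [ "If you want to confirm the supported primary metrics programmatically, the SDK exposes a small helper for this. The cell below is an optional sketch and assumes `get_primary_metrics` is available in your installed `azureml-train-automl` version:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: list the valid primary metrics for classification.\n", "# Assumes this helper exists in your azureml-train-automl version.\n", "from azureml.train.automl.utilities import get_primary_metrics\n", "print(get_primary_metrics('classification'))" ] },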
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl_config = AutoMLConfig(task = 'classification',\n", "                             debug_log = 'automl_errors.log',\n", "                             primary_metric = 'AUC_weighted',\n", "                             iteration_timeout_minutes = 60,\n", "                             iterations = 5,\n", "                             preprocess = False,\n", "                             verbosity = logging.INFO,\n", "                             X = X_train,\n", "                             y = y_train,\n", "                             X_valid = X_valid,\n", "                             y_valid = y_valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations, this can run for a while.\n", "In this example, we specify `show_output = True` to print currently running iterations to the console." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_run = experiment.submit(automl_config, show_output = True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_run" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Widget for Monitoring Runs\n", "\n", "The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n", "\n", "**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.widgets import RunDetails\n", "RunDetails(local_run).show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Retrieve All Child Runs\n", "You can also use SDK methods to fetch all the child runs and see the individual metrics that we log." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "children = list(local_run.get_children())\n", "metricslist = {}\n", "for run in children:\n", "    properties = run.get_properties()\n", "    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n", "    metricslist[int(properties['iteration'])] = metrics\n", "\n", "rundata = pd.DataFrame(metricslist).sort_index(axis = 1)\n", "rundata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retrieve the Best Model\n", "\n", "Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_run, fitted_model = local_run.get_output()" ] },
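{ "cell_type": "markdown", "metadata": {}, "source": [ "The returned `fitted_model` is a scikit-learn pipeline. As an optional check added for illustration (it assumes the standard scikit-learn `Pipeline` interface), you can list the steps it contains:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: inspect the steps of the best pipeline.\n", "# Assumes fitted_model follows the scikit-learn Pipeline interface.\n", "for step_name, step in fitted_model.steps:\n", "    print(step_name, type(step).__name__)" ] },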
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Best Model Based on Any Other Metric\n", "Show the run and the model that has the highest `accuracy` value:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# lookup_metric = \"accuracy\"\n", "# best_run, fitted_model = local_run.get_output(metric = lookup_metric)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Model from a Specific Iteration\n", "Show the run and the model from iteration 3:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# iteration = 3\n", "# best_run, fitted_model = local_run.get_output(iteration = iteration)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load test data; ConfusionMatrix requires the pandas_ml package.\n", "from pandas_ml import ConfusionMatrix\n", "\n", "data_test = fetch_20newsgroups(subset = 'test', categories = categories,\n", "                               shuffle = True, random_state = 42,\n", "                               remove = remove)\n", "\n", "# Vectorize the test documents with the same HashingVectorizer used for training.\n", "X_test = vectorizer.transform(data_test.data)\n", "y_test = data_test.target\n", "\n", "# Test our best pipeline.\n", "\n", "y_pred = fitted_model.predict(X_test)\n", "y_pred_strings = [data_test.target_names[i] for i in y_pred]\n", "y_test_strings = [data_test.target_names[i] for i in y_test]\n", "\n", "cm = ConfusionMatrix(y_test_strings, y_pred_strings)\n", "print(cm)\n", "cm.plot()" ] } ], "metadata": { "authors": [ { "name": "savitam" } ], "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }