Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Automated Machine Learning
_**Train Test Split and Handling Sparse Data**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Results](#Results)
1. [Test](#Test)


## Introduction
In this example we use scikit-learn's [20 newsgroups](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html) dataset to show how AutoML handles sparse data and how to specify a custom validation split.
Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Configure AutoML using `AutoMLConfig`.
3. Train the model.
4. Explore the results.
5. Test the best fitted model.

In addition, this notebook showcases the following features:
- Explicit train test splits
- Handling **sparse data** in the input

## Setup

As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [None]:
import logging

import pandas as pd

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig

In [None]:
ws = Workspace.from_config()

# choose a name for the experiment
experiment_name = 'sparse-data-train-test-split'
# project folder
project_folder = './sample_projects/sparse-data-train-test-split'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

## Data

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import train_test_split

remove = ('headers', 'footers', 'quotes')
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
data_train = fetch_20newsgroups(subset = 'train', categories = categories,
                                shuffle = True, random_state = 42,
                                remove = remove)

X_train, X_valid, y_train, y_valid = train_test_split(data_train.data, data_train.target, test_size = 0.33, random_state = 42)


vectorizer = HashingVectorizer(stop_words = 'english', alternate_sign = False,
                               n_features = 2**16)
X_train = vectorizer.transform(X_train)
X_valid = vectorizer.transform(X_valid)

summary_df = pd.DataFrame(index = ['No of Samples', 'No of Features'])
summary_df['Train Set'] = [X_train.shape[0], X_train.shape[1]]
summary_df['Validation Set'] = [X_valid.shape[0], X_valid.shape[1]]
summary_df
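
`HashingVectorizer` is stateless, which is why the cell above can call `transform` without a prior `fit`: each token is hashed directly to a column index. The following is a simplified sketch of that idea in plain Python. It assumes whitespace tokenization and `zlib.crc32` as the hash function, whereas scikit-learn uses its own tokenizer and MurmurHash3, so the column indices will not match those of the real vectorizer. `hash_vectorize` is a hypothetical helper name.

```python
import zlib
from collections import Counter


def hash_vectorize(text, n_features=2**16):
    """Map a document to a sparse {column_index: count} dict.

    Simplified illustration of the hashing trick; not scikit-learn's
    implementation (different tokenizer and hash function).
    """
    counts = Counter()
    for token in text.lower().split():
        # A stable hash of the token picks the column; no vocabulary is stored.
        column = zlib.crc32(token.encode("utf-8")) % n_features
        counts[column] += 1
    return dict(counts)


docs = ["the rocket reached orbit", "the image was rendered"]
rows = [hash_vectorize(d) for d in docs]

# Only non-zero entries are stored, so each row is tiny compared with
# the 65,536 columns of the full feature space.
for doc, row in zip(docs, rows):
    density = len(row) / 2**16
    print(f"{doc!r}: {len(row)} non-zero columns, density {density:.6f}")
```

Because almost every column is zero for any single document, storing only the non-zero entries is what makes a 65,536-column feature space practical, and it is the reason the input passed to AutoML here is sparse.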

## Train

Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|classification or regression|
|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|
|**preprocess**|Setting this to *True* enables AutoML to perform preprocessing on the input to handle *missing data*, and to perform some common *feature extraction*.<br>**Note:** If input data is sparse, you cannot use *True*.|
|**X**|(sparse) array-like, shape = [n_samples, n_features]|
|**y**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|
|**X_valid**|(sparse) array-like, shape = [n_samples, n_features] for the custom validation set.|
|**y_valid**|(sparse) array-like, shape = [n_samples, ], Multi-class targets.|
|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|
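
The constraints in the table can be checked before the configuration is built. The sketch below is a hypothetical pre-flight helper, `check_settings`, and is not part of the Azure ML SDK. It assumes the set of supported metrics is exactly the one listed in the table above, and that the caller already knows whether the input is sparse.

```python
# Primary metrics listed for classification in the table above.
CLASSIFICATION_METRICS = {
    "accuracy",
    "AUC_weighted",
    "average_precision_score_weighted",
    "norm_macro_recall",
    "precision_score_weighted",
}


def check_settings(primary_metric, preprocess, is_sparse):
    """Return a list of problems found in the proposed settings.

    Hypothetical helper for illustration; not an Azure ML SDK function.
    """
    problems = []
    if primary_metric not in CLASSIFICATION_METRICS:
        problems.append(f"unsupported primary_metric: {primary_metric!r}")
    if is_sparse and preprocess:
        problems.append("preprocess must be False when the input is sparse")
    return problems


# With SciPy installed, is_sparse could come from scipy.sparse.issparse(X_train).
print(check_settings("AUC_weighted", preprocess=False, is_sparse=True))
print(check_settings("auc", preprocess=True, is_sparse=True))
```

An empty list means the settings respect the rules in the table; anything else names the setting to fix before submitting the run.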

In [None]:
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             primary_metric = 'AUC_weighted',
                             iteration_timeout_minutes = 60,
                             iterations = 5,
                             preprocess = False,
                             verbosity = logging.INFO,
                             X = X_train, 
                             y = y_train,
                             X_valid = X_valid, 
                             y_valid = y_valid, 
                             path = project_folder)

Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.
In this example, we specify `show_output = True` to print currently running iterations to the console.

In [None]:
local_run = experiment.submit(automl_config, show_output=True)

In [None]:
local_run

## Results

### Widget for Monitoring Runs

The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

In [None]:
from azureml.widgets import RunDetails
RunDetails(local_run).show() 


### Retrieve All Child Runs
You can also use SDK methods to fetch all the child runs and see individual metrics that we log.

In [None]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics
    
# Sort the columns (one per iteration) by iteration number.
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
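
Once the per-iteration metrics are collected, picking the best iteration for a given metric is a dictionary lookup. The sketch below uses the same `{iteration: {metric: value}}` layout as `metricslist`. The numeric values are made up for illustration, and `best_iteration` is a hypothetical helper, not an SDK method.

```python
def best_iteration(metrics_by_iteration, metric, higher_is_better=True):
    """Return (iteration, value) with the best value of `metric`.

    Iterations that did not log the metric are skipped.
    """
    candidates = {
        iteration: metrics[metric]
        for iteration, metrics in metrics_by_iteration.items()
        if metric in metrics
    }
    if not candidates:
        raise ValueError(f"no iteration logged {metric!r}")
    pick = max if higher_is_better else min
    iteration = pick(candidates, key=candidates.get)
    return iteration, candidates[iteration]


# Illustrative values only; real runs will log different numbers.
example_metrics = {
    0: {"accuracy": 0.81, "AUC_weighted": 0.93},
    1: {"accuracy": 0.84, "AUC_weighted": 0.95},
    2: {"AUC_weighted": 0.91},
}

print(best_iteration(example_metrics, "accuracy"))
print(best_iteration(example_metrics, "AUC_weighted"))
```

With the real data, the call would be `best_iteration(metricslist, 'AUC_weighted')`. For metrics where lower is better, pass `higher_is_better=False`.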

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The model includes the pipeline and any preprocessing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [None]:
best_run, fitted_model = local_run.get_output()

#### Best Model Based on Any Other Metric
Show the run and the model with the highest `accuracy` value:

In [None]:
# lookup_metric = "accuracy"
# best_run, fitted_model = local_run.get_output(metric = lookup_metric)

#### Model from a Specific Iteration
Show the run and the model from iteration 3:

In [None]:
# iteration = 3
# best_run, fitted_model = local_run.get_output(iteration = iteration)

## Test

In [None]:
# Load test data.
from pandas_ml import ConfusionMatrix

data_test = fetch_20newsgroups(subset = 'test', categories = categories,
                               shuffle = True, random_state = 42,
                               remove = remove)

X_test = vectorizer.transform(data_test.data)
y_test = data_test.target

# Test our best pipeline.

y_pred = fitted_model.predict(X_test)
y_pred_strings = [data_test.target_names[i] for i in y_pred]
y_test_strings = [data_test.target_names[i] for i in y_test]

cm = ConfusionMatrix(y_test_strings, y_pred_strings)
print(cm)
cm.plot()
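
If `pandas_ml` is not available in your environment, the counts behind a confusion matrix can be computed directly from the two label lists. The sketch below assumes string labels such as `y_test_strings` and `y_pred_strings`. The sample labels are made up, and `confusion_counts` is a hypothetical helper name.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return (labels, matrix) where matrix[i][j] counts samples whose
    true label is labels[i] and predicted label is labels[j]."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))
    matrix = [[pairs[(t, p)] for p in labels] for t in labels]
    return labels, matrix


# Made-up labels for illustration.
y_true = ["sci.space", "sci.space", "comp.graphics", "alt.atheism"]
y_pred = ["sci.space", "comp.graphics", "comp.graphics", "alt.atheism"]

labels, matrix = confusion_counts(y_true, y_pred)
for label, row in zip(labels, matrix):
    print(f"{label:>15}: {row}")

# Accuracy is the diagonal (correct predictions) over the total count.
correct = sum(matrix[i][i] for i in range(len(labels)))
print("accuracy:", correct / len(y_true))
```

Rows are true labels and columns are predicted labels, the same orientation used by scikit-learn's `sklearn.metrics.confusion_matrix`, which is another option since scikit-learn is already a dependency of this notebook.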