Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Automated Machine Learning
_**Text Classification Using Deep Learning**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Evaluate](#Evaluate)

## Introduction
This notebook demonstrates classification with text data using deep learning in AutoML.

AutoML highlights here include using deep neural networks (DNNs) to create embedded features from text data. Depending on the compute cluster the user provides, AutoML tries Bidirectional Encoder Representations from Transformers (BERT) when GPU compute is used and a Bidirectional Long Short-Term Memory (BiLSTM) network when CPU compute is used, thereby optimizing the choice of DNN for the user's setup.

Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.

An Enterprise workspace is required for this notebook. To learn more about creating an Enterprise workspace or upgrading to an Enterprise workspace from the Azure portal, please visit our [Workspace page](https://docs.microsoft.com/azure/machine-learning/service/concept-workspace#upgrade).

Notebook synopsis:
1. Creating an Experiment in an existing Workspace
2. Configuration and remote run of AutoML for a text dataset (20 Newsgroups dataset from scikit-learn) for classification
3. Registering the best model in the workspace
4. Evaluating the final model on a test set

## Setup

In [None]:
import logging
import os
import shutil

import pandas as pd

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.run import Run
from azureml.widgets import RunDetails
from azureml.core.model import Model 
from helper import run_inference, get_result_df
from azureml.train.automl import AutoMLConfig
from sklearn.datasets import fetch_20newsgroups

As part of the setup you have already created a <b>Workspace</b>. To run AutoML, you also need to create an <b>Experiment</b>. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem.

In [None]:
ws = Workspace.from_config()

# Choose an experiment name.
experiment_name = 'automl-classification-text-dnn'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', None)  # None shows full column contents; -1 is deprecated since pandas 1.0
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

## Set up a compute cluster
This section uses a user-provided compute cluster (named "cpu-dnntext" in this example). If a cluster with this name does not exist in the user's workspace, the code below will create a new one. You can choose the parameters of the cluster as mentioned in the comments.

Whether you provide a CPU or a GPU cluster, AutoML chooses the appropriate DNN for that setup: the BiLSTM text featurizer is included among the candidate featurizers on CPU, and the BERT text featurizer on GPU.
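
As a rough illustration of the CPU/GPU rule described here, the sketch below guesses which text featurizer to expect from a VM size name. `expected_text_featurizer` is a hypothetical helper, not part of the Azure ML SDK, and the "N-series means GPU" check is a naming heuristic assumed for this example only.

```python
def expected_text_featurizer(vm_size: str) -> str:
    """Guess the text DNN this notebook describes for a given Azure VM size.

    Assumption: GPU sizes belong to the N-series (for example STANDARD_NC6),
    so a family name starting with "N" is treated as GPU. This is a naming
    heuristic for illustration, not a check performed by AutoML.
    """
    name = vm_size.upper()
    if name.startswith("STANDARD_"):
        name = name[len("STANDARD_"):]
    return "BERT" if name.startswith("N") else "BiLSTM"


for size in ("STANDARD_D2_V2", "STANDARD_NC6"):
    print(size, "->", expected_text_featurizer(size))
```

With the sizes mentioned in this notebook, the CPU size maps to BiLSTM and the GPU size maps to BERT.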

In [None]:
# Choose a name for your cluster.
amlcompute_cluster_name = "cpu-dnntext"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets
if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing compute target.')
    compute_target = cts[amlcompute_cluster_name]

if not found:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2", # CPU for BiLSTM
                                                                # To use BERT, select a GPU such as "STANDARD_NC6" 
                                                                # or similar GPU option
                                                                # available in your workspace
                                                                max_nodes = 6)

    # Create the cluster
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)

print('Checking cluster status...')
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
compute_target.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)

# For a more detailed view of current AmlCompute status, use get_status().

### Get data
For this notebook we will use the 20 Newsgroups data from scikit-learn. We filter the data to contain four classes and take a sample as training data. Please note that more data is needed to improve accuracy. This notebook provides a small-data example so that you can use it as a template for your own, larger datasets.
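
Because the training sample is small, it can be worth confirming that every class still appears in it. The following is a minimal sketch using only the standard library; `class_coverage` and the sample labels are hypothetical and stand in for the label column of your own data.

```python
from collections import Counter


def class_coverage(labels, expected_classes):
    """Return per-class counts and the set of expected classes with no rows."""
    counts = Counter(labels)
    missing = set(expected_classes) - set(counts)
    return counts, missing


# Hypothetical labels for a small sample drawn from four classes (0 to 3).
sample_labels = [0, 1, 1, 2, 2, 2, 0, 1]
counts, missing = class_coverage(sample_labels, expected_classes=range(4))
print(dict(counts))
print(sorted(missing))
```

A non-empty `missing` set suggests taking a larger or stratified sample before training.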

In [None]:
data_dir = "text-dnn-data" # Local directory to store data
blobstore_datadir = data_dir # Blob store directory to store data in
target_column_name = 'y'
feature_column_name = 'X'

def get_20newsgroups_data():
    '''Fetches 20 Newsgroups data from scikit-learn
       Returns them in form of pandas dataframes
    '''
    remove = ('headers', 'footers', 'quotes')
    categories = [
        'alt.atheism',
        'talk.religion.misc',
        'comp.graphics',
        'sci.space',
        ]

    data = fetch_20newsgroups(subset = 'train', categories = categories,
                                    shuffle = True, random_state = 42,
                                    remove = remove)
    data = pd.DataFrame({feature_column_name: data.data, target_column_name: data.target})

    data_train = data[:200]
    data_test = data[200:300]    

    data_train = remove_blanks_20news(data_train, feature_column_name, target_column_name)
    data_test = remove_blanks_20news(data_test, feature_column_name, target_column_name)
    
    return data_train, data_test
    
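def remove_blanks_20news(data, feature_column_name, target_column_name):
    '''Replaces newlines with spaces, strips whitespace and drops empty documents'''
    # Work on a copy so the slice passed in is not modified in place
    # (this avoids pandas' SettingWithCopyWarning).
    data = data.copy()
    data[feature_column_name] = data[feature_column_name].replace(r'\n', ' ', regex=True).apply(lambda x: x.strip())
    data = data[data[feature_column_name] != '']
    
    return data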
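
To see what the blank-removal step does without pandas, here is a minimal pure-Python sketch of the same idea: newlines become spaces, surrounding whitespace is stripped, and documents that end up empty are dropped. `clean_documents` is a hypothetical name used only for illustration.

```python
def clean_documents(rows):
    """Clean (text, label) pairs and drop those whose text is empty afterwards."""
    cleaned = []
    for text, label in rows:
        text = text.replace("\n", " ").strip()
        if text:  # keep only documents that still contain some text
            cleaned.append((text, label))
    return cleaned


# Hypothetical rows: the second one contains nothing but newlines.
rows = [("first line\nsecond line", 0), ("\n\n", 1), ("  padded  ", 2)]
print(clean_documents(rows))
```

Documents can end up empty here because headers, footers and quotes are removed when the data is fetched.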

Fetch the data and upload it to the datastore for use in training.

In [None]:
data_train, data_test = get_20newsgroups_data()

if not os.path.isdir(data_dir):
    os.mkdir(data_dir)
    
train_data_fname = data_dir + '/train_data.csv'
test_data_fname = data_dir + '/test_data.csv'

data_train.to_csv(train_data_fname, index=False)
data_test.to_csv(test_data_fname, index=False)

datastore = ws.get_default_datastore()
datastore.upload(src_dir=data_dir, target_path=blobstore_datadir,
                    overwrite=True)

In [None]:
train_dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, blobstore_datadir + '/train_data.csv')])

### Prepare AutoML run

This step requires an Enterprise workspace to gain access to this feature. To learn more about creating an Enterprise workspace or upgrading to an Enterprise workspace from the Azure portal, please visit our [Workspace page](https://docs.microsoft.com/azure/machine-learning/service/concept-workspace#upgrade).

In [None]:
automl_settings = {
    "experiment_timeout_minutes": 20,
    "primary_metric": 'accuracy',
    "max_concurrent_iterations": 4, 
    "max_cores_per_iteration": -1,
    "enable_dnn": True,
    "enable_early_stopping": True,
    "validation_size": 0.3,
    "verbosity": logging.INFO,
    "enable_voting_ensemble": False,
    "enable_stack_ensemble": False,
}

automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             compute_target=compute_target,
                             training_data=train_dataset,
                             label_column_name=target_column_name,
                             **automl_settings
                            )
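
As a rough illustration of what `"validation_size": 0.3` implies, the sketch below computes how many rows would be held out for validation. `holdout_sizes` is a hypothetical helper, and the rounding AutoML applies internally is an assumption here.

```python
def holdout_sizes(n_rows: int, validation_size: float):
    """Split n_rows into (train, validation) counts for a holdout fraction."""
    if not 0.0 < validation_size < 1.0:
        raise ValueError("validation_size must be strictly between 0 and 1")
    # Assumption: the validation count is rounded to the nearest integer.
    n_validation = round(n_rows * validation_size)
    return n_rows - n_validation, n_validation


print(holdout_sizes(200, 0.3))  # (140, 60)
```

The training file in this notebook can contain fewer than 200 rows once blank documents are removed, so the actual counts may be slightly smaller.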

#### Submit AutoML Run

In [None]:
automl_run = experiment.submit(automl_config, show_output=True)

In [None]:
automl_run

Displaying the run objects gives you links to the visual tools in the Azure Portal. Go try them!

### Retrieve the Best Model
Below we select the best model pipeline from our iterations and use it to score the test data on the same compute cluster.

You can also test the model locally to get a feel for the input/output. This step may require additional package installations such as PyTorch.

In [None]:
#best_run, fitted_model = automl_run.get_output()

### Deploying the model
Get the result statistics, extract the best model from the AutoML run, then download and register it.

In [None]:
summary_df = get_result_df(automl_run)
best_dnn_run_id = summary_df['run_id'].iloc[0]
best_dnn_run = Run(experiment, best_dnn_run_id)
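
The cell above takes the first row of the summary table returned by `get_result_df`, which comes from the `helper` module and is not shown here. As a hedged sketch of the selection idea only, the code below picks the run with the highest metric from plain records. `best_run_id` and the record contents are hypothetical; only the `run_id` field name is taken from the notebook.

```python
def best_run_id(records, metric="accuracy"):
    """Return the run_id of the record with the highest value for `metric`."""
    scored = [r for r in records if r.get(metric) is not None]
    if not scored:
        raise ValueError("no record has a value for metric %r" % metric)
    return max(scored, key=lambda r: r[metric])["run_id"]


# Hypothetical child-run summaries; one run has no score yet.
records = [
    {"run_id": "run_0", "accuracy": 0.71},
    {"run_id": "run_1", "accuracy": 0.83},
    {"run_id": "run_2", "accuracy": None},
]
print(best_run_id(records))
```

Selecting by metric value rather than by position does not depend on how the summary is sorted.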

In [None]:
model_dir = 'Model' # Local folder where the model will be stored temporarily
if not os.path.isdir(model_dir):
    os.mkdir(model_dir)
    
best_dnn_run.download_file('outputs/model.pkl', model_dir + '/model.pkl')

Register the model in your Azure Machine Learning Workspace. If you previously registered a model, please make sure to delete it so as to replace it with this new model.

In [None]:
# Register the model
model_name = 'textDNN-20News'
model = Model.register(model_path = model_dir + '/model.pkl',
                       model_name = model_name,
                       tags=None,
                       workspace=ws)

## Evaluate on Test Data

We now use the best fitted model from the AutoML Run to make predictions on the test set.  

The test set schema should match that of the training set.
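
One simple way to check this before uploading is to compare the header rows of the two CSV files. The sketch below uses only the standard library; `csv_columns` and `same_schema` are hypothetical helpers, and the in-memory files stand in for `train_data.csv` and `test_data.csv`.

```python
import csv
import io


def csv_columns(handle):
    """Read only the header row of a CSV file-like object."""
    return next(csv.reader(handle), [])


def same_schema(train_handle, test_handle):
    """Return True when both files have the same column names in the same order."""
    return csv_columns(train_handle) == csv_columns(test_handle)


# In-memory stand-ins for the two files, using the notebook's column names.
train_csv = io.StringIO("X,y\nsome text,0\n")
test_csv = io.StringIO("X,y\nother text,1\n")
print(same_schema(train_csv, test_csv))
```

For files on disk, pass handles opened with `open(path, newline='')` instead of the `io.StringIO` objects.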

In [None]:
test_dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, blobstore_datadir + '/test_data.csv')])

# preview the first 3 rows of the dataset
test_dataset.take(3).to_pandas_dataframe()

In [None]:
test_experiment = Experiment(ws, experiment_name + "_test")

In [None]:
script_folder = os.path.join(os.getcwd(), 'inference')
os.makedirs(script_folder, exist_ok=True)
shutil.copy2('infer.py', script_folder)

In [None]:
test_run = run_inference(test_experiment, compute_target, script_folder, best_dnn_run, test_dataset,
                 target_column_name, model_name)

Display computed metrics

In [None]:
test_run

In [None]:
RunDetails(test_run).show()

In [None]:
test_run.wait_for_completion()

In [None]:
pd.Series(test_run.get_metrics())
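
The AutoML settings in this notebook use accuracy as the primary metric. Which metrics `infer.py` actually logs is not shown here, so the following is only a minimal sketch of how accuracy is defined, with a hypothetical `accuracy` function and made-up labels.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for four test documents; three of four predictions match.
print(accuracy([0, 1, 2, 3], [0, 1, 2, 0]))  # 0.75
```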