Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Automated Machine Learning
_**Prepare Data using `azureml.dataprep` for Local Execution**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Results](#Results)
1. [Test](#Test)

## Introduction
In this example we showcase how you can use the `azureml.dataprep` SDK to load and prepare data for AutoML. `azureml.dataprep` can also be used standalone; full documentation can be found [here](https://github.com/Microsoft/PendletonDocs).

Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.

In this notebook you will learn how to:
1. Define data loading and preparation steps in a `Dataflow` using `azureml.dataprep`.
2. Pass the `Dataflow` to AutoML for a local run.
3. Pass the `Dataflow` to AutoML for a remote run.

## Setup

Currently, Data Prep supports only __Ubuntu 16__ and __Red Hat Enterprise Linux 7__. We are working on supporting more Linux distributions.

As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [None]:
import logging

import pandas as pd

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
import azureml.dataprep as dprep
from azureml.train.automl import AutoMLConfig

In [None]:
ws = Workspace.from_config()
 
# choose a name for experiment
experiment_name = 'automl-dataprep-local'
# project folder
project_folder = './sample_projects/automl-dataprep-local'
 
experiment = Experiment(ws, experiment_name)
 
output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
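# `None` means "no column width limit"; the older `-1` value is deprecated in recent pandas releases.
pd.set_option('display.max_colwidth', None)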
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

## Data

In [None]:
# You can use `auto_read_file`, which infers the delimiters and data types of a file.
# The data referenced here is a 1 MB simple random sample of the Chicago Crime data.
# Alternatively, use `read_csv` (with an overridable delimiter) and the `to_*` transformations
# to read the file and convert column types manually.
example_data = 'https://dprepdata.blob.core.windows.net/demo/crime0-random.csv'
dflow = dprep.auto_read_file(example_data).skip(1) # Remove the header row.
dflow.get_profile()

In [None]:
# `Primary Type` is our y data, so we drop the rows where this column is null.
dflow = dflow.drop_nulls('Primary Type')
dflow.head(5)

### Review the Data Preparation Result

You can peek at any range of a Dataflow's result using `skip(i)` and `head(j)`. Doing so evaluates only `j` records for all the steps in the Dataflow, which makes it fast even against large datasets.

`Dataflow` objects are immutable and are composed of a list of data preparation steps. A `Dataflow` object can be branched at any point for further usage.
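
To make the ideas of immutability, branching, and lazy previewing concrete, here is a minimal pure-Python sketch. `ToyDataflow` and its internals are hypothetical and written only for illustration; this is not how `azureml.dataprep` is implemented, and only the method names `skip`, `head`, and `drop_nulls` are borrowed from this notebook.

```python
from itertools import islice


class ToyDataflow:
    """Hypothetical stand-in: an immutable, lazily evaluated list of steps."""

    def __init__(self, source, steps=()):
        self._source = source        # callable returning a fresh iterator of records
        self._steps = tuple(steps)   # a tuple cannot be modified in place

    def _with_step(self, step):
        # Each transformation returns a NEW object; the original is untouched.
        return ToyDataflow(self._source, self._steps + (step,))

    def skip(self, count):
        return self._with_step(lambda rows: islice(rows, count, None))

    def drop_nulls(self, column):
        return self._with_step(
            lambda rows: (r for r in rows if r.get(column) is not None)
        )

    def head(self, count):
        rows = self._source()
        for step in self._steps:
            rows = step(rows)        # chains generators; nothing is read yet
        return list(islice(rows, count))


reads = []  # records which source rows were actually pulled


def source():
    for i in range(1_000_000):
        reads.append(i)
        yield {"id": i, "Primary Type": None if i % 2 else "THEFT"}


base = ToyDataflow(source)
cleaned = base.drop_nulls("Primary Type")   # a branch; `base` still has no steps
preview = cleaned.skip(1).head(2)

print([row["id"] for row in preview])
print(len(reads), "of 1,000,000 source rows were read")
```

Because every step is a generator, `head(2)` pulls only as many source rows as it needs, and `base` can still be branched in a different direction afterwards.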

In [None]:
X = dflow.drop_columns(columns=['Primary Type', 'FBI Code'])
y = dflow.keep_columns(columns=['Primary Type'], validate_column_exists=True)

## Train

The cell below defines a general AutoML settings dictionary that applies to both local and remote runs.

In [None]:
automl_settings = {
    "iteration_timeout_minutes": 10,
    "iterations": 2,
    "primary_metric": 'AUC_weighted',
    "preprocess": True,
    "verbosity": logging.INFO
}

### Pass Data with `Dataflow` Objects

The `Dataflow` objects captured above can be passed to `AutoMLConfig` as `X` and `y`, and the resulting configuration is then handed to the `submit` method for a local run. AutoML will retrieve the results from the `Dataflow` for model training.

In [None]:
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             X = X,
                             y = y,
                             **automl_settings)

In [None]:
local_run = experiment.submit(automl_config, show_output = True)

In [None]:
local_run

## Results

#### Widget for Monitoring Runs

The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

In [None]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()

#### Retrieve All Child Runs
You can also use SDK methods to fetch all the child runs and see individual metrics that we log.

In [None]:
children = list(local_run.get_children())
metricslist = {}
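for run in children:
    properties = run.get_properties()
    # Keep only scalar (float) metrics so they fit in a table.
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

# Sort the columns (one per iteration) by iteration number.
rundata = pd.DataFrame(metricslist).sort_index(axis=1)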
rundata

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [None]:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

#### Best Model Based on Any Other Metric
Show the run and the model that has the smallest `log_loss` value:

In [None]:
lookup_metric = "log_loss"
best_run, fitted_model = local_run.get_output(metric = lookup_metric)
print(best_run)
print(fitted_model)
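
As a rough illustration of what "best by another metric" means, the selection can be sketched over a dictionary shaped like the `metricslist` built in the child-runs cell. The helper `best_iteration` is hypothetical and the numbers are made up for the example; this is not how `get_output` is implemented.

```python
def best_iteration(metrics_by_iteration, metric, lower_is_better):
    """Return the iteration with the best value of `metric`.

    Runs that did not log the metric are ignored.
    """
    candidates = {
        iteration: metrics[metric]
        for iteration, metrics in metrics_by_iteration.items()
        if metric in metrics
    }
    if not candidates:
        raise ValueError(f"no run logged {metric!r}")
    choose = min if lower_is_better else max
    return choose(candidates, key=candidates.get)


# Made-up values for illustration only; they are not results from this notebook.
example_metrics = {
    0: {"AUC_weighted": 0.81, "log_loss": 1.42},
    1: {"AUC_weighted": 0.86, "log_loss": 1.57},
}

print(best_iteration(example_metrics, "log_loss", lower_is_better=True))
print(best_iteration(example_metrics, "AUC_weighted", lower_is_better=False))
```

The direction matters: `log_loss` is better when smaller, whereas `AUC_weighted` is better when larger, so the two metrics can point to different iterations.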

#### Model from a Specific Iteration
Show the run and the model from the first iteration:

In [None]:
iteration = 0
best_run, fitted_model = local_run.get_output(iteration = iteration)
print(best_run)
print(fitted_model)

## Test

#### Load Test Data
The test data should go through the same preparation steps as the training data. Otherwise, prediction might fail at the preprocessing step.

In [None]:
dflow_test = dprep.auto_read_file(path='https://dprepdata.blob.core.windows.net/demo/crime0-test.csv').skip(1)
dflow_test = dflow_test.drop_nulls('Primary Type')

#### Testing Our Best Fitted Model
We will use a confusion matrix to see how well our model performs.

In [None]:
from pandas_ml import ConfusionMatrix

y_test = dflow_test.keep_columns(columns=['Primary Type']).to_pandas_dataframe()
X_test = dflow_test.drop_columns(columns=['Primary Type', 'FBI Code']).to_pandas_dataframe()

ypred = fitted_model.predict(X_test)

cm = ConfusionMatrix(y_test['Primary Type'], ypred)

print(cm)

cm.plot()
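
If `pandas_ml` is not available in your environment, the same counts can be computed with the standard library alone. The sketch below is an assumption-laden alternative, not part of the original notebook: `confusion_counts` is a hypothetical helper, and the labels are toy values rather than model output.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Return (labels, matrix) where matrix[i][j] counts rows whose
    actual label is labels[i] and whose predicted label is labels[j]."""
    pairs = Counter(zip(actual, predicted))
    labels = sorted(set(actual) | set(predicted))
    matrix = [[pairs[(a, p)] for p in labels] for a in labels]
    return labels, matrix


# Toy labels for illustration only; replace them with
# y_test['Primary Type'] and ypred from the cell above.
actual = ["THEFT", "THEFT", "BATTERY", "BATTERY", "THEFT"]
predicted = ["THEFT", "BATTERY", "BATTERY", "BATTERY", "THEFT"]

labels, matrix = confusion_counts(actual, predicted)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(actual)

print(labels)
for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", accuracy)
```

Rows correspond to actual classes and columns to predicted classes, so the diagonal holds the correctly classified records.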