
Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Automated Machine Learning
_**Load Data using `TabularDataset` for Local Execution**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Results](#Results)
1. [Test](#Test)

## Introduction
This example shows how to use an Azure ML `Dataset` to load data for AutoML.

Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.

In this notebook you will learn how to:
1. Create a `TabularDataset` pointing to the training data.
2. Pass the `TabularDataset` to AutoML for a local run.

## Setup

As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [None]:
import logging

import pandas as pd

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
from azureml.train.automl import AutoMLConfig

In [None]:
ws = Workspace.from_config()
 
# choose a name for experiment
experiment_name = 'automl-dataset-local'
# project folder
project_folder = './sample_projects/automl-dataset-local'
 
experiment = Experiment(ws, experiment_name)
 
output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated in recent pandas
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

## Data

In [None]:
# The data referenced here is a 1 MB simple random sample of the Chicago Crime data.
example_data = 'https://dprepdata.blob.core.windows.net/demo/crime0-random.csv'
dataset = Dataset.Tabular.from_delimited_files(example_data)
dataset.take(5).to_pandas_dataframe()

### Review the data

You can peek at any range of records in a `TabularDataset` using `skip(i)` and `take(j).to_pandas_dataframe()`. Doing so evaluates only `j` records, which makes it fast even against large datasets.
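
The idea of peeking at a window of records without materializing the whole dataset can be sketched in plain Python. This is only an illustration of lazy evaluation, not how `TabularDataset` is implemented: `read_records` and `skip_take` are hypothetical helpers, and in this sketch the skipped rows are still read and discarded.

```python
from itertools import islice


def read_records(total, evaluated):
    """Hypothetical lazy source: yields one record at a time and logs each read."""
    for i in range(total):
        evaluated.append(i)  # remember that row i was actually produced
        yield {"row": i}


def skip_take(records, skip, take):
    """Return `take` records after the first `skip`, reading nothing beyond them."""
    return list(islice(records, skip, skip + take))


evaluated = []
window = skip_take(read_records(1_000_000, evaluated), skip=10, take=5)

print([record["row"] for record in window])  # the five rows in the window
print(len(evaluated))  # rows read from the source, far fewer than 1,000,000
```

The source never produces the remaining rows, which is why a small `take` stays cheap even when the underlying data is large.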

`TabularDataset` objects are immutable and are composed of an optional list of subsetting transformations.
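
As a rough sketch of what immutability means here, the hypothetical `LazyTable` class below returns a new object from every column operation and leaves the original untouched. The class, its fields, and the `ID` and `Arrest` column names are assumptions made for illustration; they are not part of the Azure ML SDK.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LazyTable:
    """Hypothetical immutable table: column names plus the recorded steps."""
    columns: Tuple[str, ...]
    steps: Tuple[str, ...] = ()

    def drop_columns(self, columns):
        kept = tuple(c for c in self.columns if c not in columns)
        return LazyTable(kept, self.steps + (f"drop {sorted(columns)}",))

    def keep_columns(self, columns):
        missing = [c for c in columns if c not in self.columns]
        if missing:
            raise KeyError(f"unknown columns: {missing}")
        kept = tuple(c for c in self.columns if c in columns)
        return LazyTable(kept, self.steps + (f"keep {sorted(columns)}",))


source = LazyTable(("ID", "Primary Type", "FBI Code", "Arrest"))
features = source.drop_columns(["Primary Type", "FBI Code"])
label = source.keep_columns(["Primary Type"])

print(source.columns)    # unchanged by either call
print(features.columns)
print(label.columns)
```

Because each call returns a fresh object, the same source can safely feed both the feature set and the label set.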

In [None]:
X = dataset.drop_columns(columns=['Primary Type', 'FBI Code'])
y = dataset.keep_columns(columns=['Primary Type'], validate=True)

## Train

The following cell creates a general AutoML settings dictionary that applies to both local and remote runs.

In [None]:
automl_settings = {
    "iteration_timeout_minutes" : 10,
    "iterations" : 2,
    "primary_metric" : 'AUC_weighted',
    "preprocess" : True,
    "verbosity" : logging.INFO
}
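
Because the settings live in an ordinary dictionary, they can be reused across configurations through keyword-argument unpacking. The sketch below illustrates that pattern only: `build_config` is a hypothetical stand-in that collects its keyword arguments, not the `AutoMLConfig` constructor.

```python
import logging

shared_settings = {
    "iteration_timeout_minutes": 10,
    "iterations": 2,
    "primary_metric": "AUC_weighted",
    "verbosity": logging.INFO,
}


def build_config(task, **settings):
    """Hypothetical stand-in for a config constructor; it only gathers its arguments."""
    return {"task": task, **settings}


# The same dictionary is unpacked into keyword arguments for each configuration.
local_config = build_config("classification", **shared_settings)

# A second configuration overrides one key without modifying the shared dictionary.
longer_config = build_config("classification", **{**shared_settings, "iterations": 5})

print(local_config["iterations"], longer_config["iterations"])
```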

### Pass Data with `TabularDataset` Objects

The `TabularDataset` objects created above are passed to `AutoMLConfig`, and the resulting configuration is passed to the `submit` method for a local run. AutoML retrieves the data from the `TabularDataset` objects for model training.

In [None]:
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             X = X,
                             y = y,
                             **automl_settings)

In [None]:
local_run = experiment.submit(automl_config, show_output = True)

In [None]:
local_run

## Results

#### Widget for Monitoring Runs

The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

In [None]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()

#### Retrieve All Child Runs
You can also use SDK methods to fetch all the child runs and see individual metrics that we log.

In [None]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics
    
rundata = pd.DataFrame(metricslist).sort_index(axis=1)  # order columns by iteration number
rundata
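
The collection logic in the cell above can be tried without a workspace. In this sketch, `StubRun` is a hypothetical stand-in that mimics only `get_properties` and `get_metrics`, and every metric value is invented for illustration.

```python
class StubRun:
    """Hypothetical child run exposing only the two methods used below."""

    def __init__(self, iteration, metrics):
        self._properties = {"iteration": str(iteration)}
        self._metrics = metrics

    def get_properties(self):
        return self._properties

    def get_metrics(self):
        return self._metrics


# Invented values; the non-float entry stands for a metric that is not a scalar.
children = [
    StubRun(1, {"AUC_weighted": 0.81, "log_loss": 0.52, "confusion_matrix": "artifact"}),
    StubRun(0, {"AUC_weighted": 0.78, "log_loss": 0.60, "confusion_matrix": "artifact"}),
]

metrics_by_iteration = {}
for run in children:
    iteration = int(run.get_properties()["iteration"])
    # Keep scalar metrics only, so every column holds comparable numbers.
    scalars = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metrics_by_iteration[iteration] = scalars

for iteration in sorted(metrics_by_iteration):
    print(iteration, metrics_by_iteration[iteration])
```

Filtering on `float` is what drops non-scalar entries before the dictionary is turned into a table.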

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. Optional arguments to `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [None]:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

#### Best Model Based on Any Other Metric
Show the run and the model that has the smallest `log_loss` value:

In [None]:
lookup_metric = "log_loss"
best_run, fitted_model = local_run.get_output(metric = lookup_metric)
print(best_run)
print(fitted_model)
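
Choosing the best iteration depends on whether a metric should be maximized or minimized. The sketch below shows that selection on invented scores; the `LOWER_IS_BETTER` set and the `best_iteration` helper are assumptions for illustration, not SDK code.

```python
# Invented scores for three iterations, for illustration only.
scores = {
    0: {"AUC_weighted": 0.78, "log_loss": 0.60},
    1: {"AUC_weighted": 0.81, "log_loss": 0.52},
    2: {"AUC_weighted": 0.80, "log_loss": 0.49},
}

# Assumption for this sketch: loss-style metrics improve as they shrink.
LOWER_IS_BETTER = {"log_loss"}


def best_iteration(scores, metric):
    """Return the iteration with the best value for `metric`."""
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(scores, key=lambda iteration: scores[iteration][metric])


print(best_iteration(scores, "AUC_weighted"))
print(best_iteration(scores, "log_loss"))
```

The two metrics can disagree about which iteration is best, which is why looking up a model by a secondary metric can return a different run.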

#### Model from a Specific Iteration
Show the run and the model from the first iteration:

In [None]:
iteration = 0
best_run, fitted_model = local_run.get_output(iteration = iteration)
print(best_run)
print(fitted_model)

## Test

#### Load Test Data
The test data must go through the same preparation steps as the training data. Otherwise, prediction may fail at the preprocessing step.

In [None]:
dataset_test = Dataset.Tabular.from_delimited_files(path='https://dprepdata.blob.core.windows.net/demo/crime0-test.csv')

df_test = dataset_test.to_pandas_dataframe()
df_test = df_test[pd.notnull(df_test['Primary Type'])]

y_test = df_test[['Primary Type']]
X_test = df_test.drop(['Primary Type', 'FBI Code'], axis=1)
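
One way to keep the preparation identical is to route both splits through a single function. The sketch below does this with plain dictionaries instead of DataFrames; `prepare` is a hypothetical helper and the rows are invented.

```python
def prepare(rows, label, dropped):
    """Drop rows whose label is missing, then separate features from the label."""
    labeled = [row for row in rows if row.get(label) is not None]
    features = [
        {key: value for key, value in row.items() if key != label and key not in dropped}
        for row in labeled
    ]
    labels = [row[label] for row in labeled]
    return features, labels


# Illustrative rows only; the values are invented.
train_rows = [
    {"ID": 1, "Primary Type": "THEFT", "FBI Code": "06", "Arrest": False},
    {"ID": 2, "Primary Type": None, "FBI Code": "08B", "Arrest": True},
]
test_rows = [
    {"ID": 3, "Primary Type": "BATTERY", "FBI Code": "08B", "Arrest": True},
]

X_train, y_train = prepare(train_rows, "Primary Type", {"FBI Code"})
X_test, y_test = prepare(test_rows, "Primary Type", {"FBI Code"})

# Both splits end up with the same feature columns.
print(sorted(X_train[0]) == sorted(X_test[0]))
```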

#### Testing Our Best Fitted Model
We will use a confusion matrix to see how well the model performs on the test data.

In [None]:
from pandas_ml import ConfusionMatrix

ypred = fitted_model.predict(X_test)

cm = ConfusionMatrix(y_test['Primary Type'], ypred)

print(cm)

cm.plot()
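
To make clear what the confusion matrix counts, here is a minimal sketch that uses only the standard library. The label values and predictions are invented, and `confusion_counts` is a hypothetical helper rather than part of `pandas_ml`.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


# Invented labels and predictions, for illustration only.
actual = ["THEFT", "THEFT", "BATTERY", "BATTERY", "NARCOTICS", "THEFT"]
predicted = ["THEFT", "BATTERY", "BATTERY", "BATTERY", "THEFT", "THEFT"]

counts = confusion_counts(actual, predicted)
labels = sorted(set(actual) | set(predicted))

# One row per actual label, one column per predicted label.
for a in labels:
    print(a.ljust(10), [counts[(a, p)] for p in labels])

# Diagonal entries are correct predictions.
correct = sum(n for (a, p), n in counts.items() if a == p)
accuracy = correct / len(actual)
print(f"accuracy: {accuracy:.2f}")
```

Off-diagonal cells show which classes the model confuses with each other, which a single accuracy figure hides.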