Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.


# Automated Machine Learning 
**Continuous retraining using Pipelines and Time-Series TabularDataset**
## Contents
1. [Introduction](#Introduction)
2. [Setup](#Setup)
3. [Compute](#Compute)
4. [Run Configuration](#Run-Configuration)
5. [Data Ingestion Pipeline](#Data-Ingestion-Pipeline)
6. [Training Pipeline](#Training-Pipeline)
7. [Publish Retraining Pipeline and Schedule](#Publish-Retraining-Pipeline-and-Schedule)
8. [Test Retraining](#Test-Retraining)

## Introduction
In this example we use AutoML and Pipelines to enable continuous retraining of a model based on updates to the training dataset. We create two pipelines. The first demonstrates a training dataset that gets updated over time, using the time-series capabilities of `TabularDataset`. The second uses a pipeline `Schedule` to trigger continuous retraining. 
Make sure you have executed the [configuration notebook](../../../configuration.ipynb) before running this notebook.
In this notebook you will learn how to:
* Create an Experiment in an existing Workspace.
* Configure AutoML using AutoMLConfig.
* Create a data ingestion pipeline to update a time-series based TabularDataset.
* Create a training pipeline to prepare data, run AutoML, register the model, and set up pipeline triggers.

## Setup
As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [None]:
import logging

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig

This sample notebook may use features that are not available in previous versions of the Azure ML SDK.
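
If you want to turn this note into an explicit check, one option is to compare version numbers as tuples rather than as strings (a string comparison would rank `1.9.0` above `1.21.0`). The sketch below is only an illustration: `parse_version` and `is_at_least` are hypothetical helper names, not part of the Azure ML SDK, and the parsing deliberately ignores suffixes such as `.post1`.

```python
def parse_version(version: str) -> tuple:
    """Turn a string such as '1.21.0' into (1, 21, 0).

    Parsing stops at the first piece that does not start with a digit,
    and the result is padded with zeros to at least three components.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if ch not in "0123456789":
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_at_least(installed: str, required: str) -> bool:
    """Return True when the installed version is not older than the required one."""
    return parse_version(installed) >= parse_version(required)
```

In the notebook, the installed version string is available as `azureml.core.VERSION`, so a call such as `is_at_least(azureml.core.VERSION, "1.21.0")` would express the check.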

In [None]:
print("This notebook was created using version 1.21.0 of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

Accessing the Azure ML workspace requires authentication with Azure.

The default authentication is interactive authentication using the default tenant. Executing the `ws = Workspace.from_config()` line in the cell below will prompt for authentication the first time it is run.

If you have multiple Azure tenants, you can specify the tenant by replacing the `ws = Workspace.from_config()` line in the cell below with the following:
```
from azureml.core.authentication import InteractiveLoginAuthentication
auth = InteractiveLoginAuthentication(tenant_id = 'mytenantid')
ws = Workspace.from_config(auth = auth)
```
If you need to run in an environment where interactive login is not possible, you can use Service Principal authentication by replacing the `ws = Workspace.from_config()` line in the cell below with the following:
```
from azureml.core.authentication import ServicePrincipalAuthentication
auth = ServicePrincipalAuthentication('mytenantid', 'myappid', 'mypassword')
ws = Workspace.from_config(auth = auth)
```
For more details, see aka.ms/aml-notebook-auth
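
When using Service Principal authentication, you may prefer not to hard-code the credentials in the notebook. A minimal sketch, assuming the values are supplied through environment variables, is shown below. The variable names (`AML_TENANT_ID`, `AML_SP_APP_ID`, `AML_SP_PASSWORD`) and the helper `load_sp_settings` are hypothetical; only the keyword names `tenant_id`, `service_principal_id` and `service_principal_password` correspond to the `ServicePrincipalAuthentication` constructor.

```python
import os
from typing import Dict, Mapping

# Hypothetical environment variable names mapped to the constructor's
# keyword arguments; rename them to match your own deployment.
ENV_TO_ARGUMENT = {
    "AML_TENANT_ID": "tenant_id",
    "AML_SP_APP_ID": "service_principal_id",
    "AML_SP_PASSWORD": "service_principal_password",
}


def load_sp_settings(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect service principal settings, failing early if any are missing."""
    missing = [name for name in ENV_TO_ARGUMENT if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {arg: environ[name] for name, arg in ENV_TO_ARGUMENT.items()}


# Possible usage in the notebook (requires the Azure ML SDK):
# from azureml.core.authentication import ServicePrincipalAuthentication
# auth = ServicePrincipalAuthentication(**load_sp_settings())
# ws = Workspace.from_config(auth=auth)
```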

In [None]:
ws = Workspace.from_config()
dstor = ws.get_default_datastore()

# Choose a name for the run history container in the workspace.
experiment_name = 'retrain-noaaweather'
experiment = Experiment(ws, experiment_name)

output = {}
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Run History Name'] = experiment_name
# None means "no limit"; the older value -1 is deprecated in recent pandas releases.
pd.set_option('display.max_colwidth', None)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

## Compute 

#### Create or Attach existing AmlCompute

You will need to create a compute target for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.
#### Creation of AmlCompute takes approximately 5 minutes. 
If an AmlCompute cluster with that name already exists in your workspace, this code skips the creation process.
As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
amlcompute_cluster_name = "cont-cluster"

# Reuse the cluster if it already exists; otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

## Run Configuration

In [None]:
from azureml.core.runconfig import CondaDependencies, RunConfiguration

# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set compute target to AmlCompute
conda_run_config.target = compute_target

conda_run_config.environment.docker.enabled = True

cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]', 'applicationinsights', 'azureml-opendatasets', 'azureml-defaults'], 
                              conda_packages=['numpy==1.16.2'], 
                              pin_sdk_version=False)
conda_run_config.environment.python.conda_dependencies = cd

print('run config is ready')

## Data Ingestion Pipeline 
For this demo, we will use NOAA weather data from [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/). You can replace this with your own dataset, or you can skip this pipeline if you already have a time-series based `TabularDataset`.


In [None]:
# The name and target column of the Dataset to create 
dataset = "NOAA-Weather-DS4"
target_column_name = "temperature"


### Upload Data Step
The data ingestion pipeline has a single step with a script to query the latest weather data and upload it to the blob store. During the first run, the script will create and register a time-series based `TabularDataset` with the past one week of weather data. For each subsequent run, the script will create a partition in the blob store by querying NOAA for new weather data since the last modified time of the dataset (`dataset.data_changed_time`) and creating a data.csv file.
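
The script `upload_weather_data.py` itself is not shown in this notebook. To make the description above concrete, here is a rough sketch of the two decisions it has to make: which time window to query, and where to write the new partition. The helper names `ingestion_window` and `partition_path`, and the date-based folder layout, are assumptions for illustration and may differ from the actual script.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def ingestion_window(last_changed: Optional[datetime],
                     now: datetime,
                     initial_days: int = 7) -> Tuple[datetime, datetime]:
    """Return the (start, end) time range to query for new weather data.

    On the first run there is no registered dataset yet (last_changed is None),
    so the window covers the past `initial_days` days. On later runs the window
    starts at the dataset's last modified time.
    """
    if last_changed is None:
        return now - timedelta(days=initial_days), now
    return last_changed, now


def partition_path(ds_name: str, run_time: datetime) -> str:
    """Build a blob path for one partition (assumed layout: name/Y/m/d/H/M/S/data.csv)."""
    return f"{ds_name}/{run_time:%Y/%m/%d/%H/%M/%S}/data.csv"
```

For example, a first run on 8 January 2021 would query from 1 January to 8 January, while a later run would query only from the dataset's `data_changed_time` onward.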

In [None]:
from azureml.pipeline.core import Pipeline, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

ds_name = PipelineParameter(name="ds_name", default_value=dataset)
upload_data_step = PythonScriptStep(script_name="upload_weather_data.py", 
                                         allow_reuse=False,
                                         name="upload_weather_data",
                                         arguments=["--ds_name", ds_name],
                                         compute_target=compute_target, 
                                         runconfig=conda_run_config)

### Submit Pipeline Run

In [None]:
data_pipeline = Pipeline(
    description="pipeline_with_uploaddata",
    workspace=ws,    
    steps=[upload_data_step])
data_pipeline_run = experiment.submit(data_pipeline, pipeline_parameters={"ds_name":dataset})

In [None]:
data_pipeline_run.wait_for_completion(show_output=False)

## Training Pipeline
### Prepare Training Data Step

This step runs a script that checks whether new data has become available since the model was last trained. If there is no new data, we cancel the remaining pipeline steps. We set the `allow_reuse` flag to `False` so that the step reruns even when its inputs have not changed. The script also needs the name of the model so that it can look up when the model was last trained.
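
The core of this check is a timestamp comparison. The sketch below illustrates one way to write it, assuming the script can obtain the dataset's last modified time and the time the model was last trained (or `None` if no model is registered yet). The function names are hypothetical and the actual `check_data.py` may be structured differently. Timestamps are normalized to UTC first, because Python raises a `TypeError` when a naive datetime is compared with a timezone-aware one.

```python
from datetime import datetime, timezone
from typing import Optional


def as_utc(ts: datetime) -> datetime:
    """Treat naive timestamps as UTC and convert aware ones to UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def needs_retraining(data_changed_time: datetime,
                     last_train_time: Optional[datetime]) -> bool:
    """Return True when the dataset changed after the model was last trained."""
    if last_train_time is None:
        # No registered model yet, so the first training run should proceed.
        return True
    return as_utc(data_changed_time) > as_utc(last_train_time)
```

When the function returns `False`, the script would cancel the rest of the pipeline instead of letting the AutoML step run.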

In [None]:
from azureml.pipeline.core import PipelineData

# The model name with which to register the trained model in the workspace.
model_name = PipelineParameter("model_name", default_value="noaaweatherds")

In [None]:
data_prep_step = PythonScriptStep(script_name="check_data.py", 
                                         allow_reuse=False,
                                         name="check_data",
                                         arguments=["--ds_name", ds_name,
                                                    "--model_name", model_name],
                                         compute_target=compute_target, 
                                         runconfig=conda_run_config)

In [None]:
from azureml.core import Dataset
train_ds = Dataset.get_by_name(ws, dataset)
train_ds = train_ds.drop_columns(["partition_date"])

### AutoMLStep
Create an AutoMLConfig and a training step.

In [None]:
from azureml.train.automl import AutoMLConfig
from azureml.pipeline.steps import AutoMLStep

automl_settings = {
    "iteration_timeout_minutes": 10,
    "experiment_timeout_hours": 0.25,
    "n_cross_validations": 3,
    "primary_metric": 'r2_score',
    "max_concurrent_iterations": 3,
    "max_cores_per_iteration": -1,
    "verbosity": logging.INFO,
    "enable_early_stopping": True
}

automl_config = AutoMLConfig(task = 'regression',
                             debug_log = 'automl_errors.log',
                             path = ".",
                             compute_target=compute_target,
                             training_data = train_ds,
                             label_column_name = target_column_name,
                             **automl_settings
                            )

In [None]:
from azureml.pipeline.core import PipelineData, TrainingOutput

metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=dstor,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=dstor,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

In [None]:
automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    allow_reuse=False)

### Register Model Step
Script to register the model to the workspace. 
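
The step defined in this section passes `--model_name`, `--model_path` and `--ds_name` to the script; at run time the pipeline substitutes the parameter values and the location of the model output. The sketch below shows how a script could read those arguments with `argparse`. It is an illustration only: `parse_args` is a hypothetical helper and the real `register_model.py` is not reproduced here.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the arguments that the register-model step passes to its script."""
    parser = argparse.ArgumentParser(description="Register a trained model.")
    parser.add_argument("--model_name", required=True,
                        help="Name under which the model is registered")
    parser.add_argument("--model_path", required=True,
                        help="Location of the model produced by the training step")
    parser.add_argument("--ds_name", required=True,
                        help="Name of the dataset the model was trained on")
    # argv=None makes argparse read sys.argv, which is what a pipeline run uses.
    return parser.parse_args(argv)
```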

In [None]:
register_model_step = PythonScriptStep(script_name="register_model.py",
                                       name="register_model",
                                       allow_reuse=False,
                                       arguments=["--model_name", model_name, "--model_path", model_data, "--ds_name", ds_name],
                                       inputs=[model_data],
                                       compute_target=compute_target,
                                       runconfig=conda_run_config)

### Submit Pipeline Run

In [None]:
training_pipeline = Pipeline(
    description="training_pipeline",
    workspace=ws,    
    steps=[data_prep_step, automl_step, register_model_step])

In [None]:
training_pipeline_run = experiment.submit(training_pipeline, pipeline_parameters={
        "ds_name": dataset, "model_name": "noaaweatherds"})

In [None]:
training_pipeline_run.wait_for_completion(show_output=False)

## Publish Retraining Pipeline and Schedule
Once we are happy with the pipeline, we can publish the training pipeline to the workspace and create a schedule to trigger on blob change. The schedule polls the blob store where the data is being uploaded and runs the retraining pipeline if there is a data change. A new version of the model will be registered to the workspace once the run is complete.

In [None]:
pipeline_name = "Retraining-Pipeline-NOAAWeather"

published_pipeline = training_pipeline.publish(
    name=pipeline_name, 
    description="Pipeline that retrains AutoML model")

published_pipeline

In [None]:
from azureml.pipeline.core import Schedule
schedule = Schedule.create(workspace=ws, name="RetrainingSchedule",
                           pipeline_parameters={"ds_name": dataset, "model_name": "noaaweatherds"},
                           pipeline_id=published_pipeline.id, 
                           experiment_name=experiment_name, 
                           datastore=dstor,
                           wait_for_provisioning=True,
                           polling_interval=1440)

## Test Retraining
Here we set up the data ingestion pipeline to run on a schedule, to verify that the retraining pipeline runs as expected. 

Note: 
* Azure NOAA Weather data is updated daily and retraining will not trigger if there is no new data available. 
* Depending on the polling interval set in the schedule, the retraining may take some time to trigger after the data ingestion pipeline completes.
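
To get a feel for the delay mentioned in the second note, recall that `polling_interval` is given in minutes, so the value 1440 used in this notebook corresponds to one poll per day. The sketch below estimates when a change would be noticed under a deliberately simplified model: it assumes polls happen at fixed intervals counted from the moment the schedule starts. That is an assumption for illustration; the service's actual timing may differ, and `next_poll_after` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def next_poll_after(schedule_start: datetime,
                    upload_time: datetime,
                    polling_interval_minutes: int) -> datetime:
    """Return the first poll at or after upload_time under the fixed-interval model."""
    interval = timedelta(minutes=polling_interval_minutes)
    if upload_time <= schedule_start:
        return schedule_start
    elapsed = upload_time - schedule_start
    # Ceiling division on timedeltas: number of whole intervals needed to reach upload_time.
    intervals = -((-elapsed) // interval)
    return schedule_start + intervals * interval
```

Under this model, data uploaded six hours after a daily schedule started would be picked up eighteen hours later, at the next poll.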

In [None]:
pipeline_name = "DataIngestion-Pipeline-NOAAWeather"

published_pipeline = data_pipeline.publish(
    name=pipeline_name, 
    description="Pipeline that updates NOAAWeather Dataset")

published_pipeline

In [None]:
from azureml.pipeline.core import Schedule
schedule = Schedule.create(workspace=ws, name="RetrainingSchedule-DataIngestion",
                           pipeline_parameters={"ds_name":dataset},
                           pipeline_id=published_pipeline.id, 
                           experiment_name=experiment_name, 
                           datastore=dstor,
                           wait_for_provisioning=True,
                           polling_interval=1440)