Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Azure Machine Learning Pipeline with NotebookRunnerStep
This notebook demonstrates `NotebookRunnerStep`, which lets you run a local notebook as a step in an Azure Machine Learning pipeline.

## Introduction
In this example we show how to run another notebook, `notebook_runner/training_notebook.ipynb`, as a step in an Azure Machine Learning pipeline.

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Create or attach an existing `AmlCompute` cluster to a workspace.
3. Configure the notebook run using `NotebookRunConfig`.
4. Use `NotebookRunnerStep`.
5. Run the notebook on `AmlCompute` as a pipeline step that consumes the output of a Python script step.

Advantages of running your notebook as a step in a pipeline:
1. Run your notebook like a Python script without converting it to a `.py` file, while using the complete end-to-end experience of Azure Machine Learning Pipelines.
2. Pass pipeline intermediate data into and out of the notebook, alongside the other steps in the pipeline.
3. Parameterize your notebook with [Pipeline Parameters](./aml-pipelines-publish-and-run-using-rest-endpoint.ipynb).


## Azure Machine Learning and Pipeline SDK-specific imports

In [None]:
import os
import requests
import tempfile

import azureml.core

from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.runconfig import RunConfiguration
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import PipelineData
from azureml.core.datastore import Datastore

from azureml.core import Workspace, Experiment
from azureml.contrib.notebook import NotebookRunConfig, AzureMLNotebookHandler

from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep
from azureml.contrib.notebook import NotebookRunnerStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

### Initialize Workspace

Initialize a [workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace(class%29) object from persisted configuration.

In [None]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')
ws.set_default_datastore("workspaceblobstore")

### Upload data to datastore

In [None]:
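# Download the data file from remote storage.
response = requests.get("https://dprepdata.blob.core.windows.net/demo/Titanic.csv")
response.raise_for_status()  # fail early if the download did not succeed
titanic_file = os.path.join(tempfile.mkdtemp(), "Titanic.csv")
# Write the raw bytes so the file is uploaded exactly as it was downloaded.
with open(titanic_file, "wb") as f:
    f.write(response.content)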
Datastore.get(ws, "workspaceblobstore").upload_files([titanic_file], target_path="titanic", overwrite=True)
print("Upload call completed")

## Create an Azure ML experiment
Let's create an experiment named "notebook-step-run-example" and specify the folder that holds the notebook and other scripts. The script runs will be recorded under the experiment in Azure.

The best practice is to use a separate folder for each step's scripts and dependent files, and to specify that folder as the step's `source_directory`. This reduces the size of the snapshot created for the step, because only that folder is snapshotted. Since a change to any file in the `source_directory` triggers a re-upload of the snapshot, keeping folders separate also preserves step reuse when nothing in a step's `source_directory` has changed.
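
Before submitting, it can help to see how much a step's `source_directory` would contribute to a snapshot. The following is a minimal sketch using only the Python standard library. The helper name `snapshot_summary` is hypothetical and is not part of the Azure ML SDK, and the sketch does not account for ignore files such as `.amlignore`.

```python
import os
import tempfile


def snapshot_summary(directory):
    """Return (file_count, total_bytes) for every file under `directory`."""
    file_count, total_bytes = 0, 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(root, name))
    return file_count, total_bytes


# Demo on a throwaway folder so the sketch runs anywhere.
# In this notebook you would call snapshot_summary(source_directory) instead.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "train.py"), "w") as f:
        f.write("print('hello')\n")
    demo_count, demo_bytes = snapshot_summary(demo_dir)
    print(demo_count, "file(s),", demo_bytes, "bytes")
```

A large count or size is a hint that unrelated files share the folder and should be moved out of the step's `source_directory`.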

In [None]:
# Choose a name for the run history container in the workspace.
experiment_name = 'notebook-step-run-example'
source_directory = 'notebook_runner'

experiment = Experiment(ws, experiment_name)
experiment

### Create or Attach an AmlCompute cluster
You will need to create a [compute target](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.computetarget?view=azure-ml-py) for your remote run. In this tutorial, you use an `AmlCompute` cluster named `cpu-cluster` as your training compute resource, creating it if it does not already exist.

> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

In [None]:
# Choose a name for your cluster.
amlcompute_cluster_name = "cpu-cluster"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets
if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing compute target.')
    compute_target = cts[amlcompute_cluster_name]
    
if not found:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2", # for GPU, use "STANDARD_NC6"
                                                                #vm_priority = 'lowpriority', # optional
                                                                max_nodes = 4)

    # Create the cluster.
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min_node_count is provided, it will use the scale settings for the cluster.
    compute_target.wait_for_completion(show_output = True, min_node_count = 1, timeout_in_minutes = 10)
    
    # For a more detailed view of current AmlCompute status, use get_status().

### Create a new RunConfig object

In [None]:
from azureml.core.conda_dependencies import CondaDependencies

conda_run_config = RunConfiguration(framework="python")

conda_run_config.environment.docker.enabled = True
conda_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE

cd = CondaDependencies.create(pip_packages=['azureml-sdk'])
conda_run_config.environment.python.conda_dependencies = cd

print('run config is ready')

### Define input and outputs

In [None]:
input_data = DataReference(
    datastore=Datastore.get(ws, "workspaceblobstore"),
    data_reference_name="blob_test_data",
    path_on_datastore="titanic/Titanic.csv")

output_data = PipelineData(name="processed_data",
                           datastore=Datastore.get(ws, "workspaceblobstore"))

### Create notebook run configuration and set parameter values
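
The contents of `training_notebook.ipynb` are not shown in this notebook. As a rough sketch, and assuming the notebook handler follows the papermill convention of overriding defaults declared in a cell tagged `parameters`, the target notebook could declare `arg1` and `my_pipeline_param` as below. The notebook structure and the helper `parameter_cells` are hypothetical illustrations built with the standard library only.

```python
import json

# Hypothetical minimal notebook (nbformat 4) with one cell tagged "parameters".
# The default values here would be overridden by the values passed at run time.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": ['arg1 = "default value"\n', 'my_pipeline_param = "default"'],
            "outputs": [],
            "execution_count": None,
        },
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["print(arg1, my_pipeline_param)"],
            "outputs": [],
            "execution_count": None,
        },
    ],
}


def parameter_cells(nb):
    """Return the indexes of cells tagged 'parameters'."""
    return [
        index
        for index, cell in enumerate(nb["cells"])
        if "parameters" in cell.get("metadata", {}).get("tags", [])
    ]


param_cell_indexes = parameter_cells(notebook)
notebook_json = json.dumps(notebook, indent=1)  # what would be saved as an .ipynb file
print("Parameter cell indexes:", param_cell_indexes)
```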

In [None]:
handler = AzureMLNotebookHandler(timeout=600, progress_bar=False, log_output=True)

cfg = NotebookRunConfig(source_directory=source_directory, notebook="training_notebook.ipynb",
                        handler = handler,
                        parameters={"arg1": "Machine Learning"},
                        run_config=conda_run_config)

print("Notebook Run Config is created.")

### Define PythonScriptStep
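
The step in this section passes `--input_data` to `train.py`, whose contents are not shown in this notebook. A minimal sketch of how such a script might read that argument is given here. The function name `parse_args` and the example path are assumptions for illustration, not the actual script.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments a hypothetical train.py would receive."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py argument parsing")
    parser.add_argument(
        "--input_data",
        type=str,
        required=True,
        help="Path at which the input data is made available on the compute target",
    )
    return parser.parse_args(argv)


# Example invocation with a made-up path; at run time the pipeline supplies the real one.
example_args = parse_args(["--input_data", "/mnt/data/titanic/Titanic.csv"])
print("Input data path:", example_args.input_data)
```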

In [None]:
print('Source directory for the step is {}.'.format(os.path.realpath('./train')))
python_script_step = PythonScriptStep(
                         script_name="train.py",
                         arguments=["--input_data", input_data],
                         inputs=[input_data],
                         outputs=[output_data],
                         compute_target=compute_target, 
                         source_directory="./train",
                         allow_reuse=True)
print("python_script_step created")

### Define NotebookRunnerStep

This step will consume intermediate output produced by `python_script_step` as an input.

Optionally, pass `output_notebook_pipeline_data_name` to `NotebookRunnerStep` to expose the executed notebook (`output_notebook`) as a step output of type `PipelineData`, which can then be passed further along the pipeline.

In [None]:
from azureml.pipeline.core import PipelineParameter

output_from_notebook = PipelineData(name="notebook_processed_data",
                                    datastore=Datastore.get(ws, "workspaceblobstore"))

my_pipeline_param = PipelineParameter(name="pipeline_param", default_value="my_param")

print('Source directory for the step is {}.'.format(os.path.realpath(source_directory)))
notebook_runner_step = NotebookRunnerStep(name="training_notebook_step",
                                          notebook_run_config=cfg,
                                          params={"my_pipeline_param": my_pipeline_param},
                                          inputs=[output_data],
                                          outputs=[output_from_notebook],
                                          allow_reuse=True,
                                          compute_target=compute_target,
                                          output_notebook_pipeline_data_name="notebook_result")

print("Notebook Runner Step is Created.")

### Build Pipeline

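Once we have the steps (or steps collection), we can build the [pipeline](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py). By default, steps with no data dependency between them run in **parallel** once we submit the pipeline. Here, `notebook_runner_step` consumes the output of `python_script_step`, so the script step is pulled into the pipeline through that dependency and runs first.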

A pipeline is created with a list of steps and a workspace. Submit a pipeline using [submit](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py#submit-experiment-name--pipeline-parameters-none--continue-on-step-failure-false--regenerate-outputs-false--parent-run-id-none----kwargs-). When submit is called, a [PipelineRun](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinerun?view=azure-ml-py) is created which in turn creates [StepRun](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.steprun?view=azure-ml-py) objects for each step in the workflow.
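
The way data dependencies determine execution order can be illustrated with a small, self-contained sketch. This is not how Azure ML implements scheduling. It only models the two steps of this notebook as a dependency graph and sorts it with the standard-library `graphlib` module (Python 3.9+). The step labels and the helper `execution_order` are illustrative names.

```python
from graphlib import TopologicalSorter

# Data consumed and produced by each step, using the names from this notebook.
step_inputs = {
    "python_script_step": ["blob_test_data"],  # comes from the datastore, not from a step
    "training_notebook_step": ["processed_data"],
}
step_outputs = {
    "python_script_step": ["processed_data"],
    "training_notebook_step": ["notebook_processed_data"],
}


def execution_order(inputs, outputs):
    """Order steps so that every producer runs before its consumers."""
    producer = {data: step for step, names in outputs.items() for data in names}
    graph = {
        step: {producer[data] for data in needed if data in producer}
        for step, needed in inputs.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(step_inputs, step_outputs)
print(order)
```

Steps that end up with no path between them in such a graph are the ones that are free to run in parallel.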

In [None]:
pipeline1 = Pipeline(workspace=ws, steps=[notebook_runner_step])
print("Pipeline creation complete")

In [None]:
pipeline_run1 = experiment.submit(pipeline1)

In [None]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run1).show()

### Download output notebook

The `output_notebook` can be retrieved from the pipeline step output if `output_notebook_pipeline_data_name` is provided to the `NotebookRunnerStep`.

In [None]:
pipeline_run1.wait_for_completion()
train_step = pipeline_run1.find_step_run('training_notebook_step') # Retrieve the step runs named `training_notebook_step`

if train_step:
    train_step_obj = train_step[0] # since we have only one step by name `training_notebook_step`
    train_step_obj.get_output_data('notebook_result').download(source_directory) # download the output to source_directory
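
After the download, you may want to confirm that the executed notebook contains cell outputs. The sketch below uses only the standard library and makes one assumption: the exact folder layout created by the download is unknown, so it searches recursively for `.ipynb` files. The helper names and the demo folder structure are hypothetical.

```python
import json
import os
import tempfile


def find_notebooks(root):
    """Return the sorted paths of all .ipynb files under `root`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".ipynb"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def count_code_cells_with_output(path):
    """Count the code cells in a notebook file that have at least one output."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    return sum(
        1
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code" and cell.get("outputs")
    )


# Demo with a made-up notebook in a temporary folder.
# In this notebook you would call find_notebooks(source_directory) instead.
with tempfile.TemporaryDirectory() as demo_root:
    demo_nb = {
        "nbformat": 4,
        "nbformat_minor": 2,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
            {
                "cell_type": "code",
                "metadata": {},
                "source": ["print('done')"],
                "execution_count": 1,
                "outputs": [{"output_type": "stream", "name": "stdout", "text": ["done\n"]}],
            },
        ],
    }
    nested = os.path.join(demo_root, "downloaded", "notebook_result")
    os.makedirs(nested)
    with open(os.path.join(nested, "output.ipynb"), "w", encoding="utf-8") as f:
        json.dump(demo_nb, f)

    demo_paths = find_notebooks(demo_root)
    demo_counts = [count_code_cells_with_output(p) for p in demo_paths]
    print(len(demo_paths), "notebook(s); cells with output:", demo_counts)
```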