Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Manage runs

## Table of contents

1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Start, monitor and complete a run](#Start,-monitor-and-complete-a-run)
1. [Add properties and tags](#Add-properties-and-tags)
1. [Query properties and tags](#Query-properties-and-tags)
1. [Start and query child runs](#Start-and-query-child-runs)
1. [Cancel or fail runs](#Cancel-or-fail-runs)
1. [Reproduce a run](#Reproduce-a-run)
1. [Next steps](#Next-steps)

## Introduction

When you're building enterprise-grade machine learning models, it is important to track, organize, monitor and reproduce your training runs. For example, you might want to trace the lineage behind a model deployed to production, and re-run the training experiment to troubleshoot issues. 

This notebook shows examples of how to use the Azure Machine Learning service to manage your training runs.

## Setup

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, if you haven't already, go through the [configuration](../../../configuration.ipynb) notebook first to establish your connection to the Azure ML Workspace. Also, if you're new to Azure ML, we recommend that you go through [the tutorial](https://docs.microsoft.com/en-us/azure/machine-learning/service/tutorial-train-models-with-aml) first to learn the basic concepts.

Let's first import the required packages, check the Azure ML SDK version, connect to your workspace, and create an Experiment to hold the runs.

In [None]:
import azureml.core
from azureml.core import Workspace, Experiment, Run
from azureml.core import ScriptRunConfig

print(azureml.core.VERSION)

In [None]:
ws = Workspace.from_config()

In [None]:
exp = Experiment(workspace=ws, name="explore-runs")

## Start, monitor and complete a run

A run is a unit of execution, typically to train a model, but it can serve other purposes as well, such as loading or transforming data. Runs are tracked by the Azure ML service, and can be instrumented with metrics and artifact logging.

The simplest way to start a run in your interactive Python session is to call the *Experiment.start_logging* method. You can then log metrics from within the run.

In [None]:
notebook_run = exp.start_logging()

notebook_run.log(name="message", value="Hello from run!")

print(notebook_run.get_status())

Use the *get_status* method to get the status of the run.

In [None]:
print(notebook_run.get_status())

You can also simply evaluate the run object in a cell to get a link to its details in the Azure Portal.

In [None]:
notebook_run

The *get_details* method gives you more details on the run.

In [None]:
notebook_run.get_details()
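
The dictionary returned by *get_details* can be condensed for a quick look. The following is a minimal sketch, assuming the dictionary carries `runId`, `status`, `startTimeUtc` and `endTimeUtc` keys with ISO 8601 timestamps ending in `Z`. The helper names `parse_utc` and `summarize_details`, and all sample values, are hypothetical and not part of the SDK.

```python
from datetime import datetime


def parse_utc(timestamp):
    """Parse a timestamp such as '2019-05-01T12:00:30.5Z' (assumed format)."""
    head, _, fraction = timestamp.rstrip("Z").partition(".")
    # strptime's %f accepts at most six digits, so pad or truncate the fraction.
    fraction = (fraction + "000000")[:6]
    return datetime.strptime(head + "." + fraction, "%Y-%m-%dT%H:%M:%S.%f")


def summarize_details(details):
    """Return the run id, status and duration in seconds (None if unfinished)."""
    start = details.get("startTimeUtc")
    end = details.get("endTimeUtc")
    duration = None
    if start and end:
        duration = (parse_utc(end) - parse_utc(start)).total_seconds()
    return {
        "runId": details.get("runId"),
        "status": details.get("status"),
        "duration_s": duration,
    }


# Hypothetical sample; in the notebook you would pass notebook_run.get_details().
sample_details = {
    "runId": "sample-run-id",
    "status": "Completed",
    "startTimeUtc": "2019-05-01T12:00:00.000000Z",
    "endTimeUtc": "2019-05-01T12:00:30.500000Z",
}
summary = summarize_details(sample_details)
print(summary)
```

For a run that is still in progress, the end timestamp is expected to be missing, in which case the sketch reports a duration of `None`.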

Use the *complete* method to end the run.

In [None]:
notebook_run.complete()
print(notebook_run.get_status())

You can also use Python's *with...as* pattern. The run will automatically complete when moving out of scope. This way you don't need to manually complete the run.

In [None]:
with exp.start_logging() as notebook_run:
    notebook_run.log(name="message", value="Hello from run!")
    print("Is it still running?", notebook_run.get_status())

print("Has it completed?", notebook_run.get_status())
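
The automatic completion relies on Python's context-manager protocol. The sketch below uses a hypothetical `SketchRun` class (a stand-in, not the SDK's `Run` class) to illustrate how an `__exit__` method could finish a run when the `with` block ends. Marking the run as failed when an exception escapes the block is an assumption of this sketch; the SDK's internals may differ.

```python
class SketchRun:
    """Hypothetical stand-in for a run; not the azureml.core.Run class."""

    def __init__(self):
        self.status = "Running"
        self.metrics = {}

    def log(self, name, value):
        # Keep every logged value under its metric name.
        self.metrics.setdefault(name, []).append(value)

    def complete(self):
        self.status = "Completed"

    def fail(self):
        self.status = "Failed"

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Normal exit completes the run; an exception marks it failed (assumption).
        if exc_type is None:
            self.complete()
        else:
            self.fail()
        return False  # Do not swallow the exception.


with SketchRun() as sketch_run:
    sketch_run.log("message", "Hello from run!")
    inside_status = sketch_run.status

outside_status = sketch_run.status
print(inside_status, outside_status)
```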

Next, let's look at submitting a run as a separate Python process. To keep the example simple, we submit the run on your local computer. Other targets could include remote VMs and Machine Learning Compute clusters in your Azure ML Workspace.

We use the *hello.py* script as an example. To perform logging, we need to get a reference to the Run instance from within the scope of the script. We do this using the *Run.get_context* method.

In [None]:
!more hello.py

Let's submit the run on a local computer. A standard pattern in the Azure ML SDK is to create a run configuration, and then use the *Experiment.submit* method.

In [None]:
run_config = ScriptRunConfig(source_directory='.', script='hello.py')

local_script_run = exp.submit(run_config)

You can view the status of the run as before.

In [None]:
print(local_script_run.get_status())
local_script_run

Submitted runs have additional log files you can inspect using *get_details_with_logs*.

In [None]:
local_script_run.get_details_with_logs()

Use the *wait_for_completion* method to block local execution until the submitted run is complete.

In [None]:
local_script_run.wait_for_completion(show_output=True)
print(local_script_run.get_status())

## Add properties and tags

Properties and tags help you organize your runs. You can use them to describe, for example, who authored the run, what the results were, and what machine learning approach was used. And as you'll later learn, properties and tags can be used to query the history of your runs to find the important ones.

For example, let's add an "author" property to the run:

In [None]:
local_script_run.add_properties({"author":"azureml-user"})
print(local_script_run.get_properties())

Properties are immutable. Once you assign a value it cannot be changed, making them useful as a permanent record for auditing purposes.

In [None]:
try:
    local_script_run.add_properties({"author":"different-user"})
except Exception as e:
    print(e)
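
Because re-adding an existing property raises an error, it can be convenient to add only the keys a run does not have yet. The following is a minimal sketch that relies on the *get_properties* and *add_properties* methods used above. The helper `add_missing_properties` and the `StubRun` class are hypothetical; the stub only mimics immutable properties so the example runs without a workspace.

```python
def add_missing_properties(run, new_props):
    """Add only the properties whose keys the run does not have yet."""
    existing = run.get_properties()
    to_add = {key: value for key, value in new_props.items() if key not in existing}
    if to_add:
        run.add_properties(to_add)
    return to_add


class StubRun:
    """Hypothetical offline stand-in that mimics immutable properties."""

    def __init__(self):
        self._properties = {}

    def get_properties(self):
        return dict(self._properties)

    def add_properties(self, properties):
        clashes = set(properties) & set(self._properties)
        if clashes:
            raise ValueError("Properties already set: " + ", ".join(sorted(clashes)))
        self._properties.update(properties)


stub = StubRun()
first = add_missing_properties(stub, {"author": "azureml-user"})
# "author" already exists, so only "team" is added the second time.
second = add_missing_properties(stub, {"author": "different-user", "team": "vision"})
print(first, second, stub.get_properties())
```

In the notebook, you would pass `local_script_run` instead of the stub.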

Tags, on the other hand, can be changed:

In [None]:
local_script_run.tag("quality", "great run")
print(local_script_run.get_tags())

In [None]:
local_script_run.tag("quality", "fantastic run")
print(local_script_run.get_tags())

You can also add a simple string tag. It appears in the tag dictionary with a value of None.

In [None]:
local_script_run.tag("worth another look")
print(local_script_run.get_tags())

## Query properties and tags

You can query the runs within an experiment that match specific properties and tags.

In [None]:
list(exp.get_runs(properties={"author":"azureml-user"},tags={"quality":"fantastic run"}))

In [None]:
list(exp.get_runs(properties={"author":"azureml-user"},tags="worth another look"))
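
Once you have a list of matching runs, you may want to organize them further, for example by status. The following is a minimal sketch that uses only the `id` attribute and the *get_status* method of each run. The helper `group_run_ids_by_status` and the `StubRun` class are hypothetical; the stub lets the example run without a workspace.

```python
from collections import defaultdict


def group_run_ids_by_status(runs):
    """Map each status string to the ids of the runs that report it."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run.get_status()].append(run.id)
    return dict(grouped)


class StubRun:
    """Hypothetical offline stand-in exposing id and get_status()."""

    def __init__(self, run_id, status):
        self.id = run_id
        self._status = status

    def get_status(self):
        return self._status


stub_runs = [
    StubRun("run-1", "Completed"),
    StubRun("run-2", "Failed"),
    StubRun("run-3", "Completed"),
]
grouped = group_run_ids_by_status(stub_runs)
print(grouped)
```

In the notebook, you could pass the result of `exp.get_runs(...)` directly, since the helper accepts any iterable of runs.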

## Start and query child runs

You can use child runs to group together related runs, for example different hyperparameter tuning iterations.

Let's use the *hello_with_children* script to create a batch of 5 child runs from within a submitted run.

In [None]:
!more hello_with_children.py

In [None]:
run_config = ScriptRunConfig(source_directory='.', script='hello_with_children.py')

local_script_run = exp.submit(run_config)
local_script_run.wait_for_completion(show_output=True)
print(local_script_run.get_status())

You can also start child runs one by one. Note that this is less efficient than submitting a batch of runs, because each creation results in a network call.

Child runs, too, complete automatically as they move out of scope.

In [None]:
with exp.start_logging() as parent_run:
    for c in range(5):
        with parent_run.child_run() as child:
            child.log(name="Hello from child run", value=c)

To query the child runs belonging to a specific parent, use the *get_children* method.

In [None]:
list(parent_run.get_children())
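
A common follow-up is to compare a metric across the child runs. The following is a minimal sketch, assuming each child exposes an `id` attribute and a *get_metrics* method that returns a dictionary keyed by metric name. The helper `collect_child_metric` and the stub classes are hypothetical; the stubs let the example run without a workspace.

```python
def collect_child_metric(parent_run, metric_name):
    """Map each child run id to the value it logged under metric_name."""
    return {
        child.id: child.get_metrics().get(metric_name)
        for child in parent_run.get_children()
    }


class StubChild:
    """Hypothetical offline stand-in for a child run."""

    def __init__(self, run_id, metrics):
        self.id = run_id
        self._metrics = metrics

    def get_metrics(self):
        return dict(self._metrics)


class StubParent:
    """Hypothetical offline stand-in for a parent run."""

    def __init__(self, children):
        self._children = children

    def get_children(self):
        return iter(self._children)


stub_parent = StubParent(
    [StubChild("child-%d" % i, {"Hello from child run": i}) for i in range(3)]
)
values = collect_child_metric(stub_parent, "Hello from child run")
print(values)
```

With real runs, each *get_metrics* call is likely to involve a network request, so this approach suits a modest number of children.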

## Cancel or fail runs

Sometimes, you realize that the run is not performing as intended, and you want to cancel it instead of waiting for it to complete.

As an example, let's create a Python script with a delay in the middle.

In [None]:
!more hello_with_delay.py

You can use the *cancel* method to cancel a run.

In [None]:
run_config = ScriptRunConfig(source_directory='.', script='hello_with_delay.py')

local_script_run = exp.submit(run_config)
print("Did the run start?", local_script_run.get_status())
local_script_run.cancel()
print("Did the run cancel?", local_script_run.get_status())
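
The status printed right after a cancel request may still be a transitional value rather than the final one. The following is a minimal sketch of polling *get_status* until the run reaches a terminal state. The helper `wait_for_terminal_status`, its parameters, and the `StubRun` class are hypothetical, and the set of terminal status names is an assumption.

```python
import time

# Status names assumed to be terminal.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_status(run, timeout_s=60.0, poll_s=2.0, sleep=time.sleep):
    """Poll run.get_status() until it is terminal or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    status = run.get_status()
    while status not in TERMINAL_STATUSES and time.monotonic() < deadline:
        sleep(poll_s)
        status = run.get_status()
    return status


class StubRun:
    """Hypothetical offline stand-in that replays a scripted status sequence."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    def get_status(self):
        # Return the next scripted status, repeating the last one forever.
        if len(self._statuses) > 1:
            return self._statuses.pop(0)
        return self._statuses[0]


stub = StubRun(["Running", "CancelRequested", "Canceled"])
# Skip real sleeping in this offline demonstration.
final_status = wait_for_terminal_status(stub, sleep=lambda seconds: None)
print(final_status)
```

In the notebook, you would call the helper with `local_script_run` after `cancel()` and keep the default `sleep`.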

You can also mark an unsuccessful run as failed.

In [None]:
local_script_run = exp.submit(run_config)
local_script_run.fail()
print(local_script_run.get_status())

## Reproduce a run

When updating or troubleshooting a model deployed to production, you sometimes need to revisit the original training run that produced the model. To help you with this, the Azure ML service by default creates snapshots of your scripts at the time of run submission.

You can use *restore_snapshot* to obtain a zip package of the latest snapshot of the script folder. 

In [None]:
local_script_run.restore_snapshot(path="snapshots")

You can then extract the zip package, examine the code, and submit your run again.
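
The extraction step can be done with Python's standard `zipfile` module. The following is a minimal sketch; the helper `extract_snapshot` is hypothetical, and the demonstration builds a small throwaway zip file so that it runs without a workspace. In the notebook, you might instead pass the value returned by *restore_snapshot*, assuming it is the path of the downloaded zip file.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_snapshot(zip_path, dest_dir):
    """Extract a snapshot zip into dest_dir and return the archived file names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return sorted(archive.namelist())


# Offline demonstration with a throwaway zip file standing in for a snapshot.
with tempfile.TemporaryDirectory() as tmp:
    sample_zip = Path(tmp) / "snapshot.zip"
    with zipfile.ZipFile(sample_zip, "w") as archive:
        archive.writestr("hello.py", "print('Hello from a restored snapshot')\n")
    extracted = extract_snapshot(sample_zip, Path(tmp) / "restored")
    restored_text = (Path(tmp) / "restored" / "hello.py").read_text()

print(extracted)
```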

## Next steps

 * To learn more about logging APIs, see [logging API notebook](./logging-api/logging-api.ipynb)
 * To learn more about remote runs, see [train on AML compute notebook](./train-on-amlcompute/train-on-amlcompute.ipynb)