Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Reinforcement Learning in Azure Machine Learning - Training multiple agents on collaborative ParticleEnv tasks

This tutorial shows you how to train policies in a multi-agent scenario.
We use OpenAI's Gym-compatible [Particle environments](https://github.com/openai/multiagent-particle-envs),
which model agents and landmarks in a two-dimensional world. Particle comes with
several predefined scenarios, both competitive and collaborative, with or without communication.

For this tutorial, we pick a cooperative navigation scenario where N agents are in a world with N
landmarks. The agents' goal is to cover all the landmarks without collisions,
so agents must learn to avoid each other (social distancing!). The video below shows training
results for N=3 agents/landmarks:

<table style="width:50%">
  <tr>
      <th style="text-align: center;">
          <img src="./images/particle_simple_spread.gif" alt="Particle video" align="middle" margin-left="auto" margin-right="auto"/>
      </th>
  </tr>
  <tr style="text-align: center;">
      <th>Fig 1. Video of 3 agents covering 3 landmarks in a multiagent Particle scenario.</th>
  </tr>
</table>
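
To make the scenario's objective concrete, here is a minimal sketch of a cooperative navigation reward, loosely modeled on the `simple_spread` scenario. The function name, the `agent_radius` default, and the unit collision penalty are illustrative assumptions, not the environment's actual code.

```python
import math  # math.dist requires Python 3.8+


def cooperative_reward(agent_positions, landmark_positions, agent_index, agent_radius=0.15):
    """Hypothetical reward for one agent in a cooperative navigation task.

    agent_positions and landmark_positions are lists of (x, y) tuples.
    """
    reward = 0.0

    # Coverage term: each landmark is penalized by its distance to the closest agent,
    # so the team is rewarded for spreading out over all landmarks.
    for landmark in landmark_positions:
        reward -= min(math.dist(agent, landmark) for agent in agent_positions)

    # Collision term: assumed penalty of 1 for every other agent this agent overlaps with.
    me = agent_positions[agent_index]
    for other_index, other in enumerate(agent_positions):
        if other_index != agent_index and math.dist(me, other) < 2 * agent_radius:
            reward -= 1.0

    return reward


# Three agents sitting exactly on three well-separated landmarks
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(cooperative_reward(positions, positions, agent_index=0))
```

Because rewards are negative distances and penalties, values closer to zero indicate better coverage with fewer collisions.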

The tutorial will cover the following steps:
- Initializing Azure Machine Learning resources for training
- Training policies in a multi-agent environment with Azure Machine Learning service
- Monitoring training progress

## Prerequisites

You should have completed the Azure Machine Learning introductory tutorial. Make sure that you have a valid subscription ID, a resource group, and a workspace. For detailed instructions, see [Tutorial: Get started creating your first ML experiment](https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-1st-experiment-sdk-setup).

Please ensure that you have a current version of IPython (>= 7.15) installed.

While this is a standalone notebook, we highly recommend going over the introductory notebooks for RL first.
- Getting started:
  - [RL using a compute instance with Azure Machine Learning](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/reinforcement-learning/cartpole-on-compute-instance/cartpole_ci.ipynb)
  - [RL using Azure Machine Learning compute](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/reinforcement-learning/cartpole-on-single-compute/cartpole_sc.ipynb)
- [Scaling RL training runs with Azure Machine Learning](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/reinforcement-learning/atari-on-distributed-compute/pong_rllib.ipynb)

Advanced users might also be interested in [this notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/reinforcement-learning/minecraft-on-distributed-compute/minecraft.ipynb) demonstrating how to train a Minecraft RL agent in Azure Machine Learning.

## Initialize resources

All required Azure Machine Learning service resources for this tutorial can be set up from Jupyter. This includes:

- Connecting to your existing Azure Machine Learning workspace.
- Creating an experiment to track runs.
- Creating remote compute targets for [Ray](https://docs.ray.io/en/latest/index.html).


### Azure Machine Learning SDK

Display the Azure Machine Learning SDK version.

In [None]:
import azureml.core
print('Azure Machine Learning SDK Version: ', azureml.core.VERSION)

### Connect to workspace

Get a reference to an existing Azure Machine Learning workspace.

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep=' | ')

### Create an experiment

Create an experiment to track the runs in your workspace. A
workspace can have multiple experiments and each experiment
can be used to track multiple runs (see [documentation](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment.experiment?view=azure-ml-py)
for details).

In [None]:
from azureml.core import Experiment

exp = Experiment(workspace=ws, name='particle-multiagent')

### Create or attach an existing compute resource

A compute target is a designated compute resource where you run your training script. For more information, see [What are compute targets in Azure Machine Learning service?](https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target).

#### CPU target for Ray head

In the experiment setup for this tutorial, the Ray head node runs
on a CPU node (D3 type). A maximum cluster size of 1 node is
therefore sufficient. If you wish to run multiple experiments in
parallel on the same CPU cluster, you can increase this
number. The cluster automatically scales down to 0 nodes when
no training jobs are scheduled (see `min_nodes`).

The code below creates a compute cluster of D3 type nodes.
If a cluster with the specified name already exists in your workspace,
the code skips the creation process.

**Note: Creation of a compute resource can take several minutes**

In [None]:
from azureml.core.compute import AmlCompute, ComputeTarget

cpu_cluster_name = 'cpu-cl-d3'

if cpu_cluster_name in ws.compute_targets:
    cpu_cluster = ws.compute_targets[cpu_cluster_name]
    # Fail early instead of silently continuing with an unusable target
    if not isinstance(cpu_cluster, AmlCompute):
        raise Exception('Found existing compute target for {} '.format(cpu_cluster_name)
                        + 'but it is not an AmlCompute cluster')
    if cpu_cluster.provisioning_state == 'Succeeded':
        print('Found existing compute target for {}. Using it.'.format(cpu_cluster_name))
    else:
        raise Exception('Found existing compute target for {} '.format(cpu_cluster_name)
                        + 'but it is in state {}'.format(cpu_cluster.provisioning_state))
else:
    print('Creating a new compute target for {}...'.format(cpu_cluster_name))
    provisioning_config = AmlCompute.provisioning_configuration(
        vm_size='STANDARD_D3',
        min_nodes=0, 
        max_nodes=1)

    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, provisioning_config)
    cpu_cluster.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    print('Cluster created.')

## Training the policies

### Training environment

This tutorial uses a custom Docker image
with the necessary software installed. The [Environment](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-environments)
class stores the configuration for the training environment. The
Docker image is set via `env.docker.base_image`.
`user_managed_dependencies` is set so that
the preinstalled Python packages in the image are preserved.

Note that capturing videos of the training runs requires a display, so we set `interpreter_path` such that the Python process is started via **xvfb-run**.

In [None]:
import os
from azureml.core import Environment
    
cpu_particle_env = Environment(name='particle-cpu')

cpu_particle_env.docker.enabled = True
cpu_particle_env.docker.base_image = 'akdmsft/particle-cpu'
cpu_particle_env.python.interpreter_path = 'xvfb-run -s "-screen 0 640x480x16 -ac +extension GLX +render" python'

max_train_time = os.environ.get('AML_MAX_TRAIN_TIME_SECONDS', 2 * 60 * 60)
cpu_particle_env.environment_variables['AML_MAX_TRAIN_TIME_SECONDS'] = str(max_train_time)
cpu_particle_env.python.user_managed_dependencies = True

### Training script

This tutorial uses the multiagent algorithm [Multi-Agent Deep Deterministic Policy Gradient (MADDPG)](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=maddpg#multi-agent-deep-deterministic-policy-gradient-contrib-maddpg).
For training policies in a multiagent scenario, Ray's RLlib also
requires the `multiagent` configuration section to be specified. You
can find more information in the [common parameters](https://docs.ray.io/en/latest/rllib-training.html?highlight=multiagent#common-parameters)
documentation.
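
As a rough illustration of what such a `multiagent` section can look like, the sketch below builds one policy per agent and maps each agent ID to its own policy. The helper name `build_multiagent_config`, the `policy_<i>` naming, and the per-policy `agent_id` entry are assumptions made for this example; the training script in `files` may structure this differently, and the observation and action spaces are passed in as opaque placeholders here.

```python
def build_multiagent_config(num_agents, obs_spaces, act_spaces):
    """Hypothetical helper returning an RLlib-style `multiagent` config section."""
    # Each policy spec is a tuple: (policy class or None for the trainer default,
    # observation space, action space, per-policy config overrides).
    policies = {
        'policy_{}'.format(i): (None, obs_spaces[i], act_spaces[i], {'agent_id': i})
        for i in range(num_agents)
    }

    return {
        'policies': policies,
        # Every agent is controlled by the policy with the matching index
        'policy_mapping_fn': lambda agent_id: 'policy_{}'.format(agent_id),
    }


# Placeholder strings stand in for the real Gym spaces of each agent
multiagent = build_multiagent_config(
    num_agents=3,
    obs_spaces=['obs_0', 'obs_1', 'obs_2'],
    act_spaces=['act_0', 'act_1', 'act_2'])
print(sorted(multiagent['policies']))
```

Giving every agent its own policy lets the agents learn different behaviors, at the cost of training N sets of weights.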

For monitoring and understanding the training progress, one
of the training environments is wrapped in a [Gym monitor](https://github.com/openai/gym/blob/master/gym/wrappers/monitor.py)
which periodically captures videos - by default every 200 training
iterations.

The stopping criteria are set such that the training run is
terminated after either a mean reward of -400 is observed, or
training has run for over 2 hours.
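
A stopping rule of this kind could be expressed as a Ray Tune `stop` dictionary, as in the sketch below. The helper name `build_stop_criteria` is hypothetical, and reading the time limit from `AML_MAX_TRAIN_TIME_SECONDS` is an assumption based on the environment variable set earlier in this notebook; the actual training script may do this differently.

```python
import os


def build_stop_criteria(final_reward=-400, default_max_seconds=2 * 60 * 60):
    """Hypothetical helper: stop on a target mean reward or a time limit."""
    max_seconds = int(os.environ.get('AML_MAX_TRAIN_TIME_SECONDS', default_max_seconds))

    # Tune stops a trial as soon as any listed metric reaches its threshold
    return {
        'episode_reward_mean': final_reward,
        'time_total_s': max_seconds,
    }


print(build_stop_criteria())
```

The resulting dictionary would typically be passed as the `stop` argument when launching the trial.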

### Submitting a training run

Below, you create the training run using a `ReinforcementLearningEstimator`
object, which contains all the configuration parameters for this experiment:

- `source_directory`: Contains the training script and helper files to be
  copied onto the node.
- `entry_script`: The training script, described in more detail above.
- `script_params`: The command line arguments to pass to the entry script.
- `compute_target`: The compute target for training script execution.
- `environment`: The Azure Machine Learning environment definition for the node running the training.
- `max_run_duration_seconds`: The time after which to abort the run if it is still running.

For more details, please take a look at the [online documentation](https://docs.microsoft.com/en-us/python/api/azureml-contrib-reinforcementlearning/?view=azure-ml-py)
for Azure Machine Learning service's reinforcement learning offering.

Note that you can use the same notebook and scripts to experiment with
different Particle environments.  You can find a list of supported
environments [here](https://github.com/openai/multiagent-particle-envs/tree/master#list-of-environments).
Simply change the `--scenario` parameter to a supported scenario.

To get the best training results, you can also adjust the
`--final-reward` parameter, which determines when to stop training. A higher
target reward means a longer run time but better results. By default,
the final reward is -400, which should show good progress after
about one hour of run time.
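
For orientation, an entry script could read these two parameters with `argparse` roughly as follows. This is a sketch under the assumption that the script uses `argparse`; the `parse_args` helper and the defaults shown are illustrative and not taken from `particle_train.py`.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical argument parsing for the training entry script."""
    parser = argparse.ArgumentParser(description='Train agents on a Particle scenario')
    parser.add_argument('--scenario', type=str, default='simple_spread',
                        help='Name of the Particle scenario to train on')
    parser.add_argument('--final-reward', type=float, default=-400.0,
                        help='Mean episode reward at which training stops')
    return parser.parse_args(argv)


# Equivalent to passing {'--scenario': 'simple_tag', '--final-reward': -250} as script_params
args = parse_args(['--scenario', 'simple_tag', '--final-reward', '-250'])
print(args.scenario, args.final_reward)
```

Note that `argparse` exposes `--final-reward` as `args.final_reward`, replacing the hyphen with an underscore.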

For this notebook, we use a single D3 node, giving us a total of 4 CPUs and
0 GPUs. One CPU is used by the MADDPG trainer, and an additional CPU is
consumed by the RLlib rollout worker. The other 2 CPUs are not used, but
smaller node types would run out of memory for this task.
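
The CPU accounting above can be written out as a small calculation. The parameter names mirror RLlib's resource settings (`num_workers`, `num_cpus_per_worker`, `num_cpus_for_driver`), but the `required_cpus` helper itself is hypothetical and the values are simply those described in the text.

```python
def required_cpus(num_workers, num_cpus_per_worker=1, num_cpus_for_driver=1):
    """Hypothetical helper: CPUs needed by the trainer plus its rollout workers."""
    return num_cpus_for_driver + num_workers * num_cpus_per_worker


node_cpus = 4                           # CPUs on the single D3 node
used_cpus = required_cpus(num_workers=1)
idle_cpus = node_cpus - used_cpus

print('used: {}, idle: {}'.format(used_cpus, idle_cpus))
```

If you experiment with more rollout workers, a check like this helps confirm that the request still fits on the node.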

Lastly, the RunDetails widget displays information about the submitted RL
experiment, including a link to the Azure portal with more details.

In [None]:
from azureml.contrib.train.rl import ReinforcementLearningEstimator
from azureml.widgets import RunDetails

estimator = ReinforcementLearningEstimator(
    source_directory='files',
    entry_script='particle_train.py',
    script_params={
        '--scenario': 'simple_spread',
        '--final-reward': -400
    },
    compute_target=cpu_cluster,
    environment=cpu_particle_env,
    max_run_duration_seconds=3 * 60 * 60
)

train_run = exp.submit(config=estimator)

RunDetails(train_run).show()

In [None]:
# If you wish to cancel the run before it completes, uncomment and execute:
#train_run.cancel()

## Monitoring training progress

### View the Tensorboard

The Tensorboard can be displayed via the Azure Machine Learning
service's [Tensorboard API](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-monitor-tensorboard).
When running locally, please make sure to follow the instructions
in the link and install required packages. Running this cell will output a URL for the Tensorboard.

Note that the training script sets the log directory when
starting RLlib via the `local_dir` parameter. `./logs` will automatically
appear in the downloadable files for a run. Since the script is
executed in the Ray head node run, we need to get a reference to that run
as shown below.

The Tensorboard API will continuously stream logs from the run.

**Note: It may take a couple of minutes after the run is in "Running"
state before Tensorboard files are available. The board will refresh automatically.**

In [None]:
import time
from azureml.tensorboard import Tensorboard

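head_run = None

# Poll up to 60 times, one second apart, until the Ray head child run appears
timeout = 60
while timeout > 0 and head_run is None:
    timeout -= 1

    try:
        head_run = next(r for r in train_run.get_children() if r.id.endswith('head'))
    except StopIteration:
        time.sleep(1)

# Without a head run there are no logs to stream, so fail with a clear message
if head_run is None:
    raise RuntimeError('Timed out waiting for the Ray head run to start')

tb = Tensorboard([head_run])
tb.start()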

### View training videos

As mentioned above, we record videos of the agents interacting with the
Particle world. These videos are often a crucial indicator for training
success. The code below downloads the latest video as it becomes available
and displays it in-line.

Over time, the agents learn to cooperate and avoid collisions while
traveling to all landmarks.

**Note: It can take several minutes for a video to appear after the run
has started.**
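
The download helper in this notebook picks the newest video by parsing the iteration number out of each file name. The sketch below isolates that step. The example paths are hypothetical and assume names ending in `video<number>.mp4`, which is the pattern the notebook's parsing code relies on.

```python
def video_iteration(path):
    """Extract the number between the last 'video' and '.mp4' in a file path."""
    start = path.rindex('video') + len('video')
    end = path.index('.mp4')
    return int(path[start:end])


def latest_video(paths):
    """Return the path with the highest iteration number."""
    return max(paths, key=video_iteration)


# Hypothetical artifact paths for two recorded videos
paths = [
    '/logs/openaigym.video.0.42.video000000.mp4',
    '/logs/openaigym.video.0.42.video000200.mp4',
]
print(latest_video(paths))
```

Comparing parsed integers rather than raw strings keeps the ordering correct even if the numbers are not zero-padded.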

In [None]:
import tempfile
from azureml.core import Dataset
from azureml.data.dataset_error_handling import DatasetValidationError

from IPython.display import clear_output, display, Video

datastore = ws.get_default_datastore()
path_prefix = './tmp_videos'

def download_latest_training_video(run, video_checkpoint_counter):
    run_artifacts_path = os.path.join('azureml', run.id)
    
    try:
        run_artifacts_ds = Dataset.File.from_files(datastore.path(os.path.join(run_artifacts_path, '**')))
    except DatasetValidationError as e:
        # This happens at the start of the run when there is no data available
        # in the run's artifacts
        return None, video_checkpoint_counter
    
    video_files = [file for file in run_artifacts_ds.to_path() if file.endswith('.mp4')]
    if len(video_files) == video_checkpoint_counter:
        return None, video_checkpoint_counter
    
    iteration_numbers = [int(vf[vf.rindex('video') + len('video') : vf.index('.mp4')]) for vf in video_files]
    latest_video = next(vf for vf in video_files if vf.endswith('{num}.mp4'.format(num=max(iteration_numbers))))
    latest_video = os.path.join(run_artifacts_path, os.path.normpath(latest_video[1:]))
    
    datastore.download(
        target_path=path_prefix,
        prefix=latest_video.replace('\\', '/'),
        show_progress=False)
    
    return os.path.join(path_prefix, latest_video), len(video_files)


def render_video(vf):
    clear_output(wait=True)
    display(Video(data=vf, embed=True, html_attributes='loop autoplay width=50%'))

In [None]:
import shutil

terminal_statuses = ['Canceled', 'Completed', 'Failed']
video_checkpoint_counter = 0

while head_run.get_status() not in terminal_statuses:
    video_file, video_checkpoint_counter = download_latest_training_video(head_run, video_checkpoint_counter)
    if video_file is not None:
        render_video(video_file)
        
        print('Displaying video number {}'.format(video_checkpoint_counter))
        shutil.rmtree(path_prefix)
    
    # Interrupting the kernel can take up to 15 seconds
    # depending on when time.sleep started
    time.sleep(15)
    
train_run.wait_for_completion()
print('The training run has reached a terminal status.')

## Cleaning up

The code snippets below clean up any resources created in this tutorial that you don't wish to retain.

In [None]:
# to stop the Tensorboard, uncomment and run
#tb.stop()

In [None]:
# to delete the cpu compute target, uncomment and run
#cpu_cluster.delete()

## Next steps

We would love to hear your feedback! Please let us know what you think of Reinforcement Learning in Azure Machine Learning and what features you are looking forward to.