Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Tutorial: Train your first ML model (Part 3 of 4)

---
## Introduction
In the [previous tutorial](day1-part2-hello-world.ipynb), you ran a trivial "Hello world!" script in the cloud using Azure Machine Learning's Python SDK. This time you take it a step further by submitting a script that trains a machine learning model. This example helps you understand how Azure Machine Learning keeps behavior consistent between local debugging, on a compute instance or a laptop development environment, and remote runs.

Learning these concepts means that by the end of this session, you can:

* Use Conda to define an Azure Machine Learning environment.
* Train a model in the cloud.
* Log metrics to Azure Machine Learning.

This notebook follows the steps provided on the [Python (day 1) - train a model documentation page](https://aka.ms/day1aml).

## Prerequisites

- You have completed the following:
  - [Setup on your compute cluster](day1-part1-setup.ipynb)
  - [Tutorial: Hello World example](day1-part2-hello-world.ipynb)
- Familiarity with Python and Machine Learning concepts
- If you are using a compute instance in Azure Machine Learning to run this notebook series, you are all set. Otherwise, follow the [Configure a development environment for Azure Machine Learning](https://docs.microsoft.com/azure/machine-learning/how-to-configure-environment) guide.

---

## Your machine learning code

This tutorial shows you how to train a PyTorch model on the CIFAR10 dataset using an Azure Machine Learning cluster. In this case you use a CPU cluster, but it could equally be a GPU cluster. Although this tutorial uses PyTorch, the steps apply to *any* machine learning code.

In the `code/pytorch-cifar10-train` subdirectory you will see two files:

1. [model.py](code/pytorch-cifar10-train/model.py) - This defines the neural network architecture.
1. [train.py](code/pytorch-cifar10-train/train.py) - This is the training script. It downloads the CIFAR10 dataset using the PyTorch `torchvision.datasets` APIs, sets up the network defined in `model.py`, and trains it for two epochs using standard SGD and cross-entropy loss.

Note the code is based on [this introductory example from PyTorch](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html). 
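To make "standard SGD and cross-entropy loss" concrete, here is a minimal, dependency-free sketch of repeated SGD updates for a tiny linear softmax classifier. It is an illustration only, not the code in `train.py`: the function names, the toy weights, and the learning rate are all assumptions, and PyTorch performs the equivalent steps for you.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target])


def sgd_step(weights, features, target, lr=0.1):
    """Apply one SGD update in place and return the loss before the update."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(logits)
    loss = cross_entropy(probs, target)
    for k, row in enumerate(weights):
        # Gradient of the loss with respect to logit k is (p_k - 1[k == target]).
        grad_logit = probs[k] - (1.0 if k == target else 0.0)
        for j, x in enumerate(features):
            row[j] -= lr * grad_logit * x
    return loss


# Toy problem: 3 classes, 2 input features, one training example of class 2.
weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
features = [1.0, 2.0]
losses = [sgd_step(weights, features, target=2) for _ in range(5)]
print([round(value, 3) for value in losses])
```

Each call lowers the loss on the example a little, which is the same pattern you see in the training output as the logged loss falls from batch to batch.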

### Define the Python environment for your machine learning code

For demonstration purposes, we're going to use a Conda environment but the steps for a pip virtual environment are almost identical. This environment has all the dependencies that your model and training script require. 

In the `configuration` directory there is a *conda dependencies* file called [pytorch-env.yml](configuration/pytorch-env.yml) that specifies the dependencies needed to run the Python code.
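For orientation, the sketch below writes out what a conda dependencies file of this kind might look like. The channel and package lists are assumptions rather than a copy of `pytorch-env.yml`, and `write_example_spec` and `example-env.yml` are hypothetical names; the file is written to a temporary directory so nothing in the tutorial folder is touched.

```python
import tempfile
from pathlib import Path

# Assumed contents: the real configuration/pytorch-env.yml may pin versions
# or list different channels and packages.
EXAMPLE_SPEC = """\
name: pytorch-env
channels:
  - defaults
  - pytorch
dependencies:
  - python
  - pytorch
  - torchvision
"""


def write_example_spec(directory):
    """Write the example spec to <directory>/example-env.yml and return its path."""
    path = Path(directory) / "example-env.yml"
    path.write_text(EXAMPLE_SPEC)
    return path


with tempfile.TemporaryDirectory() as tmp:
    spec_path = write_example_spec(tmp)
    print(spec_path.read_text())
```

A file in this format is what `Environment.from_conda_specification` reads in the control script later in this tutorial.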

### Test in your development environment

Test that your script runs on either your compute instance or your laptop using this environment.

In [None]:
!python code/pytorch-cifar10-train/train.py

**You should notice that the script has downloaded the data into a directory called `data`.**
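If the local run fails with an import error, a quick check like the following can tell you which packages are missing from the active environment. This is a small sketch using only the standard library; `missing_modules` is a hypothetical helper, and the module names `torch` and `torchvision` are assumed from the description of `train.py` above.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported here."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Assumed import names for the packages the training script relies on.
required = ["torch", "torchvision"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```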

## Submit your machine learning code to Azure Machine Learning

The difference between the control script below and the one used to submit "hello world" is that this one sets the environment from the conda dependencies file described earlier.

> <span style="color:purple; font-weight:bold">! NOTE <br>
> The first time you run this script, Azure Machine Learning builds a new Docker image from your PyTorch environment. The whole run could take 5 to 10 minutes to complete. You can see the Docker build logs in the widget by selecting `20_image_build_log.txt` in the log files dropdown. This image is reused in future runs, making them much quicker.</span>


In [None]:
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig
from azureml.widgets import RunDetails

ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name='day1-experiment-train')
config = ScriptRunConfig(source_directory='code/pytorch-cifar10-train/', script='train.py', compute_target='cpu-cluster')

env = Environment.from_conda_specification(name='pytorch-env', file_path='configuration/pytorch-env.yml')
config.run_config.environment = env

run = experiment.submit(config)

RunDetails(run).show()

### Understand the control code

Compared to the control script that submitted the "hello world" example, this control script introduces the following:

| Code | Description |
| --- | --- |
| `env = Environment.from_conda_specification( ...)` | Azure Machine Learning provides the concept of an `Environment` to represent a reproducible, <br>versioned Python environment for running experiments. Here you create it from a YAML conda dependencies file. |
| `config.run_config.environment = env` | Adds the environment to the `ScriptRunConfig`. |


**There are many ways to create AML environments, including [from a pip requirements.txt](https://docs.microsoft.com/python/api/azureml-core/azureml.core.environment.environment?view=azure-ml-py&preserve-view=true#from-pip-requirements-name--file-path-), or even [from an existing local Conda environment](https://docs.microsoft.com/python/api/azureml-core/azureml.core.environment.environment?view=azure-ml-py&preserve-view=true#from-existing-conda-environment-name--conda-environment-name-).**


Once your image is built, select `70_driver_log.txt` to see the output of your training script, which should look like:

```txt
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
...
Files already downloaded and verified
epoch=1, batch= 2000: loss 2.19
...
epoch=2, batch=12000: loss 1.27
Finished Training
```
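If you would rather stream the logs into the notebook than use the widget, `Run.wait_for_completion(show_output=True)` blocks until the run finishes and `Run.get_status()` reports the result. The sketch below wraps both calls. `FakeRun` and `wait_and_get_status` are hypothetical names: `FakeRun` is a stand-in so the example runs without a workspace, and in the notebook you would pass the `run` returned by `experiment.submit(config)` instead.

```python
class FakeRun:
    """Hypothetical stand-in for azureml.core.Run so the sketch runs offline."""

    def __init__(self):
        self.waited = False

    def wait_for_completion(self, show_output=False):
        # The real method blocks until the remote run ends; this one only
        # records that it was called.
        self.waited = True

    def get_status(self):
        return "Completed" if self.waited else "Running"


def wait_and_get_status(run):
    """Block until the run finishes, streaming its logs, then return its status."""
    run.wait_for_completion(show_output=True)
    return run.get_status()


print(wait_and_get_status(FakeRun()))
```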

Environments can be registered to a workspace with `env.register(ws)`, allowing them to be easily shared, reused, and versioned. Environments make it easy to reproduce previous results and to collaborate with your team.

Azure Machine Learning also maintains a collection of curated environments. These environments cover common ML scenarios and are backed by cached Docker images. Cached Docker images make the first remote run faster.

In short, using registered environments can save you time! More details can be found in the [environments documentation](./how-to-use-environments.md).

## Log training metrics

Now that you have a model training in Azure Machine Learning, start tracking some performance metrics.
The current training script prints metrics to the terminal. Azure Machine Learning provides a
mechanism for logging metrics with more functionality. By adding a few lines of code, you gain the ability to visualize metrics in the studio and to compare metrics between multiple runs.

### Machine learning code updates

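In the `code/pytorch-cifar10-train-with-logging` directory you will notice that the [train.py](code/pytorch-cifar10-train-with-logging/train.py) script has been modified to import `Run` and add two lines that log the loss to the Azure Machine Learning studio:

```python
# in train.py
from azureml.core import Run  # provides access to the current run

# get the run this script is executing in
run = Run.get_context()
...
# log the loss so it appears under the run's metrics
run.log('loss', loss)
```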

Metrics in Azure Machine Learning are:

- Organized by experiment and run, so it's easy to keep track of and compare metrics.
- Equipped with a UI, so you can visualize training performance in the studio or in the notebook widget.
- **Designed to scale.** You can submit concurrent experiments and the Azure Machine Learning cluster will scale out (up to the maximum node count of the cluster) to run the experiments in parallel.
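As a rough sketch of how such logging might sit inside a training loop, the helper below averages the loss over a fixed number of batches and logs one value per window. It assumes each batch loss is a plain number; `RecordingRun` is a hypothetical stand-in for the object returned by `Run.get_context()`, and `log_running_loss` and the window size are illustrative rather than taken from `train.py`.

```python
class RecordingRun:
    """Hypothetical stand-in for azureml.core.Run that records logged values."""

    def __init__(self):
        self.logged = []

    def log(self, name, value):
        self.logged.append((name, value))
        print(f"{name}: {value:.3f}")


def log_running_loss(run, batch_losses, every=2000):
    """Log the mean loss of each full window of `every` batches."""
    running = 0.0
    for count, batch_loss in enumerate(batch_losses, start=1):
        running += float(batch_loss)
        if count % every == 0:
            run.log("loss", running / every)
            running = 0.0  # start a new window


demo_run = RecordingRun()
log_running_loss(demo_run, [1.0, 3.0, 2.0, 4.0, 9.0], every=2)
```

Logging an average per window rather than every batch keeps the chart in the studio readable; a trailing partial window (the final `9.0` here) is simply not logged in this sketch.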

### Update the Environment for your machine learning code

The `train.py` script now depends on `azureml.core`. The conda dependencies file [pytorch-aml-env.yml](configuration/pytorch-aml-env.yml) reflects this change.

### Submit your machine learning code to Azure Machine Learning
Submit your code once more. This time the widget also shows the metrics, so you can see live updates of the model training loss!

In [None]:
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig
from azureml.widgets import RunDetails

ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name='day1-experiment-train')
config = ScriptRunConfig(source_directory='code/pytorch-cifar10-train-with-logging', script='train.py', compute_target='cpu-cluster')

env = Environment.from_conda_specification(name='pytorch-aml-env', file_path='configuration/pytorch-aml-env.yml')
config.run_config.environment = env

run = experiment.submit(config)
RunDetails(run).show()
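Once the run has finished, the logged values can also be read back in code with `Run.get_metrics()`, which returns a dictionary keyed by metric name. The sketch below assumes, as the SDK is understood to behave, that a metric logged once comes back as a single value and a metric logged several times comes back as a list, and it normalizes both cases. `loss_history` and `StubRun` are hypothetical names, and the numbers are made up for the demonstration.

```python
class StubRun:
    """Hypothetical stand-in for a completed azureml.core.Run."""

    def __init__(self, metrics):
        self._metrics = metrics

    def get_metrics(self):
        return dict(self._metrics)


def loss_history(run, metric_name="loss"):
    """Return the logged values for metric_name as a list (empty if absent)."""
    values = run.get_metrics().get(metric_name, [])
    if not isinstance(values, list):
        values = [values]  # a metric logged only once
    return values


# Made-up values for illustration only.
history = loss_history(StubRun({"loss": [2.0, 1.5, 1.25]}))
print(f"logged {len(history)} values, last = {history[-1]}")
```

In the notebook you would call `loss_history(run)` with the `run` object from the cell above.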

## Next steps

In this session, you upgraded from a basic "Hello world!" script to a more realistic
training script that required a specific Python environment to run. You saw how
to take a local Conda environment to the cloud with Azure Machine Learning Environments. Finally, you
saw how in a few lines of code you can log metrics to Azure Machine Learning.

In the next session, you'll see how to work with data in Azure Machine Learning by uploading the CIFAR10
dataset to Azure.

[Tutorial: Bring your own data](day1-part4-data.ipynb)
