Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# 05. Train in Spark
* Create Workspace
* Create Experiment
* Copy relevant files to the script folder
* Configure and Run

## Prerequisites
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, if you haven't already, go through the [configuration](../../../configuration.ipynb) notebook first to establish your connection to the AzureML Workspace.

In [None]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

## Initialize Workspace

Initialize a workspace object from persisted configuration.

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')
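
For reference, the persisted configuration is typically a small `config.json` file holding the subscription ID, resource group, and workspace name. The sketch below writes such a file using only the standard library. The placeholder values and the `write_workspace_config` helper are illustrative assumptions, not part of this notebook or the SDK.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write a config.json in the shape Workspace.from_config() is assumed to read."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(folder) / "config.json"
    path.write_text(json.dumps(config, indent=4))
    return path


# Placeholder values for illustration only; substitute your own.
demo_dir = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_dir, "my-subscription-id", "my-resource-group", "my-workspace")
print(config_path.read_text())
```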

## Create Experiment


In [None]:
experiment_name = 'train-on-spark'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

## View `train-spark.py`

For convenience, we created a training script for you. It is printed below as text, but you can also run `%pfile ./train-spark.py` in a cell to show the file.

In [None]:
with open('train-spark.py', 'r') as training_script:
    print(training_script.read())

## Configure & Run

**Note** You can use Docker-based execution to run the Spark job on a local computer or a remote VM. See the `train-in-remote-vm` notebook for an example of how to configure and run in Docker mode on a VM. Make sure you choose a Docker image that has Spark installed, such as `microsoft/mmlspark:0.12`.

### Attach an HDI cluster
Here we will use an actual Spark cluster, HDInsight for Spark, to run this job. To use an HDI compute target:
 1. Create a Spark for HDI cluster in Azure. Here are some [quick instructions](https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql). Make sure you use the Ubuntu flavor, NOT CentOS.
 2. Enter the cluster address, username and password below. The cell reads them from the `hdiservername`, `hdiusername` and `hdipassword` environment variables.

In [None]:
from azureml.core.compute import ComputeTarget, HDInsightCompute
from azureml.exceptions import ComputeTargetException
import os

try:
    # To connect with an SSH key instead of a username/password, provide the
    # parameters private_key_file and private_key_passphrase.
    attach_config = HDInsightCompute.attach_configuration(
        address=os.environ.get('hdiservername', '-ssh.azurehdinsight.net'),
        ssh_port=22,
        username=os.environ.get('hdiusername', ''),
        password=os.environ.get('hdipassword', ''))
    hdi_compute = ComputeTarget.attach(workspace=ws,
                                       name='myhdi',
                                       attach_configuration=attach_config)

    # Wait inside the try block: if the attach call fails, hdi_compute is
    # never assigned and waiting on it would raise a NameError.
    hdi_compute.wait_for_completion(show_output=True)
except ComputeTargetException as e:
    print("Caught = {}".format(e.message))
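
Because the cell above falls back to default values when an environment variable is missing, it can help to check the settings before attaching. This is a minimal sketch using the variable names from the cell above. The `missing_hdi_settings` helper is hypothetical and not part of the SDK.

```python
import os

# Environment variable names used by the attach cell in this notebook.
HDI_SETTINGS = ("hdiservername", "hdiusername", "hdipassword")


def missing_hdi_settings(env=None):
    """Return the names of HDI settings that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in HDI_SETTINGS if not env.get(name)]


missing = missing_hdi_settings()
if missing:
    print("Set these environment variables before attaching:", ", ".join(missing))
else:
    print("All HDI connection settings are present.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.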

### Configure HDI run

Configure an execution using the HDInsight cluster with a conda environment that has `numpy`.

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# use pyspark framework
hdi_run_config = RunConfiguration(framework="pyspark")

# Set compute target to the HDI cluster
hdi_run_config.target = hdi_compute.name

# specify a CondaDependencies object to ask the system to install numpy
cd = CondaDependencies()
cd.add_conda_package('numpy')
hdi_run_config.environment.python.conda_dependencies = cd

### Submit the script to HDI

In [None]:
from azureml.core import ScriptRunConfig

script_run_config = ScriptRunConfig(source_directory='.',
                                    script='train-spark.py',
                                    run_config=hdi_run_config)
run = exp.submit(config=script_run_config)

Monitor the run using a Jupyter widget.

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()

Note: if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run).
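
If the widget cannot render (for example, when this notebook is run as a plain script), the run can be polled instead. The sketch below assumes the run object exposes `get_status()` returning a status string, and that `Completed`, `Failed` and `Canceled` are the terminal states. `wait_for_terminal_status` and `FakeRun` are hypothetical names, used here so the example runs without a cluster.

```python
import time

# Status strings assumed to mean the run will not change state any more.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_status(run, poll_seconds=10, sleep=time.sleep):
    """Poll run.get_status() until it reports a terminal state, then return it."""
    status = run.get_status()
    while status not in TERMINAL_STATES:
        print("Run status:", status)
        sleep(poll_seconds)
        status = run.get_status()
    return status


class FakeRun:
    """Stand-in for a submitted run, used only to demonstrate the helper."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_status(self):
        return next(self._statuses)


demo_run = FakeRun(["Queued", "Running", "Completed"])
final_status = wait_for_terminal_status(demo_run, poll_seconds=0)
print("Final status:", final_status)
```

With a real run you would call `wait_for_terminal_status(run)`. The SDK's own `run.wait_for_completion(show_output=True)` offers similar blocking behavior.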

After the run has finished successfully, you can check the logged metrics.

In [None]:
# get all metrics logged in the run
metrics = run.get_metrics()
print(metrics)
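
`print(metrics)` shows the raw dictionary. As a rough sketch of a more readable summary, assuming `get_metrics()` returns a dictionary that maps each metric name to either a single value or a list of values, the hypothetical `format_metrics` helper below produces one line per metric. The sample dictionary is made up for illustration and is not output from this run.

```python
def format_metrics(metrics):
    """Return one line per metric; non-empty lists are reduced to their last value."""
    lines = []
    for name in sorted(metrics):
        value = metrics[name]
        if isinstance(value, list) and value:
            # A metric logged several times is assumed to arrive as a list.
            lines.append("{}: {} (last of {} values)".format(name, value[-1], len(value)))
        else:
            lines.append("{}: {}".format(name, value))
    return lines


# Made-up sample data for illustration only.
sample_metrics = {"Accuracy": 0.9, "Loss": [0.7, 0.5, 0.4]}
for line in format_metrics(sample_metrics):
    print(line)
```

With the real run, pass the `metrics` dictionary from the cell above in place of `sample_metrics`.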