Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.


# Use LightGBM Estimator in Azure Machine Learning
This notebook demonstrates how to run a training job using the LightGBM estimator. [LightGBM](https://lightgbm.readthedocs.io/en/latest/) is a gradient boosting framework that uses tree-based learning algorithms.

## Prerequisites
This notebook uses the azureml-contrib-gbdt package. If you don't already have it, install it by uncommenting the cell below.

In [None]:
#!pip install azureml-contrib-gbdt

In [None]:
from azureml.core import Workspace, Run, Experiment
import os
import shutil
from azureml.widgets import RunDetails
from azureml.contrib.gbdt import LightGBM
from azureml.train.dnn import Mpi
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

If you are using an AzureML Compute Instance, you are all set. Otherwise, go through the [configuration.ipynb](../../../configuration.ipynb) notebook to install the Azure Machine Learning Python SDK and create an Azure ML Workspace.

## Set up machine learning resources

In [None]:
ws = Workspace.from_config()

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')
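
`Workspace.from_config()` reads the workspace details from a `config.json` file. If that call fails, a quick check of the file can help. The sketch below assumes the usual three keys (`subscription_id`, `resource_group`, `workspace_name`); the helper name `missing_workspace_keys` is hypothetical and not part of the SDK.

```python
import json
import tempfile
from pathlib import Path

# Keys assumed to be required in config.json by Workspace.from_config().
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_workspace_keys(config_path):
    """Return the required keys that are absent or empty in config_path."""
    config = json.loads(Path(config_path).read_text())
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demonstration on a throwaway file with placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"subscription_id": "<id>",
                                     "resource_group": "<rg>"}))
    print(missing_workspace_keys(demo_path))  # prints ['workspace_name']
```

An empty list means all three keys are present; it does not confirm that the values point to a workspace you can access.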

In [None]:
cluster_vm_size = "STANDARD_DS14_V2"
cluster_min_nodes = 0
cluster_max_nodes = 20
cpu_cluster_name = 'TrainingCompute2' 

try:
    cpu_cluster = AmlCompute(ws, cpu_cluster_name)
    if cpu_cluster and type(cpu_cluster) is AmlCompute:
        print('found compute target: ' + cpu_cluster_name)
except ComputeTargetException:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = cluster_vm_size, 
                                                                vm_priority = 'lowpriority', 
                                                                min_nodes = cluster_min_nodes, 
                                                                max_nodes = cluster_max_nodes)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    cpu_cluster.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current Azure Machine Learning Compute status, use get_status()
    print(cpu_cluster.get_status().serialize())

From this point, you can either upload the training data files directly or use a Datastore for training data storage.

## Upload training file from local

In [None]:
scripts_folder = "scripts_folder"
if not os.path.isdir(scripts_folder):
    os.mkdir(scripts_folder)
shutil.copy('./train.conf', os.path.join(scripts_folder, 'train.conf'))
shutil.copy('./binary0.train', os.path.join(scripts_folder, 'binary0.train'))
shutil.copy('./binary1.train', os.path.join(scripts_folder, 'binary1.train'))
shutil.copy('./binary0.test', os.path.join(scripts_folder, 'binary0.test'))
shutil.copy('./binary1.test', os.path.join(scripts_folder, 'binary1.test'))
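
The copy calls above fail one at a time if a file is missing. As a minimal sketch, the same staging step can be written as a loop that checks all inputs first. The helper name `stage_files` is hypothetical; it only uses the standard library.

```python
import shutil
from pathlib import Path


def stage_files(file_names, source_dir, target_dir):
    """Copy file_names from source_dir to target_dir.

    Raises FileNotFoundError before copying anything if any input is missing.
    Returns the list of destination paths.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    missing = [name for name in file_names if not (source_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing input files: {missing}")
    target_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy(source_dir / name, target_dir / name))
            for name in file_names]
```

For this notebook, the call would look like `stage_files(['train.conf', 'binary0.train', 'binary1.train', 'binary0.test', 'binary1.test'], '.', scripts_folder)`.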

In [None]:
training_data_list = ["binary0.train", "binary1.train"]
validation_data_list = ["binary0.test", "binary1.test"]
lgbm = LightGBM(source_directory=scripts_folder, 
                compute_target=cpu_cluster, 
                distributed_training=Mpi(),
                node_count=2,
                lightgbm_config='train.conf',
                data=training_data_list,
                valid=validation_data_list
               )
experiment_name = 'lightgbm-estimator-test'
experiment = Experiment(ws, name=experiment_name)
run = experiment.submit(lgbm, tags={"test public docker image": None})
RunDetails(run).show()
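
The contents of `train.conf` are not shown in this notebook. LightGBM config files are plain text with one `key = value` pair per line and `#` for comments, so the sketch below renders and parses that format. The parameter values are illustrative assumptions, not the contents of the actual `train.conf`, and the two helper functions are hypothetical.

```python
def render_lightgbm_conf(params):
    """Render a dict as LightGBM config text: one `key = value` pair per line."""
    return "\n".join(f"{key} = {value}" for key, value in params.items()) + "\n"


def parse_lightgbm_conf(text):
    """Parse config text into a dict of strings, skipping blanks and # comments."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


# Illustrative values only; the real train.conf may differ.
example_params = {
    "task": "train",
    "objective": "binary",
    "metric": "auc",
    "tree_learner": "data",  # data-parallel learner, used for distributed training
    "num_leaves": 63,
    "learning_rate": 0.1,
    "num_iterations": 100,
}
conf_text = render_lightgbm_conf(example_params)
print(conf_text)
```

Parsing an existing file with `parse_lightgbm_conf(open('train.conf').read())` is a quick way to review the settings before submitting a run.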

In [None]:
run.wait_for_completion(show_output=True)

## Use data reference

In [None]:
from azureml.core.datastore import Datastore
from azureml.data.data_reference import DataReference
datastore = ws.get_default_datastore()

In [None]:
datastore.upload(src_dir='.',
                 target_path='.',
                 show_progress=True)
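
The cell above uploads everything in the current directory, including the notebook itself. A minimal sketch of selecting only the training inputs is shown below. The helper `collect_data_files` and its default patterns are hypothetical, and the commented usage assumes the datastore object exposes `upload_files`, as blob datastores in the v1 SDK do.

```python
from fnmatch import fnmatch
from pathlib import Path


def collect_data_files(src_dir, patterns=("*.train", "*.test", "*.conf")):
    """Return sorted paths of files in src_dir whose names match any pattern."""
    return sorted(
        str(path)
        for path in Path(src_dir).iterdir()
        if path.is_file() and any(fnmatch(path.name, p) for p in patterns)
    )


# Possible usage (assumption: the datastore supports upload_files):
# datastore.upload_files(files=collect_data_files('.'),
#                        relative_root='.',
#                        target_path='.',
#                        show_progress=True)
```

Uploading only the files the job reads keeps the mounted datastore small and avoids overwriting unrelated blobs.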

In [None]:
training_data_list = ["binary0.train", "binary1.train"]
validation_data_list = ["binary0.test", "binary1.test"]
lgbm = LightGBM(source_directory='.', 
                compute_target=cpu_cluster, 
                distributed_training=Mpi(),
                node_count=2,
                inputs=[datastore.as_mount()],
                lightgbm_config='train.conf',
                data=training_data_list,
                valid=validation_data_list
               )
experiment_name = 'lightgbm-estimator-test'
experiment = Experiment(ws, name=experiment_name)
run = experiment.submit(lgbm, tags={"use datastore.as_mount()": None})
RunDetails(run).show()

In [None]:
run.wait_for_completion(show_output=True)

In [None]:
# uncomment below and run if compute resources are no longer needed
# cpu_cluster.delete() 