MachineLearningNotebooks/contrib/gbdt/lightgbm/lightgbm-example.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.  \n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Use LightGBM Estimator in Azure Machine Learning\n",
        "This notebook demonstrates how to run a training job using the LightGBM estimator. [LightGBM](https://lightgbm.readthedocs.io/en/latest/) is a gradient boosting framework that uses tree-based learning algorithms."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Prerequisites\n",
        "This notebook uses the `azureml-contrib-gbdt` package. If you don't already have it installed, uncomment and run the cell below."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#!pip install azureml-contrib-gbdt"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "import shutil\n",
        "\n",
        "from azureml.core import Workspace, Run, Experiment\n",
        "from azureml.core.compute import AmlCompute, ComputeTarget\n",
        "from azureml.core.compute_target import ComputeTargetException\n",
        "from azureml.contrib.gbdt import LightGBM\n",
        "from azureml.train.dnn import Mpi\n",
        "from azureml.widgets import RunDetails"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the [configuration.ipynb](../../../configuration.ipynb) notebook to install the Azure Machine Learning Python SDK and create an Azure ML Workspace."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Set up machine learning resources"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "\n",
        "print('Workspace name: ' + ws.name, \n",
        "      'Azure region: ' + ws.location, \n",
        "      'Subscription id: ' + ws.subscription_id, \n",
        "      'Resource group: ' + ws.resource_group, sep = '\\n')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "cluster_vm_size = \"STANDARD_DS14_V2\"\n",
        "cluster_min_nodes = 0\n",
        "cluster_max_nodes = 20\n",
        "cpu_cluster_name = 'TrainingCompute2'\n",
        "\n",
        "try:\n",
        "    # Reuse the cluster if one with this name already exists in the workspace\n",
        "    cpu_cluster = AmlCompute(ws, cpu_cluster_name)\n",
        "    if cpu_cluster and type(cpu_cluster) is AmlCompute:\n",
        "        print('found compute target: ' + cpu_cluster_name)\n",
        "except ComputeTargetException:\n",
        "    print('creating a new compute target...')\n",
        "    provisioning_config = AmlCompute.provisioning_configuration(vm_size=cluster_vm_size,\n",
        "                                                                vm_priority='lowpriority',\n",
        "                                                                min_nodes=cluster_min_nodes,\n",
        "                                                                max_nodes=cluster_max_nodes)\n",
        "    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, provisioning_config)\n",
        "\n",
        "    # You can poll for a minimum number of nodes and for a specific timeout.\n",
        "    # If no min node count is provided, the scale settings of the cluster are used.\n",
        "    cpu_cluster.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
        "\n",
        "    # For a more detailed view of the current Azure Machine Learning Compute status, use get_status()\n",
        "    print(cpu_cluster.get_status().serialize())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "From this point, you can either upload the training data files directly or use a Datastore to store the training data.\n",
        "\n",
        "## Upload training files from local"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "scripts_folder = \"scripts_folder\"\n",
        "os.makedirs(scripts_folder, exist_ok=True)\n",
        "\n",
        "# Copy the LightGBM config and the train/test files into the folder submitted with the job\n",
        "for file_name in ['train.conf', 'binary0.train', 'binary1.train', 'binary0.test', 'binary1.test']:\n",
        "    shutil.copy(os.path.join('.', file_name), os.path.join(scripts_folder, file_name))"
      ]
    },
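    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The contents of `train.conf` are not shown in this notebook. As a rough sketch only, a LightGBM configuration for a distributed binary classification job might look like the following. The parameter names are standard LightGBM options, but every value here is an illustrative assumption, not the configuration shipped with this sample.\n",
        "\n",
        "```\n",
        "# Hypothetical train.conf (all values are assumptions)\n",
        "task = train\n",
        "boosting_type = gbdt\n",
        "objective = binary\n",
        "metric = binary_logloss,auc\n",
        "\n",
        "# data-parallel tree learner, intended for multi-machine training\n",
        "tree_learner = data\n",
        "\n",
        "num_trees = 100\n",
        "learning_rate = 0.1\n",
        "num_leaves = 63\n",
        "```\n",
        "\n",
        "The `data` and `valid` entries are left out of this sketch because the estimator call below passes the training and validation file lists separately."
      ]
    },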
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "training_data_list = [\"binary0.train\", \"binary1.train\"]\n",
        "validation_data_list = [\"binary0.test\", \"binary1.test\"]\n",
        "\n",
        "# Configure a distributed LightGBM job that runs on two nodes over MPI\n",
        "lgbm = LightGBM(source_directory=scripts_folder,\n",
        "                compute_target=cpu_cluster,\n",
        "                distributed_training=Mpi(),\n",
        "                node_count=2,\n",
        "                lightgbm_config='train.conf',\n",
        "                data=training_data_list,\n",
        "                valid=validation_data_list\n",
        "               )\n",
        "experiment_name = 'lightgbm-estimator-test'\n",
        "experiment = Experiment(ws, name=experiment_name)\n",
        "run = experiment.submit(lgbm, tags={\"test public docker image\": None})\n",
        "RunDetails(run).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.wait_for_completion(show_output=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Use data reference"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.datastore import Datastore\n",
        "from azureml.data.data_reference import DataReference\n",
        "datastore = ws.get_default_datastore()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "datastore.upload(src_dir='.',\n",
        "                 target_path='.',\n",
        "                 show_progress=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "training_data_list = [\"binary0.train\", \"binary1.train\"]\n",
        "validation_data_list = [\"binary0.test\", \"binary1.test\"]\n",
        "\n",
        "# Same job as before, but with the default datastore mounted as an input\n",
        "lgbm = LightGBM(source_directory='.',\n",
        "                compute_target=cpu_cluster,\n",
        "                distributed_training=Mpi(),\n",
        "                node_count=2,\n",
        "                inputs=[datastore.as_mount()],\n",
        "                lightgbm_config='train.conf',\n",
        "                data=training_data_list,\n",
        "                valid=validation_data_list\n",
        "               )\n",
        "experiment_name = 'lightgbm-estimator-test'\n",
        "experiment = Experiment(ws, name=experiment_name)\n",
        "run = experiment.submit(lgbm, tags={\"use datastore.as_mount()\": None})\n",
        "RunDetails(run).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.wait_for_completion(show_output=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# uncomment below and run if compute resources are no longer needed\n",
        "# cpu_cluster.delete() "
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "jingywa"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}