update RAPIDS sample

2025-12-19 17:17:04 -05:00 · 2019-03-18 11:40:43 -04:00
parent 6b444b1467
commit 7b41675355
1 changed files with 407 additions and 407 deletions
--- a/contrib/RAPIDS/azure-ml-with-nvidia-rapids.ipynb
+++ b/contrib/RAPIDS/azure-ml-with-nvidia-rapids.ipynb
@@ -1,409 +1,409 @@
 {
-  "cells": [
+ "cells": [
-    {
+  {
-      "cell_type": "markdown",
+   "cell_type": "markdown",
-      "metadata": {},
+   "metadata": {},
-      "source": [
+   "source": [
-        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
+    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
-        "\n",
+    "\n",
-        "Licensed under the MIT License."
+    "Licensed under the MIT License."
-      ]
+   ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# NVIDIA RAPIDS in Azure Machine Learning"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The [RAPIDS](https://www.developer.nvidia.com/rapids) suite of software libraries from NVIDIA enables the execution of end-to-end data science and analytics pipelines entirely on GPUs. In many machine learning projects, a significant portion of the model training time is spent in setting up the data; this stage of the process is known as Extraction, Transformation and Loading, or ETL. By using the DataFrame API for ETL\u00c2\u00a0and GPU-capable ML algorithms in RAPIDS, data preparation and training models can be done in GPU-accelerated end-to-end pipelines without incurring serialization costs between the pipeline stages. This notebook demonstrates how to use NVIDIA RAPIDS to prepare data and train model\u00c2\u00a0in Azure.\n",
        " \n",
        "In this notebook, we will do the following:\n",
        " \n",
        "* Create an Azure Machine Learning Workspace\n",
        "* Create an AMLCompute target\n",
        "* Use a script to process our data and train a model\n",
        "* Obtain the data required to run this sample\n",
        "* Create an AML run configuration to launch a machine learning job\n",
        "* Run the script to prepare data for training and train the model\n",
        " \n",
        "Prerequisites:\n",
        "* An Azure subscription to create a Machine Learning Workspace\n",
        "* Familiarity with the Azure ML SDK (refer to [notebook samples](https://github.com/Azure/MachineLearningNotebooks))\n",
        "* A Jupyter notebook environment with Azure Machine Learning SDK installed. Refer to instructions to [setup the environment](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-environment#local)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Verify if Azure ML SDK is installed"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import azureml.core\n",
        "print(\"SDK version:\", azureml.core.VERSION)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "from azureml.core import Workspace, Experiment\n",
        "from azureml.core.compute import AmlCompute, ComputeTarget\n",
        "from azureml.data.data_reference import DataReference\n",
        "from azureml.core.runconfig import RunConfiguration\n",
        "from azureml.core import ScriptRunConfig\n",
        "from azureml.widgets import RunDetails"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create Azure ML Workspace"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The following step is optional if you already have a workspace. If you want to use an existing workspace, then\n",
        "skip this workspace creation step and move on to the next step to load the workspace.\n",
        " \n",
        "<font color='red'>Important</font>: in the code cell below, be sure to set the correct values for the subscription_id, \n",
        "resource_group, workspace_name, region before executing this code cell."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "subscription_id = os.environ.get(\"SUBSCRIPTION_ID\", \"<subscription_id>\")\n",
        "resource_group = os.environ.get(\"RESOURCE_GROUP\", \"<resource_group>\")\n",
        "workspace_name = os.environ.get(\"WORKSPACE_NAME\", \"<workspace_name>\")\n",
        "workspace_region = os.environ.get(\"WORKSPACE_REGION\", \"<region>\")\n",
        "\n",
        "ws = Workspace.create(workspace_name, subscription_id=subscription_id, resource_group=resource_group, location=workspace_region)\n",
        "\n",
        "# write config to a local directory for future use\n",
        "ws.write_config()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Load existing Workspace"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "# if a locally-saved configuration file for the workspace is not available, use the following to load workspace\n",
        "# ws = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name)\n",
        "print('Workspace name: ' + ws.name, \n",
        "      'Azure region: ' + ws.location, \n",
        "      'Subscription id: ' + ws.subscription_id, \n",
        "      'Resource group: ' + ws.resource_group, sep = '\\n')\n",
        "\n",
        "scripts_folder = \"scripts_folder\"\n",
        "\n",
        "if not os.path.isdir(scripts_folder):\n",
        "    os.mkdir(scripts_folder)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create AML Compute Target"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Because NVIDIA RAPIDS requires P40 or V100 GPUs, the user needs to specify compute targets from one of [NC_v3](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ncv3-series), [NC_v2](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ncv2-series), [ND](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#nd-series) or [ND_v2](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ndv2-series-preview) virtual machine types in Azure; these are the families of virtual machines in Azure that are provisioned with these GPUs.\n",
        " \n",
        "Pick one of the supported VM SKUs based on the number of GPUs you want to use for ETL and training in RAPIDS.\n",
        " \n",
        "The script in this notebook is implemented for single-machine scenarios. An example supporting multiple nodes will be published later."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "gpu_cluster_name = \"gpucluster\"\n",
        "\n",
        "if gpu_cluster_name in ws.compute_targets:\n",
        "    gpu_cluster = ws.compute_targets[gpu_cluster_name]\n",
        "    if gpu_cluster and type(gpu_cluster) is AmlCompute:\n",
        "        print('found compute target. just use it. ' + gpu_cluster_name)\n",
        "else:\n",
        "    print(\"creating new cluster\")\n",
        "    # vm_size parameter below could be modified to one of the RAPIDS-supported VM types\n",
        "    provisioning_config = AmlCompute.provisioning_configuration(vm_size = \"Standard_NC6s_v2\", min_nodes=1, max_nodes = 1)\n",
        "\n",
        "    # create the cluster\n",
        "    gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, provisioning_config)\n",
        "    gpu_cluster.wait_for_completion(show_output=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Script to process data and train model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The _process&#95;data.py_ script used in the step below is a slightly modified implementation of [RAPIDS E2E example](https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# copy process_data.py into the script folder\n",
        "import shutil\n",
        "shutil.copy('./process_data.py', os.path.join(scripts_folder, 'process_data.py'))\n",
        "\n",
        "with open(os.path.join(scripts_folder, './process_data.py'), 'r') as process_data_script:\n",
        "    print(process_data_script.read())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Data required to run this sample"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This sample uses [Fannie Mae\u00e2\u20ac\u2122s Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html). Refer to the 'Available mortgage datasets' section in [instructions](https://rapidsai.github.io/demos/datasets/mortgage-data) to get sample data.\n",
        "\n",
        "Once you obtain access to the data, you will need to make this data available in an [Azure Machine Learning Datastore](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data), for use in this sample."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<font color='red'>Important</font>: The following step assumes the data is uploaded to the Workspace's default data store under a folder named 'mortgagedata2000_01'. Note that uploading data to the Workspace's default data store is not necessary and the data can be referenced from any datastore, e.g., from Azure Blob or File service, once it is added as a datastore to the workspace. The path_on_datastore parameter needs to be updated, depending on where the data is available.  The directory where the data is available should have the following folder structure, as the process_data.py script expects this directory structure:\n",
        "* _&lt;data directory>_/acq\n",
        "* _&lt;data directory>_/perf\n",
        "* _names.csv_\n",
        "\n",
        "The 'acq' and 'perf' refer to directories containing data files. The _&lt;data directory>_ is the path specified in _path&#95;on&#95;datastore_ parameter in the step below."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ds = ws.get_default_datastore()\n",
        "\n",
        "# download and uncompress data in a local directory before uploading to data store\n",
        "# directory specified in src_dir parameter below should have the acq, perf directories with data and names.csv file\n",
        "# ds.upload(src_dir='<local directory that has data>', target_path='mortgagedata2000_01', overwrite=True, show_progress=True)\n",
        "\n",
        "# data already uploaded to the datastore\n",
        "data_ref = DataReference(data_reference_name='data', datastore=ds, path_on_datastore='mortgagedata2000_01')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create AML run configuration to launch a machine learning job"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "AML allows the option of using existing Docker images with prebuilt conda environments. The following step use an existing image from [Docker Hub](https://hub.docker.com/r/rapidsai/rapidsai/)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run_config = RunConfiguration()\n",
        "run_config.framework = 'python'\n",
        "run_config.environment.python.user_managed_dependencies = True\n",
        "# use conda environment named 'rapids' available in the Docker image\n",
        "# this conda environment does not include azureml-defaults package that is required for using AML functionality like metrics tracking, model management etc.\n",
        "run_config.environment.python.interpreter_path = '/conda/envs/rapids/bin/python'\n",
        "run_config.target = gpu_cluster_name\n",
        "run_config.environment.docker.enabled = True\n",
        "run_config.environment.docker.gpu_support = True\n",
        "# if registry is not mentioned the image is pulled from Docker Hub\n",
        "run_config.environment.docker.base_image = \"rapidsai/rapidsai:cuda9.2_ubuntu16.04_root\"\n",
        "run_config.environment.spark.precache_packages = False\n",
        "run_config.data_references={'data':data_ref.to_config()}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Wrapper function to submit Azure Machine Learning experiment"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# parameter cpu_predictor indicates if training should be done on CPU. If set to true, GPUs are used *only* for ETL and *not* for training\n",
        "# parameter num_gpu indicates number of GPUs to use among the GPUs available in the VM for ETL and if cpu_predictor is false, for training as well \n",
        "def run_rapids_experiment(cpu_training, gpu_count):\n",
        "    # any value between 1-4 is allowed here depending the type of VMs available in gpu_cluster\n",
        "    if gpu_count not in [1, 2, 3, 4]:\n",
        "        raise Exception('Value specified for the number of GPUs to use {0} is invalid'.format(gpu_count))\n",
        "\n",
        "    # following data partition mapping is empirical (specific to GPUs used and current data partitioning scheme) and may need to be tweaked\n",
        "    gpu_count_data_partition_mapping = {1: 2, 2: 4, 3: 5, 4: 7}\n",
        "    part_count = gpu_count_data_partition_mapping[gpu_count]\n",
        "\n",
        "    end_year = 2000\n",
        "    if gpu_count > 2:\n",
        "        end_year = 2001 # use more data with more GPUs\n",
        "\n",
        "    src = ScriptRunConfig(source_directory=scripts_folder, \n",
        "                          script='process_data.py', \n",
        "                          arguments = ['--num_gpu', gpu_count, '--data_dir', str(data_ref),\n",
        "                                      '--part_count', part_count, '--end_year', end_year,\n",
        "                                      '--cpu_predictor', cpu_training\n",
        "                                      ],\n",
        "                          run_config=run_config\n",
        "                         )\n",
        "\n",
        "    exp = Experiment(ws, 'rapidstest')\n",
        "    run = exp.submit(config=src)\n",
        "    RunDetails(run).show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Submit experiment (ETL & training on GPU)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "cpu_predictor = False\n",
        "# the value for num_gpu should be less than or equal to the number of GPUs available in the VM\n",
        "num_gpu = 1 \n",
        "# train using CPU, use GPU for both ETL and training\n",
        "run_rapids_experiment(cpu_predictor, num_gpu)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Submit experiment (ETL on GPU, training on CPU)\n",
        "\n",
        "To observe performance difference between GPU-accelerated RAPIDS based training with CPU-only training, set 'cpu_predictor' predictor to 'True' and rerun the experiment"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "cpu_predictor = True\n",
        "# the value for num_gpu should be less than or equal to the number of GPUs available in the VM\n",
        "num_gpu = 1\n",
        "# train using CPU, use GPU for ETL\n",
        "run_rapids_experiment(cpu_predictor, num_gpu)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Delete cluster"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# delete the cluster\n",
        "# gpu_cluster.delete()"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "ksivas"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.6"
    }
  },
-  "nbformat": 4,
+  {
-  "nbformat_minor": 2
+   "cell_type": "markdown",
-}
+   "metadata": {},
   "source": [
    "# NVIDIA RAPIDS in Azure Machine Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [RAPIDS](https://www.developer.nvidia.com/rapids) suite of software libraries from NVIDIA enables the execution of end-to-end data science and analytics pipelines entirely on GPUs. In many machine learning projects, a significant portion of the model training time is spent in setting up the data; this stage of the process is known as Extraction, Transformation and Loading, or ETL. By using the DataFrame API for ETL and GPU-capable ML algorithms in RAPIDS, data preparation and training models can be done in GPU-accelerated end-to-end pipelines without incurring serialization costs between the pipeline stages. This notebook demonstrates how to use NVIDIA RAPIDS to prepare data and train model in Azure.\n",
    " \n",
    "In this notebook, we will do the following:\n",
    " \n",
    "* Create an Azure Machine Learning Workspace\n",
    "* Create an AMLCompute target\n",
    "* Use a script to process our data and train a model\n",
    "* Obtain the data required to run this sample\n",
    "* Create an AML run configuration to launch a machine learning job\n",
    "* Run the script to prepare data for training and train the model\n",
    " \n",
    "Prerequisites:\n",
    "* An Azure subscription to create a Machine Learning Workspace\n",
    "* Familiarity with the Azure ML SDK (refer to [notebook samples](https://github.com/Azure/MachineLearningNotebooks))\n",
    "* A Jupyter notebook environment with Azure Machine Learning SDK installed. Refer to instructions to [setup the environment](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-environment#local)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Verify if Azure ML SDK is installed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import azureml.core\n",
    "print(\"SDK version:\", azureml.core.VERSION)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from azureml.core import Workspace, Experiment\n",
    "from azureml.core.compute import AmlCompute, ComputeTarget\n",
    "from azureml.data.data_reference import DataReference\n",
    "from azureml.core.runconfig import RunConfiguration\n",
    "from azureml.core import ScriptRunConfig\n",
    "from azureml.widgets import RunDetails"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Azure ML Workspace"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following step is optional if you already have a workspace. If you want to use an existing workspace, then\n",
    "skip this workspace creation step and move on to the next step to load the workspace.\n",
    " \n",
    "<font color='red'>Important</font>: in the code cell below, be sure to set the correct values for the subscription_id, \n",
    "resource_group, workspace_name, region before executing this code cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "subscription_id = os.environ.get(\"SUBSCRIPTION_ID\", \"<subscription_id>\")\n",
    "resource_group = os.environ.get(\"RESOURCE_GROUP\", \"<resource_group>\")\n",
    "workspace_name = os.environ.get(\"WORKSPACE_NAME\", \"<workspace_name>\")\n",
    "workspace_region = os.environ.get(\"WORKSPACE_REGION\", \"<region>\")\n",
    "\n",
    "ws = Workspace.create(workspace_name, subscription_id=subscription_id, resource_group=resource_group, location=workspace_region)\n",
    "\n",
    "# write config to a local directory for future use\n",
    "ws.write_config()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load existing Workspace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ws = Workspace.from_config()\n",
    "# if a locally-saved configuration file for the workspace is not available, use the following to load workspace\n",
    "# ws = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name)\n",
    "print('Workspace name: ' + ws.name, \n",
    "      'Azure region: ' + ws.location, \n",
    "      'Subscription id: ' + ws.subscription_id, \n",
    "      'Resource group: ' + ws.resource_group, sep = '\\n')\n",
    "\n",
    "scripts_folder = \"scripts_folder\"\n",
    "\n",
    "if not os.path.isdir(scripts_folder):\n",
    "    os.mkdir(scripts_folder)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create AML Compute Target"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because NVIDIA RAPIDS requires P40 or V100 GPUs, the user needs to specify compute targets from one of [NC_v3](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ncv3-series), [NC_v2](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ncv2-series), [ND](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#nd-series) or [ND_v2](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ndv2-series-preview) virtual machine types in Azure; these are the families of virtual machines in Azure that are provisioned with these GPUs.\n",
    " \n",
    "Pick one of the supported VM SKUs based on the number of GPUs you want to use for ETL and training in RAPIDS.\n",
    " \n",
    "The script in this notebook is implemented for single-machine scenarios. An example supporting multiple nodes will be published later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gpu_cluster_name = \"gpucluster\"\n",
    "\n",
    "if gpu_cluster_name in ws.compute_targets:\n",
    "    gpu_cluster = ws.compute_targets[gpu_cluster_name]\n",
    "    if gpu_cluster and type(gpu_cluster) is AmlCompute:\n",
    "        print('found compute target. just use it. ' + gpu_cluster_name)\n",
    "else:\n",
    "    print(\"creating new cluster\")\n",
    "    # vm_size parameter below could be modified to one of the RAPIDS-supported VM types\n",
    "    provisioning_config = AmlCompute.provisioning_configuration(vm_size = \"Standard_NC6s_v2\", min_nodes=1, max_nodes = 1)\n",
    "\n",
    "    # create the cluster\n",
    "    gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, provisioning_config)\n",
    "    gpu_cluster.wait_for_completion(show_output=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Script to process data and train model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The _process&#95;data.py_ script used in the step below is a slightly modified implementation of [RAPIDS E2E example](https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# copy process_data.py into the script folder\n",
    "import shutil\n",
    "shutil.copy('./process_data.py', os.path.join(scripts_folder, 'process_data.py'))\n",
    "\n",
    "with open(os.path.join(scripts_folder, './process_data.py'), 'r') as process_data_script:\n",
    "    print(process_data_script.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data required to run this sample"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This sample uses [Fannie Mae’s Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html). Refer to the 'Available mortgage datasets' section in [instructions](https://rapidsai.github.io/demos/datasets/mortgage-data) to get sample data.\n",
    "\n",
    "Once you obtain access to the data, you will need to make this data available in an [Azure Machine Learning Datastore](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data), for use in this sample."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color='red'>Important</font>: The following step assumes the data is uploaded to the Workspace's default data store under a folder named 'mortgagedata2000_01'. Note that uploading data to the Workspace's default data store is not necessary and the data can be referenced from any datastore, e.g., from Azure Blob or File service, once it is added as a datastore to the workspace. The path_on_datastore parameter needs to be updated, depending on where the data is available.  The directory where the data is available should have the following folder structure, as the process_data.py script expects this directory structure:\n",
    "* _&lt;data directory>_/acq\n",
    "* _&lt;data directory>_/perf\n",
    "* _names.csv_\n",
    "\n",
    "The 'acq' and 'perf' refer to directories containing data files. The _&lt;data directory>_ is the path specified in _path&#95;on&#95;datastore_ parameter in the step below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds = ws.get_default_datastore()\n",
    "\n",
    "# download and uncompress data in a local directory before uploading to data store\n",
    "# directory specified in src_dir parameter below should have the acq, perf directories with data and names.csv file\n",
    "# ds.upload(src_dir='<local directory that has data>', target_path='mortgagedata2000_01', overwrite=True, show_progress=True)\n",
    "\n",
    "# data already uploaded to the datastore\n",
    "data_ref = DataReference(data_reference_name='data', datastore=ds, path_on_datastore='mortgagedata2000_01')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create AML run configuration to launch a machine learning job"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "AML allows the option of using existing Docker images with prebuilt conda environments. The following step use an existing image from [Docker Hub](https://hub.docker.com/r/rapidsai/rapidsai/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "run_config = RunConfiguration()\n",
    "run_config.framework = 'python'\n",
    "run_config.environment.python.user_managed_dependencies = True\n",
    "# use conda environment named 'rapids' available in the Docker image\n",
    "# this conda environment does not include azureml-defaults package that is required for using AML functionality like metrics tracking, model management etc.\n",
    "run_config.environment.python.interpreter_path = '/conda/envs/rapids/bin/python'\n",
    "run_config.target = gpu_cluster_name\n",
    "run_config.environment.docker.enabled = True\n",
    "run_config.environment.docker.gpu_support = True\n",
    "# if registry is not mentioned the image is pulled from Docker Hub\n",
    "run_config.environment.docker.base_image = \"rapidsai/rapidsai:cuda9.2_ubuntu16.04_root\"\n",
    "run_config.environment.spark.precache_packages = False\n",
    "run_config.data_references={'data':data_ref.to_config()}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Wrapper function to submit Azure Machine Learning experiment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# parameter cpu_predictor indicates if training should be done on CPU. If set to true, GPUs are used *only* for ETL and *not* for training\n",
    "# parameter num_gpu indicates number of GPUs to use among the GPUs available in the VM for ETL and if cpu_predictor is false, for training as well \n",
    "def run_rapids_experiment(cpu_training, gpu_count):\n",
    "    # any value between 1-4 is allowed here depending the type of VMs available in gpu_cluster\n",
    "    if gpu_count not in [1, 2, 3, 4]:\n",
    "        raise Exception('Value specified for the number of GPUs to use {0} is invalid'.format(gpu_count))\n",
    "\n",
    "    # following data partition mapping is empirical (specific to GPUs used and current data partitioning scheme) and may need to be tweaked\n",
    "    gpu_count_data_partition_mapping = {1: 2, 2: 4, 3: 5, 4: 7}\n",
    "    part_count = gpu_count_data_partition_mapping[gpu_count]\n",
    "\n",
    "    end_year = 2000\n",
    "    if gpu_count > 2:\n",
    "        end_year = 2001 # use more data with more GPUs\n",
    "\n",
    "    src = ScriptRunConfig(source_directory=scripts_folder, \n",
    "                          script='process_data.py', \n",
    "                          arguments = ['--num_gpu', gpu_count, '--data_dir', str(data_ref),\n",
    "                                      '--part_count', part_count, '--end_year', end_year,\n",
    "                                      '--cpu_predictor', cpu_training\n",
    "                                      ],\n",
    "                          run_config=run_config\n",
    "                         )\n",
    "\n",
    "    exp = Experiment(ws, 'rapidstest')\n",
    "    run = exp.submit(config=src)\n",
    "    RunDetails(run).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Submit experiment (ETL & training on GPU)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cpu_predictor = False\n",
    "# the value for num_gpu should be less than or equal to the number of GPUs available in the VM\n",
    "num_gpu = 1 \n",
    "# train using CPU, use GPU for both ETL and training\n",
    "run_rapids_experiment(cpu_predictor, num_gpu)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Submit experiment (ETL on GPU, training on CPU)\n",
    "\n",
    "To observe performance difference between GPU-accelerated RAPIDS based training with CPU-only training, set 'cpu_predictor' predictor to 'True' and rerun the experiment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cpu_predictor = True\n",
    "# the value for num_gpu should be less than or equal to the number of GPUs available in the VM\n",
    "num_gpu = 1\n",
    "# train using CPU, use GPU for ETL\n",
    "run_rapids_experiment(cpu_predictor, num_gpu)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Delete cluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# delete the cluster\n",
    "# gpu_cluster.delete()"
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "ksivas"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "python36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }