{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NVIDIA RAPIDS in Azure Machine Learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [RAPIDS](https://www.developer.nvidia.com/rapids) suite of software libraries from NVIDIA enables the execution of end-to-end data science and analytics pipelines entirely on GPUs. In many machine learning projects, a significant portion of the model training time is spent in setting up the data; this stage of the process is known as Extraction, Transformation and Loading, or ETL. By using the DataFrame API for ETL\u00c2\u00a0and GPU-capable ML algorithms in RAPIDS, data preparation and training models can be done in GPU-accelerated end-to-end pipelines without incurring serialization costs between the pipeline stages. This notebook demonstrates how to use NVIDIA RAPIDS to prepare data and train model\u00c3\u201a\u00c2\u00a0in Azure.\n",
|
|
" \n",
|
|
"In this notebook, we will do the following:\n",
|
|
" \n",
|
|
"* Create an Azure Machine Learning Workspace\n",
|
|
"* Create an AMLCompute target\n",
|
|
"* Use a script to process our data and train a model\n",
|
|
"* Obtain the data required to run this sample\n",
|
|
"* Create an AML run configuration to launch a machine learning job\n",
|
|
"* Run the script to prepare data for training and train the model\n",
|
|
" \n",
|
|
"Prerequisites:\n",
|
|
"* An Azure subscription to create a Machine Learning Workspace\n",
|
|
"* Familiarity with the Azure ML SDK (refer to [notebook samples](https://github.com/Azure/MachineLearningNotebooks))\n",
|
|
"* A Jupyter notebook environment with Azure Machine Learning SDK installed. Refer to instructions to [setup the environment](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-environment#local)"
|
|
]
|
|
},
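{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the end-to-end GPU pipeline idea concrete, here is a minimal sketch of the pattern the training script follows on the remote GPU target: ETL with cuDF, then training with XGBoost's GPU support, without copying data back to the host in between. The file name, column names, and feature below are placeholders (not the actual mortgage schema processed by `process_data.py`), and the snippet assumes a machine with RAPIDS (`cudf`) and GPU-enabled `xgboost` installed:\n",
"\n",
"```python\n",
"import cudf\n",
"import xgboost as xgb\n",
"\n",
"# ETL on the GPU: read and transform the data without leaving device memory\n",
"gdf = cudf.read_csv('loans.csv')  # placeholder input file\n",
"gdf = gdf.dropna()\n",
"gdf['ratio'] = gdf['balance'] / gdf['income']  # placeholder feature\n",
"\n",
"# hand the GPU DataFrame straight to XGBoost; no host round-trip\n",
"dtrain = xgb.DMatrix(gdf.drop(columns=['label']), label=gdf['label'])\n",
"params = {'tree_method': 'gpu_hist', 'objective': 'binary:logistic'}\n",
"model = xgb.train(params, dtrain, num_boost_round=100)\n",
"```"
]
},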
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Verify that the Azure ML SDK is installed"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.core\n",
"print(\"SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from azureml.core import Workspace, Experiment\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"from azureml.core.compute import AmlCompute, ComputeTarget\n",
"from azureml.data.data_reference import DataReference\n",
"from azureml.core.runconfig import RunConfiguration\n",
"from azureml.core import ScriptRunConfig\n",
"from azureml.widgets import RunDetails"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Azure ML Workspace"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following step is optional if you already have a workspace. To use an existing workspace,\n",
"skip this workspace creation step and move on to the next step to load the workspace.\n",
"\n",
"<font color='red'>Important</font>: in the code cell below, be sure to set the correct values for subscription_id,\n",
"resource_group, workspace_name, and workspace_region before executing it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"subscription_id = os.environ.get(\"SUBSCRIPTION_ID\", \"<subscription_id>\")\n",
"resource_group = os.environ.get(\"RESOURCE_GROUP\", \"<resource_group>\")\n",
"workspace_name = os.environ.get(\"WORKSPACE_NAME\", \"<workspace_name>\")\n",
"workspace_region = os.environ.get(\"WORKSPACE_REGION\", \"<region>\")\n",
"\n",
"ws = Workspace.create(workspace_name, subscription_id=subscription_id, resource_group=resource_group, location=workspace_region)\n",
"\n",
"# write config to a local directory for future use\n",
"ws.write_config()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load existing Workspace"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# if a locally-saved configuration file for the workspace is not available, use the following to load the workspace\n",
"# ws = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name)\n",
"\n",
"print('Workspace name: ' + ws.name,\n",
"      'Azure region: ' + ws.location,\n",
"      'Subscription id: ' + ws.subscription_id,\n",
"      'Resource group: ' + ws.resource_group, sep='\\n')\n",
"\n",
"scripts_folder = \"scripts_folder\"\n",
"\n",
"if not os.path.isdir(scripts_folder):\n",
"    os.mkdir(scripts_folder)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create AML Compute Target"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because NVIDIA RAPIDS requires P40 or V100 GPUs, you need to choose a compute target from one of the [NC_v3](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ncv3-series), [NC_v2](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ncv2-series), [ND](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#nd-series) or [ND_v2](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ndv2-series-preview) virtual machine types in Azure; these are the VM families in Azure that are provisioned with these GPUs.\n",
"\n",
"Pick one of the supported VM SKUs based on the number of GPUs you want to use for ETL and training in RAPIDS.\n",
"\n",
"The script in this notebook is implemented for single-machine scenarios. An example supporting multiple nodes will be published later."
]
},
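{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the NCv3 SKUs provide the following V100 counts, which is a convenient way to pick the `vm_size` used in the cell below:\n",
"\n",
"```python\n",
"# number of V100 GPUs per NCv3 SKU\n",
"ncv3_gpu_counts = {'Standard_NC6s_v3': 1, 'Standard_NC12s_v3': 2, 'Standard_NC24s_v3': 4}\n",
"```"
]
},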
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gpu_cluster_name = \"gpucluster\"\n",
"\n",
"if gpu_cluster_name in ws.compute_targets:\n",
"    gpu_cluster = ws.compute_targets[gpu_cluster_name]\n",
"    if gpu_cluster and isinstance(gpu_cluster, AmlCompute):\n",
"        print('Found compute target. Will use {0}'.format(gpu_cluster_name))\n",
"else:\n",
"    print(\"creating new cluster\")\n",
"    # vm_size parameter below could be modified to one of the RAPIDS-supported VM types\n",
"    provisioning_config = AmlCompute.provisioning_configuration(vm_size=\"Standard_NC6s_v3\", min_nodes=1, max_nodes=1)\n",
"\n",
"    # create the cluster\n",
"    gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, provisioning_config)\n",
"    gpu_cluster.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Script to process data and train model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# copy process_data.py into the script folder\n",
"import shutil\n",
"shutil.copy('./process_data.py', os.path.join(scripts_folder, 'process_data.py'))"
]
},
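{
"cell_type": "markdown",
"metadata": {},
"source": [
"`process_data.py` is invoked later with the arguments `--num_gpu`, `--data_dir`, `--part_count`, `--end_year`, and `--cpu_predictor` (see the `ScriptRunConfig` further below). As a rough sketch of that interface only, a script like this would typically parse them along these lines; the actual script ships with this sample and may differ in details and defaults:\n",
"\n",
"```python\n",
"import argparse\n",
"\n",
"parser = argparse.ArgumentParser()\n",
"parser.add_argument('--num_gpu', type=int, default=1)       # GPUs to use for ETL (and training)\n",
"parser.add_argument('--data_dir', type=str)                 # mounted datastore path with acq/, perf/, names.csv\n",
"parser.add_argument('--part_count', type=int, default=1)    # number of data partitions to process\n",
"parser.add_argument('--end_year', type=int, default=2000)   # last year of data to include\n",
"parser.add_argument('--cpu_predictor', type=str, default='False')  # train on CPU if 'True'\n",
"args = parser.parse_args()\n",
"```"
]
},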
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data required to run this sample"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This sample uses [Fannie Mae's Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html). Once you obtain access to the data, you will need to make it available in an [Azure Machine Learning Datastore](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data) for use in this sample. The following code shows how to do that."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Downloading Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tarfile\n",
"import hashlib\n",
"from urllib.request import urlretrieve\n",
"from progressbar import ProgressBar  # from the 'progressbar' package, used by show_progress below\n",
"\n",
"def validate_downloaded_data(path):\n",
"    if os.path.isdir(path) and os.path.exists(os.path.join(path, 'names.csv')):\n",
"        if os.path.isdir(os.path.join(path, 'acq')) and len(os.listdir(os.path.join(path, 'acq'))) == 8:\n",
"            if os.path.isdir(os.path.join(path, 'perf')) and len(os.listdir(os.path.join(path, 'perf'))) == 11:\n",
"                print(\"Data has been downloaded and decompressed at: {0}\".format(path))\n",
"                return True\n",
"    print(\"Data has not been downloaded and decompressed\")\n",
"    return False\n",
"\n",
"def show_progress(count, block_size, total_size):\n",
"    global pbar\n",
"    global processed\n",
"\n",
"    if count == 0:\n",
"        pbar = ProgressBar(maxval=total_size)\n",
"        processed = 0\n",
"\n",
"    processed += block_size\n",
"    processed = min(processed, total_size)\n",
"    pbar.update(processed)\n",
"\n",
"\n",
"def download_file(fileroot):\n",
"    filename = fileroot + '.tgz'\n",
"    if not os.path.exists(filename) or hashlib.md5(open(filename, 'rb').read()).hexdigest() != '82dd47135053303e9526c2d5c43befd5':\n",
"        url_format = 'http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/{0}.tgz'\n",
"        url = url_format.format(fileroot)\n",
"        print(\"...Downloading file: {0}\".format(filename))\n",
"        # pass show_progress as the reporthook so pbar is created before finish() is called\n",
"        urlretrieve(url, filename, reporthook=show_progress)\n",
"        pbar.finish()\n",
"        print(\"...File {0} finished downloading\".format(filename))\n",
"    else:\n",
"        print(\"...File {0} has been downloaded already\".format(filename))\n",
"    return filename\n",
"\n",
"def decompress_file(filename, path):\n",
"    tar = tarfile.open(filename)\n",
"    print(\"...Getting information from {0} about files to decompress\".format(filename))\n",
"    members = tar.getmembers()\n",
"    numFiles = len(members)\n",
"    for member_info in members:\n",
"        tar.extract(member_info, path=path)\n",
"    print(\"...All {0} files have been decompressed\".format(numFiles))\n",
"    tar.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fileroot = 'mortgage_2000-2001'\n",
"path = os.path.join('.', fileroot)\n",
"pbar = None\n",
"processed = 0\n",
"\n",
"if not validate_downloaded_data(path):\n",
"    print(\"Downloading and Decompressing Input Data\")\n",
"    filename = download_file(fileroot)\n",
"    decompress_file(filename, path)\n",
"    print(\"Input Data has been Downloaded and Decompressed\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Uploading Data to Workspace"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds = ws.get_default_datastore()\n",
"\n",
"# download and uncompress data in a local directory before uploading to the data store\n",
"# the directory specified in the src_dir parameter below should contain the acq and perf data directories and the names.csv file\n",
"\n",
"# ---->>>> UNCOMMENT THE LINE BELOW TO UPLOAD YOUR DATA IF YOU HAVE NOT DONE SO ALREADY <<<<----\n",
"# ds.upload(src_dir=path, target_path=fileroot, overwrite=True, show_progress=True)\n",
"\n",
"# reference the data already uploaded to the datastore\n",
"data_ref = DataReference(data_reference_name='data', datastore=ds, path_on_datastore=fileroot)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create AML run configuration to launch a machine learning job"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A RunConfiguration is used to submit jobs to the Azure Machine Learning service. When creating a RunConfiguration for a job, you can either:\n",
"1. specify a Docker image with a prebuilt conda environment and use it without any modifications to run the job, or\n",
"2. specify a Docker image as the base image and conda or pip packages as dependencies, letting AML build a new Docker image with a conda environment containing the specified dependencies to use in the job\n",
"\n",
"The second option is the one recommended in AML.\n",
"The following steps include code for both options; pick whichever better fits your requirements."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Specify prebuilt conda environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following code shows how to install RAPIDS using conda. The `rapids.yml` file contains the list of packages necessary to run this tutorial. **NOTE:** The initial build of the image might take up to 20 minutes as the service needs to build and cache the new image; once the image is built, subsequent runs use the cached image and the overhead is minimal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cd = CondaDependencies(conda_dependencies_file_path='rapids.yml')\n",
"run_config = RunConfiguration(conda_dependencies=cd)\n",
"run_config.framework = 'python'\n",
"run_config.target = gpu_cluster_name\n",
"run_config.environment.docker.enabled = True\n",
"run_config.environment.docker.gpu_support = True\n",
"run_config.environment.docker.base_image = \"mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04\"\n",
"run_config.environment.spark.precache_packages = False\n",
"run_config.data_references = {'data': data_ref.to_config()}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using Docker"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, you can specify a RAPIDS Docker image."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# run_config = RunConfiguration()\n",
"# run_config.framework = 'python'\n",
"# run_config.environment.python.user_managed_dependencies = True\n",
"# run_config.environment.python.interpreter_path = '/conda/envs/rapids/bin/python'\n",
"# run_config.target = gpu_cluster_name\n",
"# run_config.environment.docker.enabled = True\n",
"# run_config.environment.docker.gpu_support = True\n",
"# run_config.environment.docker.base_image = \"rapidsai/rapidsai:cuda9.2-runtime-ubuntu20.04\"\n",
"# # run_config.environment.docker.base_image_registry.address = '<registry_url>' # not required if the base_image is on Docker Hub\n",
"# # run_config.environment.docker.base_image_registry.username = '<user_name>' # needed only for private images\n",
"# # run_config.environment.docker.base_image_registry.password = '<password>' # needed only for private images\n",
"# run_config.environment.spark.precache_packages = False\n",
"# run_config.data_references = {'data': data_ref.to_config()}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Wrapper function to submit Azure Machine Learning experiment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# parameter cpu_training indicates whether training should be done on CPU; if True, GPUs are used *only* for ETL and *not* for training\n",
"# parameter gpu_count indicates the number of GPUs (among those available in the VM) to use for ETL and, if cpu_training is False, for training as well\n",
"def run_rapids_experiment(cpu_training, gpu_count, part_count):\n",
"    # any value between 1 and 4 is allowed here, depending on the type of VMs available in gpu_cluster\n",
"    if gpu_count not in [1, 2, 3, 4]:\n",
"        raise ValueError('Invalid number of GPUs specified: {0}'.format(gpu_count))\n",
"\n",
"    # the following data partition mapping is empirical (specific to the GPUs used and the current data partitioning scheme) and may need to be tweaked\n",
"    max_gpu_count_data_partition_mapping = {1: 3, 2: 4, 3: 6, 4: 8}\n",
"\n",
"    if part_count > max_gpu_count_data_partition_mapping[gpu_count]:\n",
"        print(\"Too many partitions for the number of GPUs, exceeding memory threshold\")\n",
"\n",
"    if part_count > 11:\n",
"        print(\"Warning: Maximum number of partitions available is 11\")\n",
"        part_count = 11\n",
"\n",
"    end_year = 2000\n",
"\n",
"    if part_count > 4:\n",
"        end_year = 2001  # use more data with more GPUs\n",
"\n",
"    src = ScriptRunConfig(source_directory=scripts_folder,\n",
"                          script='process_data.py',\n",
"                          arguments=['--num_gpu', gpu_count, '--data_dir', str(data_ref),\n",
"                                     '--part_count', part_count, '--end_year', end_year,\n",
"                                     '--cpu_predictor', cpu_training],\n",
"                          run_config=run_config)\n",
"\n",
"    exp = Experiment(ws, 'rapidstest')\n",
"    run = exp.submit(config=src)\n",
"    RunDetails(run).show()\n",
"    return run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit experiment (ETL & training on GPU)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cpu_predictor = False\n",
"# the value for num_gpu should be less than or equal to the number of GPUs available in the VM\n",
"num_gpu = 1\n",
"data_part_count = 1\n",
"# use GPU for both ETL and training\n",
"run = run_rapids_experiment(cpu_predictor, num_gpu, data_part_count)"
]
},
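{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, block until the submitted run finishes and stream its logs into the notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# wait for the run to complete, streaming its logs to this notebook\n",
"run.wait_for_completion(show_output=True)"
]
},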
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit experiment (ETL on GPU, training on CPU)\n",
"\n",
"To observe the performance difference between GPU-accelerated, RAPIDS-based training and CPU-only training, set `cpu_predictor` to `True` and rerun the experiment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cpu_predictor = True\n",
"# the value for num_gpu should be less than or equal to the number of GPUs available in the VM\n",
"num_gpu = 1\n",
"data_part_count = 1\n",
"# train on CPU; use GPU only for ETL\n",
"run = run_rapids_experiment(cpu_predictor, num_gpu, data_part_count)"
]
},
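{
"cell_type": "markdown",
"metadata": {},
"source": [
"A simple way to compare the GPU and CPU runs is their wall-clock duration once both have completed. The sketch below assumes the details dictionary returned by `run.get_details()` includes `startTimeUtc` and `endTimeUtc` timestamps; record the value for each run to compare them:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"def run_duration(run):\n",
"    # parse the start/end timestamps reported by the service for a completed run\n",
"    details = run.get_details()\n",
"    fmt = '%Y-%m-%dT%H:%M:%S.%fZ'\n",
"    start = datetime.strptime(details['startTimeUtc'], fmt)\n",
"    end = datetime.strptime(details['endTimeUtc'], fmt)\n",
"    return end - start\n",
"\n",
"print('Run duration:', run_duration(run))"
]
},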
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Delete cluster"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# uncomment the line below to delete the cluster when you no longer need it\n",
"# gpu_cluster.delete()"
]
}
],
"metadata": {
"authors": [
{
"name": "ksivas"
}
],
"kernelspec": {
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}