update RAPIDS 2

rastala
2019-03-18 12:08:10 -04:00
parent 7b41675355
commit 75c393a221
53 changed files with 670 additions and 185 deletions


@@ -20,7 +20,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The [RAPIDS](https://www.developer.nvidia.com/rapids) suite of software libraries from NVIDIA enables the execution of end-to-end data science and analytics pipelines entirely on GPUs. In many machine learning projects, a significant portion of the model training time is spent in setting up the data; this stage of the process is known as Extraction, Transformation and Loading, or ETL. By using the DataFrame API for ETL and GPU-capable ML algorithms in RAPIDS, data preparation and training models can be done in GPU-accelerated end-to-end pipelines without incurring serialization costs between the pipeline stages. This notebook demonstrates how to use NVIDIA RAPIDS to prepare data and train model in Azure.\n",
"The [RAPIDS](https://www.developer.nvidia.com/rapids) suite of software libraries from NVIDIA enables the execution of end-to-end data science and analytics pipelines entirely on GPUs. In many machine learning projects, a significant portion of the model training time is spent in setting up the data; this stage of the process is known as Extraction, Transformation and Loading, or ETL. By using the DataFrame API for ETL and GPU-capable ML algorithms in RAPIDS, data preparation and training models can be done in GPU-accelerated end-to-end pipelines without incurring serialization costs between the pipeline stages. This notebook demonstrates how to use NVIDIA RAPIDS to prepare data and train model in Azure.\n",
" \n",
"In this notebook, we will do the following:\n",
" \n",
@@ -62,6 +62,7 @@
"source": [
"import os\n",
"from azureml.core import Workspace, Experiment\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"from azureml.core.compute import AmlCompute, ComputeTarget\n",
"from azureml.data.data_reference import DataReference\n",
"from azureml.core.runconfig import RunConfiguration\n",
@@ -210,21 +211,107 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This sample uses [Fannie Maes Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html). Refer to the 'Available mortgage datasets' section in [instructions](https://rapidsai.github.io/demos/datasets/mortgage-data) to get sample data.\n",
"\n",
"Once you obtain access to the data, you will need to make this data available in an [Azure Machine Learning Datastore](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data), for use in this sample."
"This sample uses [Fannie Mae's Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html). Once you obtain access to the data, you will need to make this data available in an [Azure Machine Learning Datastore](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data), for use in this sample. The following code shows how to do that."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color='red'>Important</font>: The following step assumes the data is uploaded to the Workspace's default data store under a folder named 'mortgagedata2000_01'. Note that uploading data to the Workspace's default data store is not necessary and the data can be referenced from any datastore, e.g., from Azure Blob or File service, once it is added as a datastore to the workspace. The path_on_datastore parameter needs to be updated, depending on where the data is available. The directory where the data is available should have the following folder structure, as the process_data.py script expects this directory structure:\n",
"* _&lt;data directory>_/acq\n",
"* _&lt;data directory>_/perf\n",
"* _names.csv_\n",
"### Downloading Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color='red'>Important</font>: Python package progressbar2 is necessary to run the following cell. If it is not available in your environment where this notebook is running, please install it."
]
},
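{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following optional cell sketches one way to install the package directly from this notebook. It assumes pip is available in the environment running the kernel; uncomment the line to run it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: install progressbar2 into the current environment (assumes pip is available)\n",
"# !pip install progressbar2"
]
},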
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tarfile\n",
"import hashlib\n",
"from urllib.request import urlretrieve\n",
"from progressbar import ProgressBar\n",
"\n",
"The 'acq' and 'perf' refer to directories containing data files. The _&lt;data directory>_ is the path specified in _path&#95;on&#95;datastore_ parameter in the step below."
"def validate_downloaded_data(path):\n",
" if(os.path.isdir(path) and os.path.exists(path + '//names.csv')) :\n",
" if(os.path.isdir(path + '//acq' ) and len(os.listdir(path + '//acq')) == 8):\n",
" if(os.path.isdir(path + '//perf' ) and len(os.listdir(path + '//perf')) == 11):\n",
" print(\"Data has been downloaded and decompressed at: {0}\".format(path))\n",
" return True\n",
" print(\"Data has not been downloaded and decompressed\")\n",
" return False\n",
"\n",
"def show_progress(count, block_size, total_size):\n",
" global pbar\n",
" global processed\n",
" \n",
" if count == 0:\n",
" pbar = ProgressBar(maxval=total_size)\n",
" processed = 0\n",
" \n",
" processed += block_size\n",
" processed = min(processed,total_size)\n",
" pbar.update(processed)\n",
"\n",
" \n",
"def download_file(fileroot):\n",
" filename = fileroot + '.tgz'\n",
" if(not os.path.exists(filename) or hashlib.md5(open(filename, 'rb').read()).hexdigest() != '82dd47135053303e9526c2d5c43befd5' ):\n",
" url_format = 'http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/{0}.tgz'\n",
" url = url_format.format(fileroot)\n",
" print(\"...Downloading file :{0}\".format(filename))\n",
" urlretrieve(url, filename,show_progress)\n",
" pbar.finish()\n",
" print(\"...File :{0} finished downloading\".format(filename))\n",
" else:\n",
" print(\"...File :{0} has been downloaded already\".format(filename))\n",
" return filename\n",
"\n",
"def decompress_file(filename,path):\n",
" tar = tarfile.open(filename)\n",
" print(\"...Getting information from {0} about files to decompress\".format(filename))\n",
" members = tar.getmembers()\n",
" numFiles = len(members)\n",
" so_far = 0\n",
" for member_info in members:\n",
" tar.extract(member_info,path=path)\n",
" show_progress(so_far, 1, numFiles)\n",
" so_far += 1\n",
" pbar.finish()\n",
" print(\"...All {0} files have been decompressed\".format(numFiles))\n",
" tar.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fileroot = 'mortgage_2000-2001'\n",
"path = '.\\\\{0}'.format(fileroot)\n",
"pbar = None\n",
"processed = 0\n",
"\n",
"if(not validate_downloaded_data(path)):\n",
" print(\"Downloading and Decompressing Input Data\")\n",
" filename = download_file(fileroot)\n",
" decompress_file(filename,path)\n",
" print(\"Input Data has been Downloaded and Decompressed\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Uploading Data to Workspace"
]
},
{
@@ -237,10 +324,10 @@
"\n",
"# download and uncompress data in a local directory before uploading to data store\n",
"# directory specified in src_dir parameter below should have the acq, perf directories with data and names.csv file\n",
"# ds.upload(src_dir='<local directory that has data>', target_path='mortgagedata2000_01', overwrite=True, show_progress=True)\n",
"ds.upload(src_dir=path, target_path=fileroot, overwrite=True, show_progress=True)\n",
"\n",
"# data already uploaded to the datastore\n",
"data_ref = DataReference(data_reference_name='data', datastore=ds, path_on_datastore='mortgagedata2000_01')"
"data_ref = DataReference(data_reference_name='data', datastore=ds, path_on_datastore=fileroot)"
]
},
{
@@ -254,7 +341,26 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"AML allows the option of using existing Docker images with prebuilt conda environments. The following step use an existing image from [Docker Hub](https://hub.docker.com/r/rapidsai/rapidsai/)."
"RunConfiguration is used to submit jobs to Azure Machine Learning service. When creating RunConfiguration for a job, users can either \n",
"1. specify a Docker image with prebuilt conda environment and use it without any modifications to run the job, or \n",
"2. specify a Docker image as the base image and conda or pip packages as dependnecies to let AML build a new Docker image with a conda environment containing specified dependencies to use in the job\n",
"\n",
"The second option is the recommended option in AML. \n",
"The following steps have code for both options. You can pick the one that is more appropriate for your requirements. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Specify prebuilt conda environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following code shows how to use an existing image from [Docker Hub](https://hub.docker.com/r/rapidsai/rapidsai/) that has a prebuilt conda environment named 'rapids' when creating a RunConfiguration. Note that this conda environment does not include azureml-defaults package that is required for using AML functionality like metrics tracking, model management etc. This package is automatically installed when you use 'Specify package dependencies' option and that is why it is the recommended option to create RunConfiguraiton in AML."
]
},
{
@@ -266,18 +372,52 @@
"run_config = RunConfiguration()\n",
"run_config.framework = 'python'\n",
"run_config.environment.python.user_managed_dependencies = True\n",
"# use conda environment named 'rapids' available in the Docker image\n",
"# this conda environment does not include azureml-defaults package that is required for using AML functionality like metrics tracking, model management etc.\n",
"run_config.environment.python.interpreter_path = '/conda/envs/rapids/bin/python'\n",
"run_config.target = gpu_cluster_name\n",
"run_config.environment.docker.enabled = True\n",
"run_config.environment.docker.gpu_support = True\n",
"# if registry is not mentioned the image is pulled from Docker Hub\n",
"run_config.environment.docker.base_image = \"rapidsai/rapidsai:cuda9.2_ubuntu16.04_root\"\n",
"run_config.environment.docker.base_image = \"rapidsai/rapidsai:cuda9.2-runtime-ubuntu18.04\"\n",
"# run_config.environment.docker.base_image_registry.address = '<registry_url>' # not required if the base_image is in Docker hub\n",
"# run_config.environment.docker.base_image_registry.username = '<user_name>' # needed only for private images\n",
"# run_config.environment.docker.base_image_registry.password = '<password>' # needed only for private images\n",
"run_config.environment.spark.precache_packages = False\n",
"run_config.data_references={'data':data_ref.to_config()}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Specify package dependencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following code shows how to list package dependencies in a conda environment definition file (rapids.yml) when creating a RunConfiguration"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# cd = CondaDependencies(conda_dependencies_file_path='rapids.yml')\n",
"# run_config = RunConfiguration(conda_dependencies=cd)\n",
"# run_config.framework = 'python'\n",
"# run_config.target = gpu_cluster_name\n",
"# run_config.environment.docker.enabled = True\n",
"# run_config.environment.docker.gpu_support = True\n",
"# run_config.environment.docker.base_image = \"<image>\"\n",
"# run_config.environment.docker.base_image_registry.address = '<registry_url>' # not required if the base_image is in Docker hub\n",
"# run_config.environment.docker.base_image_registry.username = '<user_name>' # needed only for private images\n",
"# run_config.environment.docker.base_image_registry.password = '<password>' # needed only for private images\n",
"# run_config.environment.spark.precache_packages = False\n",
"# run_config.data_references={'data':data_ref.to_config()}"
]
},
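{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer not to maintain a separate rapids.yml file, the same dependencies can be assembled in code with the CondaDependencies API and then saved to a file. The following commented-out cell is a minimal sketch; the channel and package names ('rapidsai', 'nvidia', 'conda-forge', 'cudf', 'cuml', 'dask-cudf') are illustrative assumptions and should be adjusted to match the RAPIDS release you intend to use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# minimal sketch of building the conda environment definition in code instead of editing rapids.yml by hand\n",
"# channel and package names below are assumptions; adjust them to the RAPIDS release you use\n",
"# cd = CondaDependencies.create(pip_packages=['azureml-defaults'])\n",
"# cd.add_channel('rapidsai')\n",
"# cd.add_channel('nvidia')\n",
"# cd.add_channel('conda-forge')\n",
"# cd.add_conda_package('cudf')\n",
"# cd.add_conda_package('cuml')\n",
"# cd.add_conda_package('dask-cudf')\n",
"# cd.save_to_file(base_directory='.', conda_file_path='rapids.yml')"
]
},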
{
"cell_type": "markdown",
"metadata": {},
@@ -293,17 +433,24 @@
"source": [
"# parameter cpu_predictor indicates if training should be done on CPU. If set to true, GPUs are used *only* for ETL and *not* for training\n",
"# parameter num_gpu indicates number of GPUs to use among the GPUs available in the VM for ETL and if cpu_predictor is false, for training as well \n",
"def run_rapids_experiment(cpu_training, gpu_count):\n",
"def run_rapids_experiment(cpu_training, gpu_count, part_count):\n",
" # any value between 1-4 is allowed here depending the type of VMs available in gpu_cluster\n",
" if gpu_count not in [1, 2, 3, 4]:\n",
" raise Exception('Value specified for the number of GPUs to use {0} is invalid'.format(gpu_count))\n",
"\n",
" # following data partition mapping is empirical (specific to GPUs used and current data partitioning scheme) and may need to be tweaked\n",
" gpu_count_data_partition_mapping = {1: 2, 2: 4, 3: 5, 4: 7}\n",
" part_count = gpu_count_data_partition_mapping[gpu_count]\n",
"\n",
" max_gpu_count_data_partition_mapping = {1: 3, 2: 4, 3: 6, 4: 8}\n",
" \n",
" if part_count > max_gpu_count_data_partition_mapping[gpu_count]:\n",
" print(\"Too many partitions for the number of GPUs, exceeding memory threshold\")\n",
" \n",
" if part_count > 11:\n",
" print(\"Warning: Maximum number of partitions available is 11\")\n",
" part_count = 11\n",
" \n",
" end_year = 2000\n",
" if gpu_count > 2:\n",
" \n",
" if part_count > 4:\n",
" end_year = 2001 # use more data with more GPUs\n",
"\n",
" src = ScriptRunConfig(source_directory=scripts_folder, \n",
@@ -317,7 +464,8 @@
"\n",
" exp = Experiment(ws, 'rapidstest')\n",
" run = exp.submit(config=src)\n",
" RunDetails(run).show()"
" RunDetails(run).show()\n",
" return run"
]
},
{
@@ -335,9 +483,10 @@
"source": [
"cpu_predictor = False\n",
"# the value for num_gpu should be less than or equal to the number of GPUs available in the VM\n",
"num_gpu = 1 \n",
"num_gpu = 1\n",
"data_part_count = 1\n",
"# train using CPU, use GPU for both ETL and training\n",
"run_rapids_experiment(cpu_predictor, num_gpu)"
"run = run_rapids_experiment(cpu_predictor, num_gpu, data_part_count)"
]
},
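{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because run_rapids_experiment now returns the submitted run, you can optionally block until the run finishes and inspect the metrics it logged. The following commented-out cell is a small usage sketch of the standard Run APIs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optionally wait for the submitted run to complete and print its logged metrics\n",
"# run.wait_for_completion(show_output=True)\n",
"# print(run.get_metrics())"
]
},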
{
@@ -358,8 +507,9 @@
"cpu_predictor = True\n",
"# the value for num_gpu should be less than or equal to the number of GPUs available in the VM\n",
"num_gpu = 1\n",
"data_part_count = 1\n",
"# train using CPU, use GPU for ETL\n",
"run_rapids_experiment(cpu_predictor, num_gpu)"
"run = run_rapids_experiment(cpu_predictor, num_gpu, data_part_count)"
]
},
{