{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NVIDIA RAPIDS in Azure Machine Learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [RAPIDS](https://www.developer.nvidia.com/rapids) suite of software libraries from NVIDIA enables the execution of end-to-end data science and analytics pipelines entirely on GPUs. In many machine learning projects, a significant portion of the model training time is spent in setting up the data; this stage of the process is known as Extraction, Transformation and Loading, or ETL. By using the DataFrame API for ETL\u00c2\u00a0and GPU-capable ML algorithms in RAPIDS, data preparation and training models can be done in GPU-accelerated end-to-end pipelines without incurring serialization costs between the pipeline stages. This notebook demonstrates how to use NVIDIA RAPIDS to prepare data and train model\u00c3\u201a\u00c2\u00a0in Azure.\n",
|
|
" \n",
|
|
"In this notebook, we will do the following:\n",
|
|
" \n",
|
|
"* Create an Azure Machine Learning Workspace\n",
|
|
"* Create an AMLCompute target\n",
|
|
"* Use a script to process our data and train a model\n",
|
|
"* Obtain the data required to run this sample\n",
|
|
"* Create an AML run configuration to launch a machine learning job\n",
|
|
"* Run the script to prepare data for training and train the model\n",
|
|
" \n",
|
|
"Prerequisites:\n",
|
|
"* An Azure subscription to create a Machine Learning Workspace\n",
|
|
"* Familiarity with the Azure ML SDK (refer to [notebook samples](https://github.com/Azure/MachineLearningNotebooks))\n",
|
|
"* A Jupyter notebook environment with Azure Machine Learning SDK installed. Refer to instructions to [setup the environment](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-environment#local)"
|
|
]
|
|
},
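{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the end-to-end GPU pipeline idea concrete, here is a minimal sketch of the pattern the training script follows on the remote GPU target: ETL with cuDF, then training with XGBoost's GPU support, without copying data back to the host in between. The file name, column names, and feature below are placeholders (not the actual mortgage schema processed by `process_data.py`), and the snippet assumes a machine with RAPIDS (`cudf`) and GPU-enabled `xgboost` installed:\n",
"\n",
"```python\n",
"import cudf\n",
"import xgboost as xgb\n",
"\n",
"# ETL on the GPU: read and transform the data without leaving device memory\n",
"gdf = cudf.read_csv('loans.csv')  # placeholder input file\n",
"gdf = gdf.dropna()\n",
"gdf['ratio'] = gdf['balance'] / gdf['income']  # placeholder feature\n",
"\n",
"# hand the GPU DataFrame straight to XGBoost; no host round-trip\n",
"dtrain = xgb.DMatrix(gdf.drop(columns=['label']), label=gdf['label'])\n",
"params = {'tree_method': 'gpu_hist', 'objective': 'binary:logistic'}\n",
"model = xgb.train(params, dtrain, num_boost_round=100)\n",
"```"
]
},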
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Verify that the Azure ML SDK is installed"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.core\n",
"print(\"SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from azureml.core import Workspace, Experiment\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"from azureml.core.compute import AmlCompute, ComputeTarget\n",
"from azureml.data.data_reference import DataReference\n",
"from azureml.core.runconfig import RunConfiguration\n",
"from azureml.core import ScriptRunConfig\n",
"from azureml.widgets import RunDetails"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Azure ML Workspace"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following step is optional if you already have a workspace. To use an existing workspace,\n",
"skip this workspace creation step and move on to the next step to load the workspace.\n",
"\n",
"<font color='red'>Important</font>: in the code cell below, be sure to set the correct values for subscription_id,\n",
"resource_group, workspace_name, and workspace_region before executing it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"subscription_id = os.environ.get(\"SUBSCRIPTION_ID\", \"<subscription_id>\")\n",
"resource_group = os.environ.get(\"RESOURCE_GROUP\", \"<resource_group>\")\n",
"workspace_name = os.environ.get(\"WORKSPACE_NAME\", \"<workspace_name>\")\n",
"workspace_region = os.environ.get(\"WORKSPACE_REGION\", \"<region>\")\n",
"\n",
"ws = Workspace.create(workspace_name, subscription_id=subscription_id, resource_group=resource_group, location=workspace_region)\n",
"\n",
"# write config to a local directory for future use\n",
"ws.write_config()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load existing Workspace"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# if a locally-saved configuration file for the workspace is not available, use the following to load the workspace\n",
"# ws = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name)\n",
"\n",
"print('Workspace name: ' + ws.name,\n",
"      'Azure region: ' + ws.location,\n",
"      'Subscription id: ' + ws.subscription_id,\n",
"      'Resource group: ' + ws.resource_group, sep='\\n')\n",
"\n",
"scripts_folder = \"scripts_folder\"\n",
"\n",
"if not os.path.isdir(scripts_folder):\n",
"    os.mkdir(scripts_folder)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create AML Compute Target"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because NVIDIA RAPIDS requires P40 or V100 GPUs, you need to choose a compute target from one of the [NC_v3](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ncv3-series), [NC_v2](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ncv2-series), [ND](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#nd-series) or [ND_v2](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#ndv2-series-preview) virtual machine types in Azure; these are the VM families in Azure that are provisioned with these GPUs.\n",
"\n",
"Pick one of the supported VM SKUs based on the number of GPUs you want to use for ETL and training in RAPIDS.\n",
"\n",
"The script in this notebook is implemented for single-machine scenarios. An example supporting multiple nodes will be published later."
]
},
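{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the NCv3 SKUs provide the following V100 counts, which is a convenient way to pick the `vm_size` used in the cell below:\n",
"\n",
"```python\n",
"# number of V100 GPUs per NCv3 SKU\n",
"ncv3_gpu_counts = {'Standard_NC6s_v3': 1, 'Standard_NC12s_v3': 2, 'Standard_NC24s_v3': 4}\n",
"```"
]
},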
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gpu_cluster_name = \"gpucluster\"\n",
"\n",
"if gpu_cluster_name in ws.compute_targets:\n",
"    gpu_cluster = ws.compute_targets[gpu_cluster_name]\n",
"    if gpu_cluster and isinstance(gpu_cluster, AmlCompute):\n",
"        print('Found compute target. Will use {0}'.format(gpu_cluster_name))\n",
"else:\n",
"    print(\"creating new cluster\")\n",
"    # vm_size parameter below could be modified to one of the RAPIDS-supported VM types\n",
"    provisioning_config = AmlCompute.provisioning_configuration(vm_size=\"Standard_NC6s_v3\", min_nodes=1, max_nodes=1)\n",
"\n",
"    # create the cluster\n",
"    gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, provisioning_config)\n",
"    gpu_cluster.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Script to process data and train model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# copy process_data.py into the script folder\n",
"import shutil\n",
"shutil.copy('./process_data.py', os.path.join(scripts_folder, 'process_data.py'))"
]
},
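{
"cell_type": "markdown",
"metadata": {},
"source": [
"`process_data.py` is invoked later with the arguments `--num_gpu`, `--data_dir`, `--part_count`, `--end_year`, and `--cpu_predictor` (see the `ScriptRunConfig` further below). As a rough sketch of that interface only, a script like this would typically parse them along these lines; the actual script ships with this sample and may differ in details and defaults:\n",
"\n",
"```python\n",
"import argparse\n",
"\n",
"parser = argparse.ArgumentParser()\n",
"parser.add_argument('--num_gpu', type=int, default=1)       # GPUs to use for ETL (and training)\n",
"parser.add_argument('--data_dir', type=str)                 # mounted datastore path with acq/, perf/, names.csv\n",
"parser.add_argument('--part_count', type=int, default=1)    # number of data partitions to process\n",
"parser.add_argument('--end_year', type=int, default=2000)   # last year of data to include\n",
"parser.add_argument('--cpu_predictor', type=str, default='False')  # train on CPU if 'True'\n",
"args = parser.parse_args()\n",
"```"
]
},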
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data required to run this sample"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This sample uses [Fannie Mae's Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html). Once you obtain access to the data, you will need to make it available in an [Azure Machine Learning Datastore](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data) for use in this sample. The following code shows how to do that."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Downloading Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tarfile\n",
"import hashlib\n",
"from urllib.request import urlretrieve\n",
"from progressbar import ProgressBar  # from the 'progressbar' package, used by show_progress below\n",
"\n",
"def validate_downloaded_data(path):\n",
"    if os.path.isdir(path) and os.path.exists(os.path.join(path, 'names.csv')):\n",
"        if os.path.isdir(os.path.join(path, 'acq')) and len(os.listdir(os.path.join(path, 'acq'))) == 8:\n",
"            if os.path.isdir(os.path.join(path, 'perf')) and len(os.listdir(os.path.join(path, 'perf'))) == 11:\n",
"                print(\"Data has been downloaded and decompressed at: {0}\".format(path))\n",
"                return True\n",
"    print(\"Data has not been downloaded and decompressed\")\n",
"    return False\n",
"\n",
"def show_progress(count, block_size, total_size):\n",
"    global pbar\n",
"    global processed\n",
"\n",
"    if count == 0:\n",
"        pbar = ProgressBar(maxval=total_size)\n",
"        processed = 0\n",
"\n",
"    processed += block_size\n",
"    processed = min(processed, total_size)\n",
"    pbar.update(processed)\n",
"\n",
"\n",
"def download_file(fileroot):\n",
"    filename = fileroot + '.tgz'\n",
"    if not os.path.exists(filename) or hashlib.md5(open(filename, 'rb').read()).hexdigest() != '82dd47135053303e9526c2d5c43befd5':\n",
"        url_format = 'http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/{0}.tgz'\n",
"        url = url_format.format(fileroot)\n",
"        print(\"...Downloading file: {0}\".format(filename))\n",
"        # pass show_progress as the reporthook so pbar is created before finish() is called\n",
"        urlretrieve(url, filename, reporthook=show_progress)\n",
"        pbar.finish()\n",
"        print(\"...File {0} finished downloading\".format(filename))\n",
"    else:\n",
"        print(\"...File {0} has been downloaded already\".format(filename))\n",
"    return filename\n",
"\n",
"def decompress_file(filename, path):\n",
"    tar = tarfile.open(filename)\n",
"    print(\"...Getting information from {0} about files to decompress\".format(filename))\n",
"    members = tar.getmembers()\n",
"    numFiles = len(members)\n",
"    for member_info in members:\n",
"        tar.extract(member_info, path=path)\n",
"    print(\"...All {0} files have been decompressed\".format(numFiles))\n",
"    tar.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fileroot = 'mortgage_2000-2001'\n",
"path = os.path.join('.', fileroot)\n",
"pbar = None\n",
"processed = 0\n",
"\n",
"if not validate_downloaded_data(path):\n",
"    print(\"Downloading and Decompressing Input Data\")\n",
"    filename = download_file(fileroot)\n",
"    decompress_file(filename, path)\n",
"    print(\"Input Data has been Downloaded and Decompressed\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Uploading Data to Workspace"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds = ws.get_default_datastore()\n",
"\n",
"# download and uncompress data in a local directory before uploading to the data store\n",
"# the directory specified in the src_dir parameter below should contain the acq and perf data directories and the names.csv file\n",
"\n",
"# ---->>>> UNCOMMENT THE LINE BELOW TO UPLOAD YOUR DATA IF YOU HAVE NOT DONE SO ALREADY <<<<----\n",
"# ds.upload(src_dir=path, target_path=fileroot, overwrite=True, show_progress=True)\n",
"\n",
"# reference the data already uploaded to the datastore\n",
"data_ref = DataReference(data_reference_name='data', datastore=ds, path_on_datastore=fileroot)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create AML run configuration to launch a machine learning job"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A RunConfiguration is used to submit jobs to the Azure Machine Learning service. When creating a RunConfiguration for a job, you can either:\n",
"1. specify a Docker image with a prebuilt conda environment and use it without any modifications to run the job, or\n",
"2. specify a Docker image as the base image and conda or pip packages as dependencies, letting AML build a new Docker image with a conda environment containing the specified dependencies to use in the job\n",
"\n",
"The second option is the one recommended in AML.\n",
"The following steps include code for both options; pick whichever better fits your requirements."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Specify prebuilt conda environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following code shows how to install RAPIDS using conda. The `rapids.yml` file contains the list of packages necessary to run this tutorial. **NOTE:** The initial build of the image might take up to 20 minutes as the service needs to build and cache the new image; once the image is built, subsequent runs use the cached image and the overhead is minimal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cd = CondaDependencies(conda_dependencies_file_path='rapids.yml')\n",
"run_config = RunConfiguration(conda_dependencies=cd)\n",
"run_config.framework = 'python'\n",
"run_config.target = gpu_cluster_name\n",
"run_config.environment.docker.enabled = True\n",
"run_config.environment.docker.gpu_support = True\n",
"run_config.environment.docker.base_image = \"mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04\"\n",
"run_config.environment.spark.precache_packages = False\n",
"run_config.data_references = {'data': data_ref.to_config()}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using Docker"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, you can specify a RAPIDS Docker image."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# run_config = RunConfiguration()\n",
"# run_config.framework = 'python'\n",
"# run_config.environment.python.user_managed_dependencies = True\n",
"# run_config.environment.python.interpreter_path = '/conda/envs/rapids/bin/python'\n",
"# run_config.target = gpu_cluster_name\n",
"# run_config.environment.docker.enabled = True\n",
"# run_config.environment.docker.gpu_support = True\n",
"# run_config.environment.docker.base_image = \"rapidsai/rapidsai:cuda9.2-runtime-ubuntu20.04\"\n",
"# # run_config.environment.docker.base_image_registry.address = '<registry_url>' # not required if the base_image is on Docker Hub\n",
"# # run_config.environment.docker.base_image_registry.username = '<user_name>' # needed only for private images\n",
"# # run_config.environment.docker.base_image_registry.password = '<password>' # needed only for private images\n",
"# run_config.environment.spark.precache_packages = False\n",
"# run_config.data_references = {'data': data_ref.to_config()}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Wrapper function to submit Azure Machine Learning experiment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# parameter cpu_training indicates whether training should be done on CPU; if True, GPUs are used *only* for ETL and *not* for training\n",
"# parameter gpu_count indicates the number of GPUs (among those available in the VM) to use for ETL and, if cpu_training is False, for training as well\n",
"def run_rapids_experiment(cpu_training, gpu_count, part_count):\n",
"    # any value between 1 and 4 is allowed here, depending on the type of VMs available in gpu_cluster\n",
"    if gpu_count not in [1, 2, 3, 4]:\n",
"        raise ValueError('Invalid number of GPUs specified: {0}'.format(gpu_count))\n",
"\n",
"    # the following data partition mapping is empirical (specific to the GPUs used and the current data partitioning scheme) and may need to be tweaked\n",
"    max_gpu_count_data_partition_mapping = {1: 3, 2: 4, 3: 6, 4: 8}\n",
"\n",
"    if part_count > max_gpu_count_data_partition_mapping[gpu_count]:\n",
"        print(\"Too many partitions for the number of GPUs, exceeding memory threshold\")\n",
"\n",
"    if part_count > 11:\n",
"        print(\"Warning: Maximum number of partitions available is 11\")\n",
"        part_count = 11\n",
"\n",
"    end_year = 2000\n",
"\n",
"    if part_count > 4:\n",
"        end_year = 2001  # use more data with more GPUs\n",
"\n",
"    src = ScriptRunConfig(source_directory=scripts_folder,\n",
"                          script='process_data.py',\n",
"                          arguments=['--num_gpu', gpu_count, '--data_dir', str(data_ref),\n",
"                                     '--part_count', part_count, '--end_year', end_year,\n",
"                                     '--cpu_predictor', cpu_training],\n",
"                          run_config=run_config)\n",
"\n",
"    exp = Experiment(ws, 'rapidstest')\n",
"    run = exp.submit(config=src)\n",
"    RunDetails(run).show()\n",
"    return run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit experiment (ETL & training on GPU)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cpu_predictor = False\n",
"# the value for num_gpu should be less than or equal to the number of GPUs available in the VM\n",
"num_gpu = 1\n",
"data_part_count = 1\n",
"# use GPU for both ETL and training\n",
"run = run_rapids_experiment(cpu_predictor, num_gpu, data_part_count)"
]
},
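{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, block until the submitted run finishes and stream its logs into the notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# wait for the run to complete, streaming its logs to this notebook\n",
"run.wait_for_completion(show_output=True)"
]
},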
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit experiment (ETL on GPU, training on CPU)\n",
"\n",
"To observe the performance difference between GPU-accelerated, RAPIDS-based training and CPU-only training, set `cpu_predictor` to `True` and rerun the experiment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cpu_predictor = True\n",
"# the value for num_gpu should be less than or equal to the number of GPUs available in the VM\n",
"num_gpu = 1\n",
"data_part_count = 1\n",
"# train on CPU; use GPU only for ETL\n",
"run = run_rapids_experiment(cpu_predictor, num_gpu, data_part_count)"
]
},
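{
"cell_type": "markdown",
"metadata": {},
"source": [
"A simple way to compare the GPU and CPU runs is their wall-clock duration once both have completed. The sketch below assumes the details dictionary returned by `run.get_details()` includes `startTimeUtc` and `endTimeUtc` timestamps; record the value for each run to compare them:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"def run_duration(run):\n",
"    # parse the start/end timestamps reported by the service for a completed run\n",
"    details = run.get_details()\n",
"    fmt = '%Y-%m-%dT%H:%M:%S.%fZ'\n",
"    start = datetime.strptime(details['startTimeUtc'], fmt)\n",
"    end = datetime.strptime(details['endTimeUtc'], fmt)\n",
"    return end - start\n",
"\n",
"print('Run duration:', run_duration(run))"
]
},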
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Delete cluster"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# uncomment the line below to delete the cluster when you no longer need it\n",
"# gpu_cluster.delete()"
]
}
],
"metadata": {
"authors": [
{
"name": "ksivas"
}
],
"kernelspec": {
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}