update samples from Release-193 as a part of 1.53.0 SDK stable release

2025-12-23 20:00:06 -05:00 · 2023-08-23 03:24:03 +00:00
parent d0961b98bf
commit bb11c80b1b
96 changed files with 1116 additions and 2055 deletions
--- a/how-to-use-azureml/ml-frameworks/pytorch/distributed-pytorch-with-distributeddataparallel/distributed-pytorch-with-distributeddataparallel.ipynb
+++ b/how-to-use-azureml/ml-frameworks/pytorch/distributed-pytorch-with-distributeddataparallel/distributed-pytorch-with-distributeddataparallel.ipynb
@@ -97,7 +97,7 @@
      "metadata": {},
      "source": [
        "## Create or attach existing AmlCompute\n",
-        "You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, we use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training compute resource. Specifically, the below code creates an `STANDARD_NC6` GPU cluster that autoscales from `0` to `4` nodes.\n",
+        "You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, we use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training compute resource. Specifically, the code below creates a `Standard_NC6s_v3` GPU cluster that autoscales from `0` to `4` nodes.\n",
        "\n",
        "> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.\n",
        "\n",
@@ -123,7 +123,7 @@
        "    print('Found existing compute target.')\n",
        "except ComputeTargetException:\n",
        "    print('Creating a new compute target...')\n",
-        "    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',\n",
+        "    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_NC6s_v3',\n",
        "                                                           max_nodes=4)\n",
        "\n",
        "    # create the cluster\n",
@@ -293,7 +293,7 @@
      "source": [
        "from azureml.core import Environment\n",
        "\n",
-        "pytorch_env = Environment.get(ws, name='AzureML-PyTorch-1.6-GPU')"
+        "pytorch_env = Environment.get(ws, name='azureml-acpt-pytorch-1.11-cuda11.3')"
      ]
    },
    {
@@ -323,7 +323,7 @@
        "To use the per-process launch option in which Azure ML will handle launching each of the processes to run your training script,\n",
        "\n",
        "1. Specify the training script and arguments\n",
-        "2. Create a `PyTorchConfiguration` and specify `node_count` and `process_count`. The `process_count` is the total number of processes you want to run for the job; this should typically equal the # of GPUs available on each node multiplied by the # of nodes. Since this tutorial uses the `STANDARD_NC6` SKU, which has one GPU, the total process count for a 2-node job is `2`. If you are using a SKU with >1 GPUs, adjust the `process_count` accordingly.\n",
+        "2. Create a `PyTorchConfiguration` and specify `node_count` and `process_count`. The `process_count` is the total number of processes you want to run for the job; this should typically equal the # of GPUs available on each node multiplied by the # of nodes. Since this tutorial uses the `Standard_NC6s_v3` SKU, which has one GPU, the total process count for a 2-node job is `2`. If you are using a SKU with >1 GPUs, adjust the `process_count` accordingly.\n",
        "\n",
        "Azure ML will set the `MASTER_ADDR`, `MASTER_PORT`, `NODE_RANK`, `WORLD_SIZE` environment variables on each node, in addition to the process-level `RANK` and `LOCAL_RANK` environment variables, that are needed for distributed PyTorch training."
      ]
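
The `process_count` rule in the last hunk (GPUs per node multiplied by the number of nodes) and the environment variables it lists can be sketched in plain Python. This is a minimal illustration, not code from the notebook: the helper names `total_process_count` and `read_distributed_env` are hypothetical, and the fallback defaults used when a variable is unset are assumptions chosen so the snippet also runs outside a distributed job.

```python
import os


def total_process_count(node_count: int, gpus_per_node: int) -> int:
    """Return the total number of processes for a job (one process per GPU)."""
    if node_count < 1 or gpus_per_node < 1:
        raise ValueError("node_count and gpus_per_node must both be >= 1")
    return node_count * gpus_per_node


def read_distributed_env(env=None) -> dict:
    """Read the distributed-training variables named in the notebook text.

    The fallback values are assumptions for single-process local runs;
    in a per-process launch the variables are expected to be set already.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global process index
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index within the node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total process count
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": env.get("MASTER_PORT", "29500"),
    }


# A 2-node job on a one-GPU SKU such as Standard_NC6s_v3:
print(total_process_count(node_count=2, gpus_per_node=1))
```

For a SKU with four GPUs per node, the same 2-node job would use `total_process_count(2, 4)`, and that value is what would be passed as `process_count`.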