{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-how-to-use-estimatorstep.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to use CommandStep in Azure ML Pipelines\n",
"\n",
"This notebook shows how to use the CommandStep with Azure Machine Learning Pipelines for running R scripts in a pipeline.\n",
"\n",
"The example shows training a model in R to predict probability of fatality for vehicle crashes.\n",
"\n",
"\n",
"## Prerequisite:\n",
"* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning\n",
"* If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](https://aka.ms/pl-config) to:\n",
" * install the Azure ML SDK\n",
" * create a workspace and its configuration file (`config.json`)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's get started. First let's import some Python libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.core\n",
"# check core SDK version number\n",
"print(\"Azure ML SDK Version: \", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize workspace\n",
"Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Workspace\n",
"ws = Workspace.from_config()\n",
"print('Workspace name: ' + ws.name, \n",
" 'Azure region: ' + ws.location, \n",
" 'Subscription id: ' + ws.subscription_id, \n",
" 'Resource group: ' + ws.resource_group, sep = '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create or Attach existing AmlCompute\n",
"You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, you create `AmlCompute` as your training compute resource."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we could not find the cluster with the given name, then we will create a new cluster here. We will create an `AmlCompute` cluster of `STANDARD_D2_V2` CPU VMs. This process is broken down into 3 steps:\n",
"1. create the configuration (this step is local and only takes a second)\n",
"2. create the cluster (this step will take about **20 seconds**)\n",
"3. provision the VMs to bring the cluster to the initial size (of 1 in this case). This step will take about **3-5 minutes** and is providing only sparse output in the process. Please make sure to wait until the call returns before moving to the next cell"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# choose a name for your cluster\n",
"cluster_name = \"cpu-cluster\"\n",
"\n",
"try:\n",
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
" print('Found existing compute target')\n",
"except ComputeTargetException:\n",
" print('Creating a new compute target...')\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)\n",
"\n",
" # create the cluster\n",
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
"\n",
" # can poll for a minimum number of nodes and for a specific timeout. \n",
" # if no min node count is provided it uses the scale settings for the cluster\n",
" compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
"\n",
"# use get_status() to get a detailed status for the current cluster. \n",
"print(compute_target.get_status().serialize())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you have created the compute target, let's see what the workspace's `compute_targets` property returns. You should now see one entry named 'cpu-cluster' of type `AmlCompute`."
]
},
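{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List the compute targets registered in the workspace;\n",
"# 'cpu-cluster' should appear among them as an AmlCompute target\n",
"for name, target in ws.compute_targets.items():\n",
"    print(name, ':', type(target).__name__)"
]
},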
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a CommandStep\n",
"CommandStep adds a step to run a command in a Pipeline. For the full set of configurable options see the CommandStep [reference docs](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.commandstep?view=azure-ml-py).\n",
"\n",
"- **name:** Name of the step\n",
"- **runconfig:** ScriptRunConfig object. You can configure a ScriptRunConfig object as you would for a standalone non-pipeline run and pass it in to this parameter. If using this option, you do not have to specify the `command`, `source_directory`, `compute_target` parameters of the CommandStep constructor as they are already defined in your ScriptRunConfig.\n",
"- **runconfig_pipeline_params:** Override runconfig properties at runtime using key-value pairs each with name of the runconfig property and PipelineParameter for that property\n",
"- **command:** The command to run or path of the executable/script relative to `source_directory`. It is required unless the `runconfig` parameter is specified. It can be specified with string arguments in a single string or with input/output/PipelineParameter in a list.\n",
"- **source_directory:** A folder containing the script and other resources used in the step.\n",
"- **compute_target:** Compute target to use \n",
"- **allow_reuse:** Whether the step should reuse previous results when run with the same settings/inputs. If this is false, a new run will always be generated for this step during pipeline execution.\n",
"- **version:** Optional version tag to denote a change in functionality for the step\n",
"\n",
"> The best practice is to use separate folders for scripts and its dependent files for each step and specify that folder as the `source_directory` for the step. This helps reduce the size of the snapshot created for the step (only the specific folder is snapshotted). Since changes in any files in the `source_directory` would trigger a re-upload of the snapshot, this helps keep the reuse of the step when there are no changes in the `source_directory` of the step."
]
},
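{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch, the cell below constructs a hypothetical CommandStep directly from these parameters rather than from a ScriptRunConfig. It is for illustration only and is not used in the rest of this notebook: a step built this way runs with a default run configuration, so later on we pass a ScriptRunConfig via `runconfig` instead, which lets us attach the custom R environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustration only: a hypothetical CommandStep built directly from its\n",
"# parameters. Without a runconfig it uses a default run configuration,\n",
"# so the rest of this notebook uses the ScriptRunConfig route instead.\n",
"from azureml.pipeline.steps import CommandStep\n",
"\n",
"example_step = CommandStep(name='example',\n",
"                           command='echo hello',\n",
"                           compute_target=compute_target,\n",
"                           source_directory='commandstep_r',\n",
"                           allow_reuse=True)"
]
},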
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure environment\n",
"\n",
"Configure the environment for the train step. In this example we will create an environment from the Dockerfile we have included.\n",
"\n",
"> Azure ML currently requires Python as an implicit dependency, so Python must installed in your image even if your training script does not have this dependency."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Environment\n",
"import os\n",
"\n",
"src_dir = 'commandstep_r'\n",
"\n",
"env = Environment.from_dockerfile(name='r_env', dockerfile=os.path.join(src_dir, 'Dockerfile'))"
]
},
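{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the environment will be built from, you can print the contents of the included Dockerfile:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect the Dockerfile that the environment is built from\n",
"with open(os.path.join(src_dir, 'Dockerfile')) as f:\n",
"    print(f.read())"
]
},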
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure input training dataset\n",
"\n",
"This tutorial uses data from the US National Highway Traffic Safety Administration. This dataset includes data from over 25,000 car crashes in the US, with variables you can use to predict the likelihood of a fatality. We have included an Rdata file that includes the accidents data for analysis.\n",
"\n",
"Here we use the workspace's default datastore to upload the training data file (**accidents.Rd**); in practice you can use any datastore you want."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datastore = ws.get_default_datastore()\n",
"data_ref = datastore.upload_files(files=[os.path.join(src_dir, 'accidents.Rd')], target_path='accidentdata')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now create a FileDataset from the data, which will be used as an input to the train step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Dataset\n",
"dataset = Dataset.File.from_files(datastore.path('accidentdata'))\n",
"dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now create a ScriptRunConfig that configures the training run. Note that in the `command` we include the input dataset for the training data.\n",
"\n",
"> For detailed guidance on how to move data in pipelines for input and output data, see the documentation [Moving data into and between ML pipelines](https://docs.microsoft.com/azure/machine-learning/how-to-move-data-in-out-of-pipelines)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import ScriptRunConfig\n",
"\n",
"train_config = ScriptRunConfig(source_directory=src_dir,\n",
" command=['Rscript accidents.R --data_folder', dataset.as_mount(), '--output_folder outputs'],\n",
" compute_target=compute_target,\n",
" environment=env)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now create a CommandStep and pass in the ScriptRunConfig object to the `runconfig` parameter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.steps import CommandStep\n",
"\n",
"train = CommandStep(name='train', runconfig=train_config)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build and Submit the Pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.core import Pipeline\n",
"from azureml.core import Experiment\n",
"\n",
"pipeline = Pipeline(workspace=ws, steps=[train])\n",
"pipeline_run = Experiment(ws, 'r-commandstep-pipeline').submit(pipeline)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## View Run Details"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(pipeline_run).show()"
]
},
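{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, block until the pipeline run completes and stream its log output:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Wait for the pipeline run to finish, streaming logs as it progresses\n",
"pipeline_run.wait_for_completion(show_output=True)"
]
}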
],
"metadata": {
"authors": [
{
"name": "minxia"
}
],
"category": "tutorial",
"compute": [
"AML Compute"
],
"datasets": [
"Custom"
],
"deployment": [
"None"
],
"exclude_from_index": false,
"framework": [
"Azure ML"
],
"friendly_name": "Azure Machine Learning Pipeline with CommandStep for R",
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
},
"order_index": 7,
"star_tag": [
"None"
],
"tags": [
"None"
],
"task": "Demonstrates the use of CommandStep for running R scripts"
},
"nbformat": 4,
"nbformat_minor": 2
}