MachineLearningNotebooks/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-notebook-runner-step.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.  \n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-getting-started.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Azure Machine Learning Pipeline with NotebookRunnerStep\n",
        "This notebook demonstrates the use of `NotebookRunnerStep`. It allows you to run a local notebook as a step in Azure Machine Learning Pipeline."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Introduction\n",
        "In this example we showcase how you can run another notebook `notebook_runner/training_notebook.ipynb` as a step in Azure Machine Learning Pipeline.\n",
        "\n",
        "If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook.\n",
        "\n",
        "In this notebook you will learn how to:\n",
        "1. Create an `Experiment` in an existing `Workspace`.\n",
        "2. Create or Attach existing AmlCompute to a workspace.\n",
        "3. Configure NotebookRun using `NotebokRunConfig`.\n",
        "5. Use NotebookRunnerStep.\n",
        "6. Run the notebook on `AmlCompute` as a pipeline step consuming the output of a python script step.\n",
        "\n",
        "Advantages of running your notebook as a step in pipeline:\n",
        "1. Run your notebook like a python script without converting into .py files, leveraging complete end to end experience of Azure Machine Learning Pipelines.\n",
        "2. Use pipeline intermediate data to and from the notebook along with other steps in pipeline.\n",
        "3. Parameterize your notebook with [Pipeline Parameters](./aml-pipelines-publish-and-run-using-rest-endpoint.ipynb).\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Azure Machine Learning and Pipeline SDK-specific imports"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "import requests\n",
        "import tempfile\n",
        "\n",
        "import azureml.core\n",
        "\n",
        "from azureml.core.compute import AmlCompute, ComputeTarget\n",
        "from azureml.core.runconfig import RunConfiguration\n",
        "from azureml.data.data_reference import DataReference\n",
        "from azureml.pipeline.core import PipelineData\n",
        "from azureml.core.datastore import Datastore\n",
        "\n",
        "from azureml.core import Workspace, Experiment\n",
        "from azureml.contrib.notebook import NotebookRunConfig, AzureMLNotebookHandler\n",
        "\n",
        "from azureml.pipeline.core import Pipeline\n",
        "from azureml.pipeline.steps import PythonScriptStep\n",
        "from azureml.contrib.notebook import NotebookRunnerStep\n",
        "\n",
        "# Check core SDK version number\n",
        "print(\"SDK version:\", azureml.core.VERSION)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Initialize Workspace\n",
        "\n",
        "Initialize a [workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace(class%29) object from persisted configuration."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')\n",
        "ws.set_default_datastore(\"workspaceblobstore\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Upload data to datastore"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# download data file from remote\n",
        "response = requests.get(\"https://dprepdata.blob.core.windows.net/demo/Titanic.csv\")\n",
        "titanic_file = os.path.join(tempfile.mkdtemp(), \"Titanic.csv\")\n",
        "with open(titanic_file, \"w\") as f:\n",
        "    f.write(response.content.decode(\"utf-8\"))\n",
        "Datastore.get(ws, \"workspaceblobstore\").upload_files([titanic_file], target_path=\"titanic\", overwrite=True)\n",
        "print(\"Upload call completed\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Create an Azure ML experiment\n",
        "Let's create an experiment named \"notebook-step-run-example\" and a folder to holding the notebook and other scripts. The script runs will be recorded under the experiment in Azure.\n",
        "\n",
        "The best practice is to use separate folders for scripts and its dependent files for each step and specify that folder as the `source_directory` for the step. This helps reduce the size of the snapshot created for the step (only the specific folder is snapshotted). Since changes in any files in the `source_directory` would trigger a re-upload of the snapshot, this helps keep the reuse of the step when there are no changes in the `source_directory` of the step."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Choose a name for the run history container in the workspace.\n",
        "experiment_name = 'notebook-step-run-example'\n",
        "source_directory = 'notebook_runner'\n",
        "\n",
        "experiment = Experiment(ws, experiment_name)\n",
        "experiment"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create or Attach an AmlCompute cluster\n",
        "You will need to create a [compute target](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.computetarget?view=azure-ml-py) for your remote run. In this tutorial, you get the default `AmlCompute` as your training compute resource.\n",
        "\n",
        "> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Choose a name for your cluster.\n",
        "amlcompute_cluster_name = \"cpu-cluster\"\n",
        "\n",
        "found = False\n",
        "# Check if this compute target already exists in the workspace.\n",
        "cts = ws.compute_targets\n",
        "if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':\n",
        "    found = True\n",
        "    print('Found existing compute target.')\n",
        "    compute_target = cts[amlcompute_cluster_name]\n",
        "    \n",
        "if not found:\n",
        "    print('Creating a new compute target...')\n",
        "    provisioning_config = AmlCompute.provisioning_configuration(vm_size = \"STANDARD_D2_V2\", # for GPU, use \"STANDARD_NC6\"\n",
        "                                                                #vm_priority = 'lowpriority', # optional\n",
        "                                                                max_nodes = 4)\n",
        "\n",
        "    # Create the cluster.\n",
        "    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)\n",
        "    \n",
        "    # Can poll for a minimum number of nodes and for a specific timeout.\n",
        "    # If no min_node_count is provided, it will use the scale settings for the cluster.\n",
        "    compute_target.wait_for_completion(show_output = True, min_node_count = 1, timeout_in_minutes = 10)\n",
        "    \n",
        "     # For a more detailed view of current AmlCompute status, use get_status()."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create a new RunConfig object"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.conda_dependencies import CondaDependencies\n",
        "\n",
        "conda_run_config = RunConfiguration(framework=\"python\")\n",
        "\n",
        "conda_run_config.environment.docker.enabled = True\n",
        "conda_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE\n",
        "\n",
        "cd = CondaDependencies.create(pip_packages=['azureml-sdk'])\n",
        "conda_run_config.environment.python.conda_dependencies = cd\n",
        "\n",
        "print('run config is ready')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Define input and outputs"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "input_data = DataReference(\n",
        "    datastore=Datastore.get(ws, \"workspaceblobstore\"),\n",
        "    data_reference_name=\"blob_test_data\",\n",
        "    path_on_datastore=\"titanic/Titanic.csv\")\n",
        "\n",
        "output_data = PipelineData(name=\"processed_data\",\n",
        "                           datastore=Datastore.get(ws, \"workspaceblobstore\"))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create notebook run configuration and set parameters values"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "handler = AzureMLNotebookHandler(timeout=600, progress_bar=False, log_output=True)\n",
        "\n",
        "cfg = NotebookRunConfig(source_directory=source_directory, notebook=\"training_notebook.ipynb\",\n",
        "                        handler = handler,\n",
        "                        parameters={\"arg1\": \"Machine Learning\"},\n",
        "                        run_config=conda_run_config)\n",
        "\n",
        "print(\"Notebook Run Config is created.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Define PythonScriptStep"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print('Source directory for the step is {}.'.format(os.path.realpath('./train')))\n",
        "python_script_step = PythonScriptStep(\n",
        "                         script_name=\"train.py\",\n",
        "                         arguments=[\"--input_data\", input_data],\n",
        "                         inputs=[input_data],\n",
        "                         outputs=[output_data],\n",
        "                         compute_target=compute_target, \n",
        "                         source_directory=\"./train\",\n",
        "                         allow_reuse=True)\n",
        "print(\"python_script_step created\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Define NotebookRunnerStep\n",
        "\n",
        "This step will consume intermediate output produced by `python_script_step` as an input.\n",
        "\n",
        "Optionally, a output of type `output_notebook_pipeline_data_name` can be added to the `NotebookRunnerStep` to redirect the `output_notebook` of notebook run to `NotebookRunnerStep`'s step output produced as `PipelineData` and can be further passed along the pipeline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.pipeline.core import PipelineParameter\n",
        "\n",
        "output_from_notebook = PipelineData(name=\"notebook_processed_data\",\n",
        "                                    datastore=Datastore.get(ws, \"workspaceblobstore\"))\n",
        "\n",
        "my_pipeline_param = PipelineParameter(name=\"pipeline_param\", default_value=\"my_param\")\n",
        "\n",
        "print('Source directory for the step is {}.'.format(os.path.realpath(source_directory)))\n",
        "notebook_runner_step = NotebookRunnerStep(name=\"training_notebook_step\",\n",
        "                                          notebook_run_config=cfg,\n",
        "                                          params={\"my_pipeline_param\": my_pipeline_param},\n",
        "                                          inputs=[output_data],\n",
        "                                          outputs=[output_from_notebook],\n",
        "                                          allow_reuse=True,\n",
        "                                          compute_target=compute_target,\n",
        "                                          output_notebook_pipeline_data_name=\"notebook_result\")\n",
        "\n",
        "print(\"Notebook Runner Step is Created.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Build Pipeline\n",
        "\n",
        "Once we have the steps (or steps collection), we can build the [pipeline](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py). By deafult, all these steps will run in **parallel** once we submit the pipeline for run.\n",
        "\n",
        "A pipeline is created with a list of steps and a workspace. Submit a pipeline using [submit](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py#submit-experiment-name--pipeline-parameters-none--continue-on-step-failure-false--regenerate-outputs-false--parent-run-id-none----kwargs-). When submit is called, a [PipelineRun](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinerun?view=azure-ml-py) is created which in turn creates [StepRun](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.steprun?view=azure-ml-py) objects for each step in the workflow."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "pipeline1 = Pipeline(workspace=ws, steps=[notebook_runner_step])\n",
        "print(\"Pipeline creation complete\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "pipeline_run1 = experiment.submit(pipeline1)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.widgets import RunDetails\n",
        "RunDetails(pipeline_run1).show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Download output notebook\n",
        "\n",
        "`output_notebook` can be retrieved via pipeline step output if `output_notebook_pipeline_data_name` is provided to the `NotebookRunnerStep`"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "pipeline_run1.wait_for_completion()\n",
        "train_step = pipeline_run1.find_step_run('training_notebook_step') # Retrieve the step runs by name `train.py`\n",
        "\n",
        "if train_step:\n",
        "    train_step_obj = train_step[0] # since we have only one step by name `training_notebook_step`\n",
        "    train_step_obj.get_output_data('notebook_result').download(source_directory) # download the output to source_directory"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "sanpil"
      }
    ],
    "category": "tutorial",
    "compute": [
      "AML Compute"
    ],
    "datasets": [
      "Custom"
    ],
    "deployment": [
      "None"
    ],
    "exclude_from_index": false,
    "framework": [
      "Azure ML"
    ],
    "friendly_name": "How to use run a notebook as a step in AML Pipelines",
    "kernelspec": {
      "display_name": "Python 3.8 - AzureML",
      "language": "python",
      "name": "python38-azureml"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.7"
    },
    "order_index": 12,
    "star_tag": [
      "None"
    ],
    "tags": [
      "None"
    ],
    "task": "Demonstrates the use of NotebookRunnerStep"
  },
  "nbformat": 4,
  "nbformat_minor": 2
}