mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-20 09:37:04 -05:00
432 lines
17 KiB
Plaintext
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Azure Machine Learning Pipeline with NotebookRunnerStep\n",
"This notebook demonstrates the use of `NotebookRunnerStep`, which allows you to run a local notebook as a step in an Azure Machine Learning Pipeline."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"In this example we showcase how you can run another notebook, `notebook_runner/training_notebook.ipynb`, as a step in an Azure Machine Learning Pipeline.\n",
"\n",
"If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook.\n",
"\n",
"In this notebook you will learn how to:\n",
"1. Create an `Experiment` in an existing `Workspace`.\n",
"2. Create or attach existing AmlCompute to a workspace.\n",
"3. Configure a notebook run using `NotebookRunConfig`.\n",
"4. Use `NotebookRunnerStep`.\n",
"5. Run the notebook on `AmlCompute` as a pipeline step consuming the output of a Python script step.\n",
"\n",
"Advantages of running your notebook as a step in a pipeline:\n",
"1. Run your notebook like a Python script, without converting it into a .py file, leveraging the complete end-to-end experience of Azure Machine Learning Pipelines.\n",
"2. Pass pipeline intermediate data to and from the notebook along with other steps in the pipeline.\n",
"3. Parameterize your notebook with [Pipeline Parameters](./aml-pipelines-publish-and-run-using-rest-endpoint.ipynb).\n",
"\n",
"Try some more [quick start notebooks](https://github.com/microsoft/recommenders/tree/master/notebooks/00_quick_start) with `NotebookRunnerStep`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Azure Machine Learning and Pipeline SDK-specific imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import azureml.core\n",
"\n",
"from azureml.core.compute import AmlCompute, ComputeTarget\n",
"from azureml.core.runconfig import RunConfiguration\n",
"from azureml.data.data_reference import DataReference\n",
"from azureml.pipeline.core import PipelineData\n",
"from azureml.core.datastore import Datastore\n",
"\n",
"from azureml.core import Workspace, Experiment\n",
"from azureml.contrib.notebook import NotebookRunConfig, AzureMLNotebookHandler\n",
"\n",
"from azureml.pipeline.core import Pipeline\n",
"from azureml.pipeline.steps import PythonScriptStep\n",
"from azureml.contrib.notebook import NotebookRunnerStep\n",
"\n",
"# Check core SDK version number\n",
"print(\"SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initialize Workspace\n",
"\n",
"Initialize a [workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace(class%29) object from persisted configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')\n",
"ws.set_default_datastore(\"workspaceblobstore\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Upload data to the datastore"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Datastore.get(ws, \"workspaceblobstore\").upload_files([\"./20news.pkl\"], target_path=\"20newsgroups\", overwrite=True)\n",
"print(\"Upload call completed\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create an Azure ML experiment\n",
"Let's create an experiment named \"notebook-step-run-example\" and a folder to hold the notebook and other scripts. The script runs will be recorded under the experiment in Azure.\n",
"\n",
"The best practice is to use separate folders for the scripts and their dependent files for each step, and to specify that folder as the `source_directory` for the step. This helps reduce the size of the snapshot created for the step (only the specific folder is snapshotted). Since a change to any file in the `source_directory` triggers a re-upload of the snapshot, this also helps enable reuse of the step when there are no changes in its `source_directory`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Choose a name for the run history container in the workspace.\n",
"experiment_name = 'notebook-step-run-example'\n",
"source_directory = 'notebook_runner'\n",
"\n",
"experiment = Experiment(ws, experiment_name)\n",
"experiment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create or attach an AmlCompute cluster\n",
"You will need to create a [compute target](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.computetarget?view=azure-ml-py) for your remote run. In this tutorial, you get the default `AmlCompute` as your training compute resource."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Choose a name for your cluster.\n",
"amlcompute_cluster_name = \"cpu-cluster\"\n",
"\n",
"found = False\n",
"# Check if this compute target already exists in the workspace.\n",
"cts = ws.compute_targets\n",
"if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':\n",
"    found = True\n",
"    print('Found existing compute target.')\n",
"    compute_target = cts[amlcompute_cluster_name]\n",
"\n",
"if not found:\n",
"    print('Creating a new compute target...')\n",
"    provisioning_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_D2_V2\",  # for GPU, use \"STANDARD_NC6\"\n",
"                                                                # vm_priority='lowpriority',  # optional\n",
"                                                                max_nodes=4)\n",
"\n",
"    # Create the cluster.\n",
"    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)\n",
"\n",
"    # Can poll for a minimum number of nodes and for a specific timeout.\n",
"    # If no min_node_count is provided, it will use the scale settings for the cluster.\n",
"    compute_target.wait_for_completion(show_output=True, min_node_count=1, timeout_in_minutes=10)\n",
"\n",
"    # For a more detailed view of current AmlCompute status, use get_status()."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a new RunConfig object"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"conda_run_config = RunConfiguration(framework=\"python\")\n",
"\n",
"conda_run_config.environment.docker.enabled = True\n",
"conda_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE\n",
"\n",
"cd = CondaDependencies.create(pip_packages=['azureml-sdk'])\n",
"conda_run_config.environment.python.conda_dependencies = cd\n",
"\n",
"print('run config is ready')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define inputs and outputs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input_data = DataReference(\n",
"    datastore=Datastore.get(ws, \"workspaceblobstore\"),\n",
"    data_reference_name=\"blob_test_data\",\n",
"    path_on_datastore=\"20newsgroups/20news.pkl\")\n",
"\n",
"output_data = PipelineData(name=\"processed_data\",\n",
"                           datastore=Datastore.get(ws, \"workspaceblobstore\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create the notebook run configuration and set parameter values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"handler = AzureMLNotebookHandler(timeout=600, progress_bar=False, log_output=True)\n",
"\n",
"cfg = NotebookRunConfig(source_directory=source_directory, notebook=\"training_notebook.ipynb\",\n",
"                        handler=handler,\n",
"                        parameters={\"arg1\": \"Machine Learning\"},\n",
"                        run_config=conda_run_config)\n",
"\n",
"print(\"Notebook Run Config is created.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define the PythonScriptStep"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Source directory for the step is {}.'.format(os.path.realpath('./train')))\n",
"python_script_step = PythonScriptStep(\n",
"    script_name=\"train.py\",\n",
"    arguments=[\"--input_data\", input_data],\n",
"    inputs=[input_data],\n",
"    outputs=[output_data],\n",
"    compute_target=compute_target,\n",
"    source_directory=\"./train\",\n",
"    allow_reuse=True)\n",
"print(\"python_script_step created\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define the NotebookRunnerStep\n",
"\n",
"This step will consume the intermediate output produced by `python_script_step` as an input.\n",
"\n",
"Optionally, an output named by `output_notebook_pipeline_data_name` can be added to the `NotebookRunnerStep` to redirect the `output_notebook` of the notebook run to a step output produced as `PipelineData`, which can then be passed further along the pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.core import PipelineParameter\n",
"\n",
"output_from_notebook = PipelineData(name=\"notebook_processed_data\",\n",
"                                    datastore=Datastore.get(ws, \"workspaceblobstore\"))\n",
"\n",
"my_pipeline_param = PipelineParameter(name=\"pipeline_param\", default_value=\"my_param\")\n",
"\n",
"print('Source directory for the step is {}.'.format(os.path.realpath(source_directory)))\n",
"notebook_runner_step = NotebookRunnerStep(name=\"training_notebook_step\",\n",
"                                          notebook_run_config=cfg,\n",
"                                          params={\"my_pipeline_param\": my_pipeline_param},\n",
"                                          inputs=[output_data],\n",
"                                          outputs=[output_from_notebook],\n",
"                                          allow_reuse=True,\n",
"                                          compute_target=compute_target,\n",
"                                          output_notebook_pipeline_data_name=\"notebook_result\")\n",
"\n",
"print(\"Notebook Runner Step is created.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Build the Pipeline\n",
"\n",
"Once we have the steps (or a steps collection), we can build the [pipeline](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py). By default, all of these steps will run in **parallel** once we submit the pipeline for a run.\n",
"\n",
"A pipeline is created with a list of steps and a workspace. Submit a pipeline using [submit](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py#submit-experiment-name--pipeline-parameters-none--continue-on-step-failure-false--regenerate-outputs-false--parent-run-id-none----kwargs-). When submit is called, a [PipelineRun](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinerun?view=azure-ml-py) is created, which in turn creates [StepRun](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.steprun?view=azure-ml-py) objects for each step in the workflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline1 = Pipeline(workspace=ws, steps=[notebook_runner_step])\n",
"print(\"Pipeline creation complete\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline_run1 = experiment.submit(pipeline1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(pipeline_run1).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download the output notebook\n",
"\n",
"The `output_notebook` can be retrieved via the pipeline step output if `output_notebook_pipeline_data_name` was provided to the `NotebookRunnerStep`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline_run1.wait_for_completion()\n",
"train_step = pipeline_run1.find_step_run('training_notebook_step')  # retrieve the step runs by the name 'training_notebook_step'\n",
"\n",
"if train_step:\n",
"    train_step_obj = train_step[0]  # since we have only one step named 'training_notebook_step'\n",
"    train_step_obj.get_output_data('notebook_result').download(source_directory)  # download the output to source_directory"
]
}
],
"metadata": {
"authors": [
{
"name": "sanpil"
}
],
"category": "tutorial",
"compute": [
"AML Compute"
],
"datasets": [
"Custom"
],
"deployment": [
"None"
],
"exclude_from_index": false,
"framework": [
"Azure ML"
],
"friendly_name": "How to run a notebook as a step in AML Pipelines",
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"order_index": 12,
"star_tag": [
"None"
],
"tags": [
"None"
],
"task": "Demonstrates the use of NotebookRunnerStep"
},
"nbformat": 4,
"nbformat_minor": 2
}