{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Azure Machine Learning Pipelines: Getting Started\n",
"\n",
"## Overview\n",
"\n",
"\n",
"A common scenario when using machine learning components is to have a data workflow that includes the following steps:\n",
"\n",
"- Preparing/preprocessing a given dataset for training, followed by\n",
"- Training a machine learning model on this data, and then\n",
"- Deploying this trained model in a separate environment, and finally\n",
"- Running a batch scoring task on another data set, using the trained model.\n",
"\n",
"Azure's Machine Learning pipelines give you a way to combine multiple steps like these into one configurable workflow, so that multiple agents/users can share and/or reuse this workflow. Machine learning pipelines thus provide a consistent, reproducible mechanism for building, evaluating, deploying, and running ML systems.\n",
"\n",
"To get more information about Azure machine learning pipelines, please read our [Azure Machine Learning Pipelines](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines) overview, or the [readme article](../README.md).\n",
"\n",
"In this notebook, we provide a gentle introduction to Azure machine learning pipelines. We build a pipeline that runs jobs unattended on different compute clusters; in this notebook, you'll see how to use the basic Azure ML SDK APIs for constructing this pipeline.\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites and Azure Machine Learning Basics\n",
"If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the [configuration notebook](../../../configuration.ipynb) located at https://github.com/Azure/MachineLearningNotebooks first if you haven't. This sets you up with a working config file that has information on your workspace, subscription id, etc. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Azure Machine Learning Imports\n",
"\n",
"In this first code cell, we import key Azure Machine Learning modules that we will use below. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import azureml.core\n",
"from azureml.core import Workspace, Experiment, Datastore\n",
"from azureml.widgets import RunDetails\n",
"\n",
"# Check core SDK version number\n",
"print(\"SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pipeline-specific SDK imports\n",
"\n",
"Here, we import key pipeline modules, whose use will be illustrated in the examples below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.core import Pipeline\n",
"from azureml.pipeline.steps import PythonScriptStep\n",
"\n",
"print(\"Pipeline SDK-specific imports completed\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initialize Workspace\n",
"\n",
"Initialize a [workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace(class%29) object from persisted configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"create workspace"
]
},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')\n",
"\n",
"# Default datastore\n",
"def_blob_store = ws.get_default_datastore() \n",
"# The following call GETS the Azure Blob Store associated with your workspace.\n",
"# Note that workspaceblobstore is **the name of this store and CANNOT BE CHANGED and must be used as is** \n",
"def_blob_store = Datastore(ws, \"workspaceblobstore\")\n",
"print(\"Blobstore's name: {}\".format(def_blob_store.name))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Required data and script files for the tutorial\n",
"The sample files required to finish this tutorial have already been copied to the corresponding source_directory locations. Although the .py scripts provided with the samples contain very little \"ML work,\" in practice that is where most of your effort as a data scientist would go. For the purposes of this tutorial, the contents of these files are not important; the one-line scripts are for demonstration purposes only."
]
},
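{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, you can list the contents of the sample script folders (`./train`, `./compare`, and `./extract`) that the steps below expect; this assumes those folders sit next to this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional check: verify that the sample script folders used later in this notebook are present.\n",
"# This assumes ./train, ./compare, and ./extract sit next to this notebook.\n",
"for folder in ['./train', './compare', './extract']:\n",
"    if os.path.isdir(folder):\n",
"        print(folder, '->', os.listdir(folder))\n",
"    else:\n",
"        print(folder, 'not found; copy the sample files before running the steps below.')"
]
},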
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Datastore concepts\n",
"A [Datastore](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py) is a place where data can be stored and then made accessible to a compute target, either by mounting or by copying the data to the compute target. \n",
"\n",
"A Datastore can be backed either by an Azure Blob Storage (the workspace default, named `workspaceblobstore`) or by an Azure File Storage.\n",
"\n",
"In the next step, we will upload the input data for this tutorial to the workspace's default Blob storage. When to use [Azure Blobs](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction), [Azure Files](https://docs.microsoft.com/en-us/azure/storage/files/storage-files-introduction), or [Azure Disks](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/managed-disks-overview) is [detailed here](https://docs.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks).\n",
"\n",
"**Please take good note of the concept of the datastore.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Upload data to the default datastore\n",
"The default datastore on the workspace is the Azure Blob storage named `workspaceblobstore`; the workspace also has an Azure File storage associated with it. Let's upload a file to the default Blob storage."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get_default_datastore() gets the default Azure Blob Store associated with your workspace.\n",
"# Here we are reusing the def_blob_store object we obtained earlier\n",
"def_blob_store.upload_files([\"./20news.pkl\"], target_path=\"20newsgroups\", overwrite=True)\n",
"print(\"Upload call completed\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### (Optional) See your files using Azure Portal\n",
"Once you have successfully uploaded the file, you can browse to it (or upload more files) using the [Azure Portal](https://portal.azure.com). In the portal, make sure you have selected your subscription (click *Resource Groups* and then select the subscription). Then look for your **Machine Learning Workspace** name; it has a link to your storage account. Click the storage link to reach a page where you can see [Blobs](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction), [Files](https://docs.microsoft.com/en-us/azure/storage/files/storage-files-introduction), [Tables](https://docs.microsoft.com/en-us/azure/storage/tables/table-storage-overview), and [Queues](https://docs.microsoft.com/en-us/azure/storage/queues/storage-queues-introduction). We uploaded a file to the Blob storage in the step above, and you should be able to see it there under the `20newsgroups` path. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compute Targets\n",
"A compute target specifies where to execute your program, such as a remote VM running Docker or a cluster. A compute target needs to be addressable and accessible by you.\n",
"\n",
"**You need at least one compute target to send your payload to. This tutorial uses Azure Machine Learning Compute for all steps. In other scenarios you may need multiple compute targets, because some steps of a pipeline may run on one compute target (such as Azure Machine Learning Compute) while other steps in the same pipeline run on a different one.**\n",
"\n",
"*The examples below show how to create, retrieve, or attach to an Azure Machine Learning Compute cluster.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### List of Compute Targets on the workspace"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cts = ws.compute_targets\n",
"for ct in cts:\n",
"    print(ct)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Retrieve or create an Azure Machine Learning compute\n",
"Azure Machine Learning Compute is a service for provisioning and managing clusters of Azure virtual machines for running machine learning workloads. Let's create a new Azure Machine Learning Compute in the current workspace, if it doesn't already exist. We will then run the training script on this compute target.\n",
"\n",
"If a compute target with the given name cannot be found, the next cell creates a new Azure Machine Learning Compute cluster of **STANDARD_D2_V2 CPU VMs**. This process is broken down into the following steps:\n",
"\n",
"1. Create the configuration\n",
"2. Create the Azure Machine Learning compute\n",
"\n",
"**This process takes about 3 minutes and produces only sparse output along the way. Please make sure to wait until the call returns before moving to the next cell.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"aml_compute_target = \"cpu-cluster\"\n",
"try:\n",
"    aml_compute = AmlCompute(ws, aml_compute_target)\n",
"    print(\"found existing compute target.\")\n",
"except ComputeTargetException:\n",
"    print(\"creating new compute target\")\n",
"    \n",
"    provisioning_config = AmlCompute.provisioning_configuration(vm_size = \"STANDARD_D2_V2\",\n",
"                                                                min_nodes = 1,\n",
"                                                                max_nodes = 4)\n",
"    aml_compute = ComputeTarget.create(ws, aml_compute_target, provisioning_config)\n",
"    aml_compute.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
"    \n",
"print(\"Azure Machine Learning Compute attached\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# For a more detailed view of current Azure Machine Learning Compute status, use get_status()\n",
"# example: un-comment the following line.\n",
"# print(aml_compute.get_status().serialize())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Wait for this call to finish before proceeding (you will see the asterisk turn into a number).**\n",
"\n",
"Now that you have created the compute target, let's see what the workspace's `compute_targets` property returns. The next cell prints the entries; you should see one named 'cpu-cluster' of type AmlCompute."
]
},
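{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A quick check: ws.compute_targets is a dictionary mapping each compute target's name\n",
"# to its ComputeTarget object, so the cluster created above should appear as 'cpu-cluster'.\n",
"for name, target in ws.compute_targets.items():\n",
"    print(name, ':', type(target).__name__)"
]
},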
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Now that we have covered the basics of Azure Machine Learning (AML), let's move on to the pipeline concepts.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a Step in a Pipeline\n",
"A step is a unit of execution. A step typically needs a target of execution (a compute target) and a script to execute; it may also take script arguments and inputs, produce outputs, and accept a number of other parameters. Azure Machine Learning Pipelines provides the following built-in steps:\n",
"\n",
"- [**PythonScriptStep**](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.python_script_step.pythonscriptstep?view=azure-ml-py): Adds a step to run a Python script in a Pipeline.\n",
"- [**AdlaStep**](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.adla_step.adlastep?view=azure-ml-py): Adds a step to run a U-SQL script using Azure Data Lake Analytics.\n",
"- [**DataTransferStep**](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.data_transfer_step.datatransferstep?view=azure-ml-py): Transfers data between Azure Blob and Data Lake accounts.\n",
"- [**DatabricksStep**](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.databricks_step.databricksstep?view=azure-ml-py): Adds a Databricks notebook as a step in a Pipeline.\n",
"- [**HyperDriveStep**](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.hyper_drive_step.hyperdrivestep?view=azure-ml-py): Creates a HyperDrive step for hyperparameter tuning in a Pipeline.\n",
"- [**AzureBatchStep**](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.azurebatch_step.azurebatchstep?view=azure-ml-py): Creates a step for submitting jobs to Azure Batch.\n",
"- [**EstimatorStep**](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.estimator_step.estimatorstep?view=azure-ml-py): Adds a step to run an Estimator in a Pipeline.\n",
"- [**MpiStep**](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.mpi_step.mpistep?view=azure-ml-py): Adds a step to run an MPI job in a Pipeline.\n",
"- [**AutoMLStep**](https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl.automlstep?view=azure-ml-py): Creates an AutoML step in a Pipeline.\n",
"\n",
"The following code creates a PythonScriptStep that will be executed on the Azure Machine Learning Compute we created above, using train.py, one of the files already made available in the `source_directory`.\n",
"\n",
"A **PythonScriptStep** is a basic, built-in step to run a Python script on a compute target. It takes a script name and, optionally, other parameters such as arguments for the script, compute target, inputs, and outputs. If no compute target is specified, the default compute target for the workspace is used. You can also use a [**RunConfiguration**](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.runconfiguration?view=azure-ml-py) to specify requirements for the PythonScriptStep, such as conda dependencies and the Docker image.\n",
"> The best practice is to use a separate folder for each step's script and dependent files, and to specify that folder as the `source_directory` for the step. This helps reduce the size of the snapshot created for the step (only the specified folder is snapshotted). Because a change to any file in the `source_directory` triggers a re-upload of the snapshot, keeping the folders separate preserves reuse of a step when nothing in its `source_directory` has changed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uses default values for the PythonScriptStep constructor.\n",
"\n",
"source_directory = './train'\n",
"print('Source directory for the step is {}.'.format(os.path.realpath(source_directory)))\n",
"\n",
"# Syntax\n",
"# PythonScriptStep(\n",
"#     script_name,\n",
"#     name=None,\n",
"#     arguments=None,\n",
"#     compute_target=None,\n",
"#     runconfig=None,\n",
"#     inputs=None,\n",
"#     outputs=None,\n",
"#     params=None,\n",
"#     source_directory=None,\n",
"#     allow_reuse=True,\n",
"#     version=None,\n",
"#     hash_paths=None)\n",
"# This returns a Step\n",
"step1 = PythonScriptStep(name=\"train_step\",\n",
"                         script_name=\"train.py\",\n",
"                         compute_target=aml_compute,\n",
"                         source_directory=source_directory,\n",
"                         allow_reuse=True)\n",
"print(\"Step1 created\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** In the above call to PythonScriptStep(), the *allow_reuse* flag determines whether the step should reuse previous results when re-run with the same settings/inputs. The flag defaults to *True* because, when inputs and parameters have not changed, we typically do not want to re-run a given pipeline step. \n",
"\n",
"If *allow_reuse* is set to *False*, a new run is always generated for this step during pipeline execution. Set it to *False* when you want the step to run every time; leave it at *True* to skip re-running the step whenever its inputs and settings are unchanged. A sketch of a step that always re-runs follows below."
]
},
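{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a step that must run on every pipeline submission could be declared as below. This is only a sketch: it reuses the same `train.py` script and compute target as step1, and the step name `train_step_no_reuse` is just an illustrative choice (this step is not used in the pipelines built later in this notebook)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: disabling reuse makes this step run on every pipeline submission.\n",
"always_run_step = PythonScriptStep(name=\"train_step_no_reuse\",\n",
"                                   script_name=\"train.py\",\n",
"                                   compute_target=aml_compute,\n",
"                                   source_directory='./train',\n",
"                                   allow_reuse=False)\n",
"print(\"Step with allow_reuse=False created\")"
]
},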
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running a few steps in parallel\n",
"Here we look at a simple scenario where we run a few steps (each a PythonScriptStep) in parallel. Running steps in **parallel** is the default behavior for steps in a pipeline.\n",
"\n",
"We already defined one step earlier. Let's define a few more steps."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# For this step, we use a different source_directory\n",
"source_directory = './compare'\n",
"print('Source directory for the step is {}.'.format(os.path.realpath(source_directory)))\n",
"\n",
"# All steps use the same Azure Machine Learning compute target as well\n",
"step2 = PythonScriptStep(name=\"compare_step\",\n",
"                         script_name=\"compare.py\",\n",
"                         compute_target=aml_compute,\n",
"                         source_directory=source_directory)\n",
"\n",
"# Use a RunConfiguration to specify some additional requirements for this step.\n",
"from azureml.core.runconfig import RunConfiguration\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"from azureml.core.runconfig import DEFAULT_CPU_IMAGE\n",
"\n",
"# create a new runconfig object\n",
"run_config = RunConfiguration()\n",
"\n",
"# enable Docker \n",
"run_config.environment.docker.enabled = True\n",
"\n",
"# set Docker base image to the default CPU-based image\n",
"run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE\n",
"\n",
"# use conda_dependencies.yml to create a conda environment in the Docker image for execution\n",
"run_config.environment.python.user_managed_dependencies = False\n",
"\n",
"# specify CondaDependencies obj\n",
"run_config.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'])\n",
"\n",
"# For this step, we use yet another source_directory\n",
"source_directory = './extract'\n",
"print('Source directory for the step is {}.'.format(os.path.realpath(source_directory)))\n",
"\n",
"step3 = PythonScriptStep(name=\"extract_step\",\n",
"                         script_name=\"extract.py\",\n",
"                         compute_target=aml_compute,\n",
"                         source_directory=source_directory,\n",
"                         runconfig=run_config)\n",
"\n",
"# list of steps to run\n",
"steps = [step1, step2, step3]\n",
"print(\"Step lists created\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Build the pipeline\n",
"Once we have the steps (or a collection of steps), we can build the [pipeline](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py). By default, all these steps will run in **parallel** once we submit the pipeline.\n",
"\n",
"A pipeline is created with a list of steps and a workspace. Submit a pipeline using [submit](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment(class)?view=azure-ml-py#submit-config--tags-none----kwargs-). When submit is called, a [PipelineRun](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinerun?view=azure-ml-py) is created, which in turn creates [StepRun](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.steprun?view=azure-ml-py) objects for each step in the workflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Syntax\n",
"# Pipeline(workspace,\n",
"#          steps,\n",
"#          description=None,\n",
"#          default_datastore_name=None,\n",
"#          default_source_directory=None,\n",
"#          resolve_closure=True,\n",
"#          _workflow_provider=None,\n",
"#          _service_endpoint=None)\n",
"\n",
"pipeline1 = Pipeline(workspace=ws, steps=steps)\n",
"print(\"Pipeline is built\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Validate the pipeline\n",
"You have the option to [validate](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py#validate--) the pipeline prior to submitting it for a run. The platform runs validation steps, such as checking for circular dependencies and validating parameters, even if you do not explicitly call the validate method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline1.validate()\n",
"print(\"Pipeline validation complete\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit the pipeline\n",
"[Submitting](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py#submit) the pipeline involves creating an [Experiment](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment?view=azure-ml-py) object and providing the built pipeline for submission. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Submit syntax\n",
"# submit(experiment_name,\n",
"#        pipeline_params=None,\n",
"#        continue_on_step_failure=False,\n",
"#        regenerate_outputs=False)\n",
"\n",
"pipeline_run1 = Experiment(ws, 'Hello_World1').submit(pipeline1, regenerate_outputs=False)\n",
"print(\"Pipeline is submitted for execution\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** If regenerate_outputs is set to True, the submission forces generation of all step outputs and disallows data reuse for any step of this run. Once this run is complete, however, subsequent runs may reuse its results. A sketch of such a submission follows below.\n"
]
},
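{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, if you wanted to force every step to run from scratch, you could submit with `regenerate_outputs=True` as sketched below. The call is left commented out so it does not start another run; the experiment name 'Hello_World1' simply reuses the one from the submission above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: force regeneration of all step outputs (commented out to avoid an extra submission).\n",
"# pipeline_run_fresh = Experiment(ws, 'Hello_World1').submit(pipeline1, regenerate_outputs=True)\n",
"# print(\"Pipeline submitted with regenerate_outputs=True\")"
]
},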
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Examine the pipeline run\n",
"\n",
"#### Use RunDetails Widget\n",
"We are going to use the RunDetails widget to examine the run of the pipeline. You can click each row below to get more details on the step runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RunDetails(pipeline_run1).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Use Pipeline SDK objects\n",
"You can iterate through the step runs (the children of the pipeline run) and examine the job logs, stdout, and stderr of each step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"step_runs = pipeline_run1.get_children()\n",
"for step_run in step_runs:\n",
"    status = step_run.get_status()\n",
"    print('Script:', step_run.name, 'status:', status)\n",
"    \n",
"    # Change this if you want to see details even if the Step has succeeded.\n",
"    if status == \"Failed\":\n",
"        joblog = step_run.get_job_log()\n",
"        print('job log:', joblog)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Get additional run details\n",
"If you wait until the pipeline_run has finished, you can get additional details on the run, such as step metrics. **Since this is a blocking call, the following code is commented out.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#pipeline_run1.wait_for_completion()\n",
"#for step_run in pipeline_run1.get_children():\n",
"#    print(\"{}: {}\".format(step_run.name, step_run.get_metrics()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running a few steps in sequence\n",
"Now let's see how to run a few steps in sequence. We already have three steps defined earlier; let's *reuse* them for this part.\n",
"\n",
"We will reuse step1, step2, and step3, but build the pipeline so that step3 runs after step2, and step2 runs after step1. There is no explicit data dependency between these steps, but steps can still be made dependent on each other by using the [run_after](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.builder.pipelinestep?view=azure-ml-py#run-after-step-) construct."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"step2.run_after(step1)\n",
"step3.run_after(step2)\n",
"\n",
"# Try a cycle: uncommenting the next line would create a circular dependency,\n",
"# which pipeline validation rejects.\n",
"#step2.run_after(step3)\n",
"\n",
"# Now, construct the pipeline using the steps.\n",
"\n",
"# We can specify just the \"final step\" in the chain;\n",
"# Pipeline will take care of the \"transitive closure\" and\n",
"# figure out the implicit or explicit dependencies.\n",
"# https://www.geeksforgeeks.org/transitive-closure-of-a-graph/\n",
"pipeline2 = Pipeline(workspace=ws, steps=[step3])\n",
"print(\"Pipeline is built\")\n",
"\n",
"pipeline2.validate()\n",
"print(\"Simple validation complete\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline_run2 = Experiment(ws, 'Hello_World2').submit(pipeline2)\n",
"print(\"Pipeline is submitted for execution\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RunDetails(pipeline_run2).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Next: Pipelines with data dependency\n",
"The next [notebook](./aml-pipelines-with-data-dependency-steps.ipynb) demonstrates how to construct a pipeline with a data dependency."
]
}
],
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "sanpil"
|
|
}
|
|
],
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.7"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
} |