mirror of https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-19 17:17:04 -05:00
release 1.0.15
@@ -102,3 +102,5 @@ pip install azureml-sdk[explain]
# install the core SDK and experimental components
pip install azureml-sdk[contrib]
```
Drag and Drop
The image will be downloaded by Fatkun
@@ -11,7 +11,7 @@ and maintaining the complete data science workflow from the cloud.
```sh
pip install azureml-sdk
```
Read more detailed instructions on [how to set up your environment](./NBSETUP.md).
Read more detailed instructions on [how to set up your environment](./NBSETUP.md) using the Azure Notebooks service, your own Jupyter notebook server, or Docker.

## How to navigate and use the example notebooks?
You should always run the [Configuration](./configuration.ipynb) notebook first when setting up a notebook library on a new machine or in a new environment. It configures your notebook library to connect to an Azure Machine Learning workspace, and sets up your workspace and compute to be used by many of the other examples.

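As a minimal sketch (assuming the configuration notebook has already written its `config.json`), later notebooks pick up that saved workspace configuration like this:

```python
from azureml.core import Workspace

# Loads the subscription id, resource group, and workspace name saved by configuration.ipynb.
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, sep='\n')
```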
@@ -96,7 +96,7 @@
"source": [
"import azureml.core\n",
"\n",
"print(\"This notebook was created using version 1.0.10 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.0.6 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},

@@ -169,6 +169,9 @@ bash automl_setup_linux.sh
  - How to specify sample_weight
  - The difference that it makes to test results

- [auto-ml-subsampling-local.ipynb](subsampling/auto-ml-subsampling-local.ipynb)
  - How to enable subsampling

- [auto-ml-dataprep.ipynb](dataprep/auto-ml-dataprep.ipynb)
  - Using DataPrep for reading data

@@ -13,19 +13,7 @@ dependencies:
- pandas>=0.22.0,<0.23.0
- tensorflow>=1.12.0

# Required for azuremlftk
- dill
- pyodbc
- statsmodels
- numexpr
- keras
- distributed>=1.21.5,<1.24

- pip:

  # Required for azuremlftk
  - https://azuremlpackages.blob.core.windows.net/forecasting/azuremlftk-0.1.18323.5a1-py3-none-any.whl

  # Required packages for AzureML execution, history, and data preparation.
  - azureml-sdk[automl,notebooks,explain]
  - pandas_ml

@@ -13,19 +13,7 @@ dependencies:
- pandas>=0.22.0,<0.23.0
- tensorflow>=1.12.0

# Required for azuremlftk
- dill
- pyodbc
- statsmodels
- numexpr
- keras
- distributed>=1.21.5,<1.24

- pip:

  # Required for azuremlftk
  - https://azuremlpackages.blob.core.windows.net/forecasting/azuremlftk-0.1.18323.5a1-py3-none-any.whl

  # Required packages for AzureML execution, history, and data preparation.
  - azureml-sdk[automl,notebooks,explain]
  - pandas_ml

@@ -47,30 +47,6 @@
"## Setup\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To use the *forecasting* task in AutoML, you need to have the **azuremlftk** package installed in your environment. The following cell tests whether this package is installed locally and, if not, gives you instructions for installing it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
"    import ftk\n",
"    print('Using FTK version ' + ftk.__version__)\n",
"except ImportError:\n",
"    print(\"Unable to import forecasting package. This notebook does not work without this package.\\n\"\n",
"          + \"Please open a command prompt and run `pip install azuremlftk` to install the package. \\n\"\n",
"          + \"Make sure you install the package into AutoML's Python environment.\\n\\n\"\n",
"          + \"For instance, if AutoML is installed in a conda environment called `python36`, run:\\n\"\n",
"          + \"> activate python36\\n> pip install azuremlftk\")"
]
},
{
"cell_type": "code",
"execution_count": null,

@@ -38,7 +38,7 @@
"3. Find and train a forecasting model using local compute\n",
"4. Evaluate the performance of the model\n",
"\n",
"The examples in the following code samples use the [University of Chicago's Dominick's Finer Foods dataset](https://research.chicagobooth.edu/kilts/marketing-databases/dominicks) to forecast orange juice sales. Dominick's was a grocery chain in the Chicago metropolitan area."
"The examples in the following code samples use the University of Chicago's Dominick's Finer Foods dataset to forecast orange juice sales. Dominick's was a grocery chain in the Chicago metropolitan area."
]
},
{
@@ -48,30 +48,6 @@
"## Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To use the *forecasting* task in AutoML, you need to have the **azuremlftk** package installed in your environment. The following cell tests whether this package is installed locally and, if not, gives you instructions for installing it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
"    import ftk\n",
"    print('Using FTK version ' + ftk.__version__)\n",
"except ImportError:\n",
"    print(\"Unable to import forecasting package. This notebook does not work without this package.\\n\"\n",
"          + \"Please open a command prompt and run `pip install azuremlftk` to install the package. \\n\"\n",
"          + \"Make sure you install the package into AutoML's Python environment.\\n\\n\"\n",
"          + \"For instance, if AutoML is installed in a conda environment called `python36`, run:\\n\"\n",
"          + \"> activate python36\\n> pip install azuremlftk\")"
]
},
{
"cell_type": "code",
"execution_count": null,

@@ -0,0 +1,218 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automated Machine Learning\n",
"_**Classification with Local Compute**_\n",
"\n",
"## Contents\n",
"1. [Introduction](#Introduction)\n",
"1. [Setup](#Setup)\n",
"1. [Data](#Data)\n",
"1. [Train](#Train)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"In this example we will explore AutoML's subsampling feature, which is useful for speeding up convergence when training on large datasets.\n",
"\n",
"The setup is quite similar to a normal classification run, with the exception of the `enable_subsampling` option. Keep in mind that even with the `enable_subsampling` flag set, subsampling will only run for large datasets (>= 50k rows) with a high (>= 85) or unrestricted iteration count.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"import azureml.core\n",
"from azureml.core.experiment import Experiment\n",
"from azureml.core.workspace import Workspace\n",
"from azureml.train.automl import AutoMLConfig\n",
"from azureml.train.automl.run import AutoMLRun"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# Choose a name for the experiment and specify the project folder.\n",
"experiment_name = 'automl-subsampling'\n",
"project_folder = './sample_projects/automl-subsampling'\n",
"\n",
"experiment = Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output['SDK version'] = azureml.core.VERSION\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace Name'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Project Directory'] = project_folder\n",
"output['Experiment Name'] = experiment.name\n",
"pd.set_option('display.max_colwidth', -1)\n",
"pd.DataFrame(data = output, index = ['']).T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Opt-in diagnostics for better experience, quality, and security of future releases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.telemetry import set_diagnostics_collection\n",
"set_diagnostics_collection(send_diagnostics = True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"\n",
"We will create a simple dataset using the numpy sin function just for this example. We need just over 50k rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"base = np.arange(60000)\n",
"cos = np.cos(base)\n",
"y = np.round(np.sin(base)).astype('int')\n",
"\n",
"# Stack the index and its cosine into a two-column feature matrix; the rounded sine is the label.\n",
"X_train = np.hstack((base.reshape(-1, 1), cos.reshape(-1, 1)))\n",
"y_train = y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train\n",
"\n",
"Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.\n",
"\n",
"|Property|Description|\n",
"|-|-|\n",
"|**enable_subsampling**|Enables subsampling as an option. It does not guarantee that subsampling will be used; that also depends on the dataset size and the minimum number of iterations expected to run.|\n",
"|**iterations**|Number of iterations. Subsampling schedules many iterations at small sample percentages, so iterations must be set high for subsampling to be used.|\n",
"|**experiment_timeout_minutes**|The experiment timeout. It is set to 5 here to keep the demo short; raise it if you want all iterations to finish.|\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_config = AutoMLConfig(task = 'classification',\n",
"                             debug_log = 'automl_errors.log',\n",
"                             primary_metric = 'accuracy',\n",
"                             iterations = 85,\n",
"                             experiment_timeout_minutes = 5,\n",
"                             n_cross_validations = 2,\n",
"                             verbosity = logging.INFO,\n",
"                             X = X_train,\n",
"                             y = y_train,\n",
"                             enable_subsampling=True,\n",
"                             path = project_folder)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.\n",
"In this example, we specify `show_output = True` to print currently running iterations to the console."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_run = experiment.submit(automl_config, show_output = True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"authors": [
{
"name": "rogehe"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -123,6 +123,13 @@
"ws.get_details()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
@@ -263,14 +270,15 @@
"# If your data is in a dataframe, please use read_pandas_dataframe to convert the dataframe to a dataflow before using dprep.\n",
"\n",
"import azureml.dataprep as dprep\n",
"\n",
"# You can use `auto_read_file`, which intelligently figures out the delimiters and datatypes of a file.\n",
"# The data referenced here was pulled from `sklearn.datasets.load_digits()`.\n",
"simple_example_data_root = 'https://dprepdata.blob.core.windows.net/automl-notebook-data/'\n",
"X_train = dprep.auto_read_file(simple_example_data_root + 'X.csv').skip(1)  # Remove the header row.\n",
"\n",
"# Convert a Pandas DataFrame to a Dataflow\n",
"# The read_pandas_dataframe reader can take a DataFrame and use it as the data source for a Dataflow.\n",
"X_train = dprep.read_pandas_dataframe(pd.read_csv(simple_example_data_root + 'X.csv'), temp_folder='/dbfs/dataset_dataflowX_train')\n",
"y_train = dprep.read_pandas_dataframe(pd.read_csv(simple_example_data_root + 'y.csv'), temp_folder='/dbfs/dataset_dataflowy_train').to_long(dprep.ColumnSelector(term='.*', use_regex = True))\n"
"# You can also use `read_csv` and `to_*` transformations to read (with an overridable delimiter)\n",
"# and convert column types manually.\n",
"# Here we read a comma-delimited file and convert all columns to integers.\n",
"y_train = dprep.read_csv(simple_example_data_root + 'y.csv').to_long(dprep.ColumnSelector(term='.*', use_regex = True))"
]
},
{
@@ -287,16 +295,7 @@
"metadata": {},
"outputs": [],
"source": [
"X_train.get_profile()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_train.get_profile()"
"X_train.skip(1).head(5)"
]
},
{
@@ -334,8 +333,7 @@
"                             debug_log = 'automl_errors.log',\n",
"                             primary_metric = 'AUC_weighted',\n",
"                             iteration_timeout_minutes = 10,\n",
"                             iterations = 5,\n",
"                             preprocess = True,\n",
"                             iterations = 30,\n",
"                             n_cross_validations = 10,\n",
"                             max_concurrent_iterations = 2, # change it based on number of worker nodes\n",
"                             verbosity = logging.INFO,\n",
@@ -351,7 +349,8 @@
"source": [
"## Train the Models\n",
"\n",
"Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while."
"Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.\n",
"In this example, we specify `show_output = True` to print currently running iterations to the console."
]
},
{
@@ -360,7 +359,7 @@
"metadata": {},
"outputs": [],
"source": [
"local_run = experiment.submit(automl_config, show_output = False) # for longer runs, use show_output=False and monitor with the widget below"
"local_run = experiment.submit(automl_config, show_output = True) # for longer runs, use show_output=False and monitor with the widget below"
]
},
{
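The widget referred to in the comment is the run-details widget used elsewhere in these notebooks; a minimal sketch of monitoring a run submitted with `show_output = False`:

```python
from azureml.widgets import RunDetails

# Renders an interactive view of the run's iterations and metrics inside the notebook.
RunDetails(local_run).show()
```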
@@ -550,11 +549,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.7.0"
},
"name": "auto-ml-classification-local-adb",
"notebookId": 587284549713154
"notebookId": 817220787969977
},
"nbformat": 4,
"nbformat_minor": 1
"nbformat_minor": 0
}
@@ -99,10 +99,10 @@
"metadata": {},
"outputs": [],
"source": [
"subscription_id = \"<Your SubscriptionId>\" # you should be owner or contributor\n",
"resource_group = \"<Resource group - new or existing>\" # you should be owner or contributor\n",
"workspace_name = \"<workspace to be created>\" # your workspace name\n",
"workspace_region = \"<azureregion>\" # your region"
"subscription_id = \"<Your SubscriptionId>\"\n",
"resource_group = \"<Resource group - new or existing>\"\n",
"workspace_name = \"<workspace to be created>\"\n",
"workspace_region = \"<azureregion>\""
]
},
{
@@ -134,7 +134,7 @@
"ws = Workspace.create(name = workspace_name,\n",
"                      subscription_id = subscription_id,\n",
"                      resource_group = resource_group,\n",
"                      location = workspace_region, \n",
"                      location = workspace_region,\n",
"                      exist_ok=True)\n",
"ws.get_details()"
]
@@ -160,8 +160,7 @@
"                     resource_group = resource_group)\n",
"\n",
"# Persist the subscription id, resource group name, and workspace name in aml_config/config.json.\n",
"ws.write_config()\n",
"write_config(path=\"/databricks/driver/aml_config/\",file_name=<alias_conf.cfg>)"
"ws.write_config()"
]
},
{
@@ -263,13 +262,6 @@
"set_diagnostics_collection(send_diagnostics = True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Convert a Pandas Dataframe to a Dataflow"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -284,16 +276,15 @@
"outputs": [],
"source": [
"import azureml.dataprep as dprep\n",
"\n",
"# You can use `auto_read_file`, which intelligently figures out the delimiters and datatypes of a file.\n",
"# The data referenced here was pulled from `sklearn.datasets.load_digits()`.\n",
"simple_example_data_root = 'https://dprepdata.blob.core.windows.net/automl-notebook-data/'\n",
"X_train = dprep.auto_read_file(simple_example_data_root + 'X.csv').skip(1)  # Remove the header row.\n",
"\n",
"# Convert a Pandas DataFrame to a Dataflow\n",
"# The read_pandas_dataframe reader can take a DataFrame and use it as the data source for a Dataflow.\n",
"X_train = dprep.read_pandas_dataframe(pd.read_csv(simple_example_data_root + 'X.csv'), temp_folder='/dbfs/dataset_dataflowX_train')\n",
"y_train = dprep.read_pandas_dataframe(pd.read_csv(simple_example_data_root + 'y.csv'), temp_folder='/dbfs/dataset_dataflowy_train').to_long(dprep.ColumnSelector(term='.*', use_regex = True))\n",
"\n",
"\n"
"# You can also use `read_csv` and `to_*` transformations to read (with an overridable delimiter)\n",
"# and convert column types manually.\n",
"# Here we read a comma-delimited file and convert all columns to integers.\n",
"y_train = dprep.read_csv(simple_example_data_root + 'y.csv').to_long(dprep.ColumnSelector(term='.*', use_regex = True))"
]
},
{
@@ -310,16 +301,7 @@
"metadata": {},
"outputs": [],
"source": [
"X_train.get_profile()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_train.get_profile()"
"X_train.skip(1).head(5)"
]
},
{
@@ -357,14 +339,14 @@
"                             debug_log = 'automl_errors.log',\n",
"                             primary_metric = 'AUC_weighted',\n",
"                             iteration_timeout_minutes = 10,\n",
"                             iterations = 30,\n",
"                             preprocess = True,\n",
"                             n_cross_validations = 10,\n",
"                             max_concurrent_iterations = 2, # change it based on number of worker nodes\n",
"                             iterations = 5,\n",
"                             n_cross_validations = 2,\n",
"                             max_concurrent_iterations = 4, # change it based on number of worker nodes\n",
"                             verbosity = logging.INFO,\n",
"                             spark_context=sc, # databricks/spark related\n",
"                             X = X_train,\n",
"                             y = y_train,\n",
"                             enable_cache=False,\n",
"                             path = project_folder)"
]
},
@@ -374,7 +356,8 @@
"source": [
"## Train the Models\n",
"\n",
"Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while."
"Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.\n",
"In this example, we specify `show_output = True` to print currently running iterations to the console."
]
},
{
@@ -383,7 +366,7 @@
"metadata": {},
"outputs": [],
"source": [
"local_run = experiment.submit(automl_config, show_output = False) # for longer runs, use show_output=False and monitor with the widget below"
"local_run = experiment.submit(automl_config, show_output = True) # for longer runs, use show_output=False and monitor with the widget below"
]
},
{
@@ -436,7 +419,6 @@
"metricslist = {}\n",
"for run in children:\n",
"    properties = run.get_properties()\n",
"    #print(properties)\n",
"    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
"    metricslist[int(properties['iteration'])] = metrics\n",
"\n",
@@ -712,11 +694,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.7.0"
},
"name": "auto-ml-classification-local-adb",
"notebookId": 2733885892129020
"notebookId": 3888835968049288
},
"nbformat": 4,
"nbformat_minor": 1
"nbformat_minor": 0
}
@@ -397,11 +397,11 @@
"source": [
"### b. Connect Blob to Power BI (Small Data only)\n",
"1. Download and open Power BI Desktop.\n",
"2. Select \u201cGet Data\u201d and click on \u201cAzure Blob Storage\u201d >> Connect\n",
"2. Select \"Get Data\" and click on \"Azure Blob Storage\" >> Connect\n",
"3. Add your storage account and enter your storage key.\n",
"4. Select the container where your Data Collection is stored and click on Edit.\n",
"5. In the query editor, click under the \u201cName\u201d column and add your storage account model path into the filter. Note: if you want to look only into files from a specific year or month, just expand the filter path. For example, to look only into March data: /modeldata/<subscriptionid>/<resourcegroupname>/<workspacename>/<webservicename>/<modelname>/<modelversion>/<identifier>/<year>/3\n",
"6. Click on the double arrow beside the \u201cContent\u201d column to combine the files.\n",
"5. In the query editor, click under the \"Name\" column and add your storage account model path into the filter. Note: if you want to look only into files from a specific year or month, just expand the filter path. For example, to look only into March data: /modeldata/<subscriptionid>/<resourcegroupname>/<workspacename>/<webservicename>/<modelname>/<modelversion>/<identifier>/<year>/3\n",
"6. Click on the double arrow beside the \"Content\" column to combine the files.\n",
"7. Click OK and the data will preload.\n",
"8. You can now click Close and Apply and start building your custom reports on your Model Input data."
]

@@ -44,6 +44,9 @@ In this directory, there are two types of notebooks:
4. [aml-pipelines-data-transfer.ipynb](https://aka.ms/pl-data-trans)
5. [aml-pipelines-use-databricks-as-compute-target.ipynb](https://aka.ms/pl-databricks)
6. [aml-pipelines-use-adla-as-compute-target.ipynb](https://aka.ms/pl-adla)
7. [aml-pipelines-parameter-tuning-with-hyperdrive.ipynb](https://aka.ms/pl-hyperdrive)
8. [aml-pipelines-how-to-use-azurebatch-to-run-a-windows-executable.ipynb](https://aka.ms/pl-azbatch)
9. [aml-pipelines-setup-schedule-for-a-published-pipeline.ipynb](https://aka.ms/pl-schedule)

* The second type of notebook illustrates more sophisticated scenarios; these notebooks are independent of each other. They include:

@@ -17,7 +17,7 @@
"\n",
"In certain cases, you will need to transfer data from one data location to another. For example, your data may be in Files storage and you may want to move it to Blob storage. Or your data may be in an ADLS account and you may want to make it available in Blob storage. The built-in **DataTransferStep** class helps you transfer data in these situations.\n",
"\n",
"The example below shows how to move data in an ADLS account to Blob storage."
"The example below shows how to move data between an ADLS account, Blob storage, SQL Server, and a PostgreSQL server."
]
},
{
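As a minimal sketch of the ADLS-to-Blob case (the names `adls_data_ref`, `blob_data_ref`, and `data_factory_compute` are assumed to be set up as in the cells that follow):

```python
from azureml.pipeline.steps import DataTransferStep

# Copies the data behind the ADLS reference to the Blob reference on a Data Factory compute.
transfer_adls_to_blob = DataTransferStep(
    name="transfer_adls_to_blob",
    source_data_reference=adls_data_ref,
    destination_data_reference=blob_data_ref,
    compute_target=data_factory_compute)
```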
@@ -85,7 +85,9 @@
"\n",
"For background on registering your data store, consult this article:\n",
"\n",
"https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory"
"https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory\n",
"\n",
"### Register datastores for Azure Data Lake and Azure Blob storage"
]
},
{
@@ -146,7 +148,65 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create DataReferences"
"### Register datastores for Azure SQL Server and Azure Database for PostgreSQL"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"sql_datastore_name=\"MySqlDatastore\"\n",
"server_name=os.getenv(\"SQL_SERVERNAME_62\", \"<my-server-name>\") # Name of the SQL server\n",
"database_name=os.getenv(\"SQL_DATBASENAME_62\", \"<my-database-name>\") # Name of the SQL database\n",
"client_id=os.getenv(\"SQL_CLIENTNAME_62\", \"<my-client-id>\") # Client id of a service principal with permissions to access the database\n",
"client_secret=os.getenv(\"SQL_CLIENTSECRET_62\", \"<my-client-secret>\") # Secret of the service principal\n",
"tenant_id=os.getenv(\"SQL_TENANTID_62\", \"<my-tenant-id>\") # Tenant id of the service principal\n",
"\n",
"try:\n",
"    sql_datastore = Datastore.get(ws, sql_datastore_name)\n",
"    print(\"found sql database datastore with name: %s\" % sql_datastore_name)\n",
"except HttpOperationError:\n",
"    sql_datastore = Datastore.register_azure_sql_database(\n",
"        workspace=ws,\n",
"        datastore_name=sql_datastore_name,\n",
"        server_name=server_name,\n",
"        database_name=database_name,\n",
"        client_id=client_id,\n",
"        client_secret=client_secret,\n",
"        tenant_id=tenant_id)\n",
"    print(\"registered sql database datastore with name: %s\" % sql_datastore_name)\n",
"\n",
"\n",
"psql_datastore_name=\"MyPostgreSqlDatastore\"\n",
"server_name=os.getenv(\"PSQL_SERVERNAME_62\", \"<my-server-name>\") # Name of the PostgreSQL server\n",
"database_name=os.getenv(\"PSQL_DATBASENAME_62\", \"<my-database-name>\") # Name of the PostgreSQL database\n",
"user_id=os.getenv(\"PSQL_USERID_62\", \"<my-user-id>\") # User id\n",
"user_password=os.getenv(\"PSQL_USERPW_62\", \"<my-user-password>\") # User password\n",
"\n",
"try:\n",
"    psql_datastore = Datastore.get(ws, psql_datastore_name)\n",
"    print(\"found PostgreSQL database datastore with name: %s\" % psql_datastore_name)\n",
"except HttpOperationError:\n",
"    psql_datastore = Datastore.register_azure_postgre_sql(\n",
"        workspace=ws,\n",
"        datastore_name=psql_datastore_name,\n",
"        server_name=server_name,\n",
"        database_name=database_name,\n",
"        user_id=user_id,\n",
"        user_password=user_password)\n",
"    print(\"registered PostgreSQL database datastore with name: %s\" % psql_datastore_name)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create DataReferences\n",
"### Create DataReferences for Azure Data Lake and Azure Blob storage"
]
},
{
@@ -174,6 +234,39 @@
"print(\"obtained adls, blob data references\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create DataReferences for Azure SQL Server and Azure Database for PostgreSQL"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.data.sql_data_reference import SqlDataReference\n",
"\n",
"sql_datastore = Datastore(workspace=ws, name=\"MySqlDatastore\")\n",
"\n",
"sql_query_data_ref = SqlDataReference(\n",
"    datastore=sql_datastore,\n",
"    data_reference_name=\"sql_query_data_ref\",\n",
"    sql_query=\"select top 1 * from TestData\")\n",
"\n",
"\n",
"psql_datastore = Datastore(workspace=ws, name=\"MyPostgreSqlDatastore\")\n",
"\n",
"psql_query_data_ref = SqlDataReference(\n",
"    datastore=psql_datastore,\n",
"    data_reference_name=\"psql_query_data_ref\",\n",
"    sql_query=\"SELECT * FROM testtable\")\n",
"\n",
"print(\"obtained Sql server, PostgreSQL data references\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -251,6 +344,29 @@
"print(\"data transfer step created\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"transfer_sql_to_blob = DataTransferStep(\n",
"    name=\"transfer_sql_to_blob\",\n",
"    source_data_reference=sql_query_data_ref,\n",
"    destination_data_reference=blob_data_ref,\n",
"    compute_target=data_factory_compute,\n",
"    destination_reference_type='file')\n",
"\n",
"transfer_psql_to_blob = DataTransferStep(\n",
"    name=\"transfer_psql_to_blob\",\n",
"    source_data_reference=psql_query_data_ref,\n",
"    destination_data_reference=blob_data_ref,\n",
"    compute_target=data_factory_compute,\n",
"    destination_reference_type='file')\n",
"\n",
"print(\"data transfer step created for Sql server and PostgreSQL\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -264,13 +380,28 @@
"metadata": {},
"outputs": [],
"source": [
"pipeline = Pipeline(\n",
"    description=\"data_transfer_101\",\n",
"pipeline_01 = Pipeline(\n",
"    description=\"data_transfer_01\",\n",
"    workspace=ws,\n",
"    steps=[transfer_adls_to_blob])\n",
"\n",
"pipeline_run = Experiment(ws, \"Data_Transfer_example\").submit(pipeline)\n",
"pipeline_run.wait_for_completion()"
"pipeline_run_01 = Experiment(ws, \"Data_Transfer_example_01\").submit(pipeline_01)\n",
"pipeline_run_01.wait_for_completion()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline_02 = Pipeline(\n",
"    description=\"data_transfer_02\",\n",
"    workspace=ws,\n",
"    steps=[transfer_sql_to_blob, transfer_psql_to_blob])\n",
"\n",
"pipeline_run_02 = Experiment(ws, \"Data_Transfer_example_02\").submit(pipeline_02)\n",
"pipeline_run_02.wait_for_completion()"
]
},
{
@@ -287,7 +418,17 @@
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(pipeline_run).show()"
"RunDetails(pipeline_run_01).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(pipeline_run_02).show()"
]
},
{
@@ -320,7 +461,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
"version": "3.6.2"
}
},
"nbformat": 4,

@@ -16,10 +16,20 @@
"\n",
"## Overview\n",
"\n",
"Read the [Azure Machine Learning Pipelines](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines) overview, or the [readme article](../README.md) on Azure Machine Learning Pipelines, to get more information.\n",
"\n",
"This notebook shows the basic construction of a **pipeline** that runs jobs unattended on different compute clusters."
"A common scenario when using machine learning components is to have a data workflow that includes the following steps:\n",
"\n",
"- Preparing/preprocessing a given dataset for training, followed by\n",
"- Training a machine learning model on this data, and then\n",
"- Deploying this trained model in a separate environment, and finally\n",
"- Running a batch scoring task on another data set, using the trained model.\n",
"\n",
"Azure's Machine Learning pipelines give you a way to combine multiple steps like these into one configurable workflow, so that multiple agents/users can share and/or reuse this workflow. Machine learning pipelines thus provide a consistent, reproducible mechanism for building, evaluating, deploying, and running ML systems.\n",
"\n",
"To get more information about Azure machine learning pipelines, please read our [Azure Machine Learning Pipelines](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines) overview, or the [readme article](../README.md).\n",
"\n",
"In this notebook, we provide a gentle introduction to Azure machine learning pipelines. We build a pipeline that runs jobs unattended on different compute clusters; in this notebook, you'll see how to use the basic Azure ML SDK APIs for constructing this pipeline."
]
},
{
@@ -45,6 +55,7 @@
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import azureml.core\n",
"from azureml.core import Workspace, Experiment, Datastore\n",
"from azureml.core.compute import AmlCompute\n",
@@ -119,7 +130,7 @@
"# project folder\n",
"project_folder = '.'\n",
"\n",
"print('Sample projects will be created in {}.'.format(project_folder))"
"print('Sample projects will be created in {}.'.format(os.path.realpath(project_folder)))"
]
},
{
@@ -135,7 +146,7 @@
"metadata": {},
"source": [
"### Datastore concepts\n",
"A [Datastore](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore(class)) is a place where data can be stored that is then made accessible to a compute either by means of mounting or copying the data to the compute target.\n",
"A [Datastore](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore(class)?view=azure-ml-py) is a place where data can be stored that is then made accessible to a compute either by means of mounting or copying the data to the compute target.\n",
"\n",
"A Datastore can either be backed by an Azure File Storage (default) or by an Azure Blob Storage.\n",
"\n",
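As a sketch (the datastore name and account values below are placeholders), a Blob-backed datastore can be registered once and then retrieved by name:

```python
from azureml.core import Datastore

# Register an Azure Blob container as a datastore (placeholder account values).
blob_datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="my_blob_datastore",
    container_name="mycontainer",
    account_name="<storage-account-name>",
    account_key="<storage-account-key>")

# Later, retrieve it by name.
blob_datastore = Datastore.get(ws, "my_blob_datastore")
```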
@@ -322,7 +333,7 @@
"                         script_name=\"train.py\",\n",
"                         compute_target=aml_compute,\n",
"                         source_directory=project_folder,\n",
"                         allow_reuse=False)\n",
"                         allow_reuse=True)\n",
"print(\"Step1 created\")"
]
},
@@ -375,7 +386,7 @@
"### Build the pipeline\n",
"Once we have the steps (or steps collection), we can build the [pipeline](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py). By default, all these steps will run in **parallel** once we submit the pipeline for a run.\n",
"\n",
"A pipeline is created with a list of steps and a workspace. Submit a pipeline using [submit](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment%28class%29?view=azure-ml-py#submit). When submit is called, a [PipelineRun](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinerun?view=azure-ml-py) is created, which in turn creates [StepRun](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.steprun?view=azure-ml-py) objects for each step in the workflow."
"A pipeline is created with a list of steps and a workspace. Submit a pipeline using [submit](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment(class)?view=azure-ml-py#submit-config--tags-none----kwargs-). When submit is called, a [PipelineRun](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinerun?view=azure-ml-py) is created, which in turn creates [StepRun](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.steprun?view=azure-ml-py) objects for each step in the workflow."
]
},
{
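A minimal sketch of that flow (assuming `step1`, `step2`, and `step3` are defined as in the earlier cells):

```python
# Build the pipeline from the steps, then submit it as an experiment run.
pipeline1 = Pipeline(workspace=ws, steps=[step1, step2, step3])
pipeline_run1 = Experiment(ws, 'Hello_World1').submit(pipeline1)
```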
@@ -403,7 +414,7 @@
"metadata": {},
"source": [
"### Validate the pipeline\n",
"You have the option to [validate](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py#validate) the pipeline prior to submitting it for a run. The platform runs validation steps, such as checking for circular dependencies and parameter checks, even if you do not explicitly call the validate method."
"You have the option to [validate](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py#validate--) the pipeline prior to submitting it for a run. The platform runs validation steps, such as checking for circular dependencies and parameter checks, even if you do not explicitly call the validate method."
]
},
{
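A minimal sketch of explicit validation (using `pipeline1` from the cells above; the returned list of issues is an assumption about the API's output shape):

```python
# Surfaces issues such as circular dependencies or bad parameters before submission.
errors = pipeline1.validate()
print("Validation complete, issues found:", errors)
```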
@@ -436,7 +447,7 @@
"# continue_on_node_failure=False,\n",
"# regenerate_outputs=False)\n",
"\n",
"pipeline_run1 = Experiment(ws, 'Hello_World1').submit(pipeline1, regenerate_outputs=True)\n",
"pipeline_run1 = Experiment(ws, 'Hello_World1').submit(pipeline1, regenerate_outputs=False)\n",
"print(\"Pipeline is submitted for execution\")"
]
},
@@ -517,7 +528,7 @@
"## Running a few steps in sequence\n",
"Now let's see how we can run a few steps in sequence. We already have three steps defined earlier; let's *reuse* those steps for this part, as in the sketch after this cell.\n",
"\n",
"We will reuse step1, step2, and step3, but build the pipeline in such a way that step3 is chained after step2, and step2 after step1. Note that there is no explicit data dependency between these steps, but steps can still be made dependent by using the [run_after](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.builder.pipelinestep?view=azure-ml-py#run-after) construct."
"We will reuse step1, step2, and step3, but build the pipeline in such a way that step3 is chained after step2, and step2 after step1. Note that there is no explicit data dependency between these steps, but steps can still be made dependent by using the [run_after](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.builder.pipelinestep?view=azure-ml-py#run-after-step-) construct."
]
},
{

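A sketch of the sequencing described above, using `run_after` on the previously defined steps:

```python
# Chain the steps: step2 runs after step1, and step3 after step2.
step2.run_after(step1)
step3.run_after(step2)

# Listing only the last step is enough; its upstream dependencies are included automatically.
pipeline2 = Pipeline(workspace=ws, steps=[step3])
```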
@@ -0,0 +1,359 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Azure Machine Learning Pipeline with AzureBatchStep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook demonstrates the use of AzureBatchStep in an Azure Machine Learning Pipeline.\n",
"An AzureBatchStep submits a job to an Azure Batch compute to run a simple Windows executable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Azure Machine Learning and Pipeline SDK-specific Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.core\n",
"from azureml.core import Workspace, Experiment\n",
"from azureml.core.compute import ComputeTarget, BatchCompute\n",
"from azureml.core.datastore import Datastore\n",
"from azureml.data.data_reference import DataReference\n",
"from azureml.exceptions import ComputeTargetException\n",
"from azureml.pipeline.core import Pipeline, PipelineData\n",
"from azureml.pipeline.steps import AzureBatchStep\n",
"\n",
"import os\n",
"from os import path\n",
"from tempfile import mkdtemp\n",
"\n",
"\n",
"# Check core SDK version number\n",
"print(\"SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Workspace"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Initialize a workspace object from persisted configuration. Make sure the config file is present at .\\config.json.\n",
"\n",
"If you don't have a config.json file, please go through the configuration notebook located here:\n",
"https://github.com/Azure/MachineLearningNotebooks.\n",
"\n",
"This sets you up with a working config file that has information on your workspace, subscription id, etc."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Attach Batch Compute to Workspace"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To submit jobs to the Azure Batch service, you must attach your Azure Batch account to the workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"batch_compute_name = 'mybatchcompute' # Name to associate with new compute in workspace\n",
"\n",
"# Batch account details needed to attach as compute to workspace\n",
"batch_account_name = \"<batch_account_name>\" # Name of the Batch account\n",
"batch_resource_group = \"<batch_resource_group>\" # Name of the resource group which contains this account\n",
"\n",
"try:\n",
"    # check if the compute is already attached\n",
"    batch_compute = BatchCompute(ws, batch_compute_name)\n",
"except ComputeTargetException:\n",
"    print('Attaching Batch compute...')\n",
"    provisioning_config = BatchCompute.attach_configuration(resource_group=batch_resource_group, account_name=batch_account_name)\n",
"    batch_compute = ComputeTarget.attach(ws, batch_compute_name, provisioning_config)\n",
"    batch_compute.wait_for_completion()\n",
"    print(\"Provisioning state:{}\".format(batch_compute.provisioning_state))\n",
"    print(\"Provisioning errors:{}\".format(batch_compute.provisioning_errors))\n",
"\n",
"print(\"Using Batch compute:{}\".format(batch_compute.cluster_resource_id))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup DataStore"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Blob storage associated with the workspace\n",
"# The following call GETS the Azure Blob Store associated with your workspace.\n",
"# Note that workspaceblobstore is **the name of this store and CANNOT BE CHANGED and must be used as is**\n",
"default_blob_store = Datastore(ws, \"workspaceblobstore\")\n",
"print(\"Blobstore name: {}\".format(default_blob_store.name))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup Input and Output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this example we will upload a file to the provided DataStore. These are some helper methods to achieve that."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def create_local_file(content, file_name):\n",
"    # create a file in a local temporary directory\n",
"    temp_dir = mkdtemp()\n",
"    with open(path.join(temp_dir, file_name), 'w') as f:\n",
"        f.write(content)\n",
"    return temp_dir\n",
"\n",
"\n",
"def upload_file_to_datastore(datastore, path, content):\n",
"    dir = create_local_file(content=content, file_name=\"temp.file\")\n",
"    datastore.upload(src_dir=dir, target_path=path, overwrite=True, show_progress=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we associate the input DataReference with an existing file in the provided DataStore. Feel free to upload the file of your choice manually, or use the *upload_file_to_datastore* helper above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"testdata_path = \"testdata.txt\"\n",
"\n",
"upload_file_to_datastore(datastore=default_blob_store,\n",
"                         path=testdata_path,\n",
"                         content=\"This is the content of the file\")\n",
"\n",
"testdata = DataReference(datastore=default_blob_store,\n",
"                         path_on_datastore=testdata_path,\n",
"                         data_reference_name=\"input\")\n",
"\n",
"outputdata = PipelineData(name=\"output\", datastore=default_blob_store)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup AzureBatch Job Binaries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Azure Batch can run a task within the job, and here we put a simple .cmd file to be executed. Feel free to put any binaries in the folder, or modify the .cmd file as needed; they will be uploaded once we create the AzureBatch step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"binaries_folder = \"azurebatch/job_binaries\"\n",
"if not os.path.isdir(binaries_folder):\n",
"    os.makedirs(binaries_folder)\n",
"\n",
"file_name = \"azurebatch.cmd\"\n",
"with open(path.join(binaries_folder, file_name), 'w') as f:\n",
"    f.write(\"copy \\\"%1\\\" \\\"%2\\\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create an AzureBatchStep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"AzureBatchStep is used to submit a job to the attached Azure Batch compute.\n",
"- **name:** Name of the step\n",
"- **pool_id:** Name of the pool; it can be an existing pool, or one that will be created when the job is submitted\n",
"- **inputs:** List of inputs (input port bindings) that will be processed by the job\n",
"- **outputs:** List of outputs (output port bindings) the job will create\n",
"- **executable:** The command/executable that will run as part of the job\n",
"- **arguments:** Arguments for the executable. They can be plain strings, inputs, outputs or parameters\n",
"- **compute_target:** The BatchCompute target where the job will run\n",
"- **source_directory:** The local folder that contains the module binaries, executable, assemblies etc.\n",
"\n",
"Optional parameters:\n",
"\n",
"- **create_pool:** Boolean flag to indicate whether to create the pool before running the job\n",
"- **delete_batch_job_after_finish:** Boolean flag to indicate whether to delete the job from the Batch account after it's finished\n",
"- **delete_batch_pool_after_finish:** Boolean flag to indicate whether to delete the pool after the job finishes\n",
"- **is_positive_exit_code_failure:** Boolean flag to indicate whether the job fails if the task exits with a positive exit code\n",
"- **vm_image_urn:** If create_pool is true and the VM uses VirtualMachineConfiguration. \n",
"    Value format: 'urn:publisher:offer:sku'. \n",
"    Example: urn:MicrosoftWindowsServer:WindowsServer:2012-R2-Datacenter \n",
"    For more details: \n",
"    https://docs.microsoft.com/en-us/azure/virtual-machines/windows/cli-ps-findimage#table-of-commonly-used-windows-images and \n",
"    https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage#find-specific-images\n",
"- **run_task_as_admin:** Boolean flag to indicate if the task should run with Admin privileges\n",
"- **target_compute_nodes:** Assumes create_pool is true; indicates how many compute nodes will be added to the pool\n",
"- **vm_size:** If create_pool is true, indicates the virtual machine size of the compute nodes\n",
"- **allow_reuse:** Whether the module should reuse previous results when run with the same settings/inputs\n",
"- **version:** A version tag to denote a change in functionality for the module"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"step = AzureBatchStep(\n",
"    name=\"Azure Batch Job\",\n",
"    pool_id=\"MyPoolName\", # Replace this with the pool name of your choice\n",
"    inputs=[testdata],\n",
"    outputs=[outputdata],\n",
"    executable=\"azurebatch.cmd\",\n",
"    arguments=[testdata, outputdata],\n",
"    compute_target=batch_compute,\n",
"    source_directory=binaries_folder,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build and Submit the Pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline = Pipeline(workspace=ws, steps=[step])\n",
"pipeline_run = Experiment(ws, 'azurebatch_experiment').submit(pipeline)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize the Running Pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(pipeline_run).show()"
]
}
],
"metadata": {
"authors": [
{
"name": "diray"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@@ -0,0 +1,396 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
|
||||
"Licensed under the MIT License."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Azure Machine Learning Pipeline with HyperDriveStep\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"This notebook is used to demonstrate the use of HyperDriveStep in AML Pipeline.\n",
|
||||
"\n",
|
||||
"## Azure Machine Learning and Pipeline SDK-specific imports\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import shutil\n",
|
||||
"import urllib\n",
|
||||
"from azureml.core import Experiment\n",
|
||||
"from azureml.core.datastore import Datastore\n",
|
||||
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
|
||||
"from azureml.exceptions import ComputeTargetException\n",
|
||||
"from azureml.data.data_reference import DataReference\n",
|
||||
"from azureml.pipeline.steps import HyperDriveStep\n",
|
||||
"from azureml.pipeline.core import Pipeline\n",
|
||||
"from azureml.train.dnn import TensorFlow\n",
|
||||
"from azureml.train.hyperdrive import *\n",
|
||||
"\n",
|
||||
"# Check core SDK version number\n",
|
||||
"print(\"SDK version:\", azureml.core.VERSION)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Initialize workspace\n",
|
||||
"\n",
|
||||
"Initialize a workspace object from persisted configuration. Make sure the config file is present at .\\config.json"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ws = Workspace.from_config()\n",
|
||||
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Create an Azure ML experiment\n",
|
||||
"Let's create an experiment named \"tf-mnist\" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"script_folder = './tf-mnist'\n",
|
||||
"os.makedirs(script_folder, exist_ok=True)\n",
|
||||
"\n",
|
||||
"exp = Experiment(workspace=ws, name='tf-mnist')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Download MNIST dataset\n",
|
||||
"In order to train on the MNIST dataset we will first need to download it from Yan LeCun's web site directly and save them in a `data` folder locally."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"os.makedirs('./data/mnist', exist_ok=True)\n",
|
||||
"\n",
|
||||
"urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz', filename = './data/mnist/train-images.gz')\n",
|
||||
"urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz', filename = './data/mnist/train-labels.gz')\n",
|
||||
"urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', filename = './data/mnist/test-images.gz')\n",
|
||||
"urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', filename = './data/mnist/test-labels.gz')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Upload MNIST dataset to blob datastore \n",
|
||||
"A [datastore](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data) is a place where data can be stored that is then made accessible to a Run either by means of mounting or copying the data to the compute target. A datastore can either be backed by an Azure Blob Storage or and Azure File Share (ADLS will be supported in the future). In the next step, we will use Azure Blob Storage and upload the training and test set into the Azure Blob datastore, which we will then later be mount on a Batch AI cluster for training."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ds = Datastore(workspace=ws, name=\"MyBlobDatastore\")\n",
|
||||
"ds.upload(src_dir='./data/mnist', target_path='mnist', overwrite=True, show_progress=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Retrieve or create a Azure Machine Learning compute\n",
|
||||
"Azure Machine Learning Compute is a service for provisioning and managing clusters of Azure virtual machines for running machine learning workloads. Let's create a new Azure Machine Learning Compute in the current workspace, if it doesn't already exist. We will then run the training script on this compute target.\n",
|
||||
"\n",
|
||||
"If we could not find the compute with the given name in the previous cell, then we will create a new compute here. We will create an Azure Machine Learning Compute containing **STANDARD_D2_V2 CPU VMs**. This process is broken down into the following steps:\n",
|
||||
"\n",
|
||||
"1. Create the configuration\n",
|
||||
"2. Create the Azure Machine Learning compute\n",
|
||||
"\n",
|
||||
"**This process will take about 3 minutes and is providing only sparse output in the process. Please make sure to wait until the call returns before moving to the next cell.**\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"cluster_name = \"aml-compute\"\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
|
||||
" print('Found existing compute target {}.'.format(cluster_name))\n",
|
||||
"except ComputeTargetException:\n",
|
||||
" print('Creating a new compute target...')\n",
|
||||
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_NC6\",\n",
|
||||
" max_nodes=4)\n",
|
||||
"\n",
|
||||
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
|
||||
" compute_target.wait_for_completion(show_output=True, timeout_in_minutes=20)\n",
|
||||
"\n",
|
||||
"print(\"Azure Machine Learning Compute attached\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Copy the training files into the script folder\n",
|
||||
"The TensorFlow training script is already created for you. You can simply copy it into the script folder, together with the utility library used to load compressed data file into numpy array."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# the training logic is in the tf_mnist.py file.\n",
|
||||
"shutil.copy('./tf_mnist.py', script_folder)\n",
|
||||
"\n",
|
||||
"# the utils.py just helps loading data from the downloaded MNIST dataset into numpy arrays.\n",
|
||||
"shutil.copy('./utils.py', script_folder)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Create TensorFlow estimator\n",
|
||||
"Next, we construct an `azureml.train.dnn.TensorFlow` estimator object, use the Batch AI cluster as compute target, and pass the mount-point of the datastore to the training code as a parameter.\n",
|
||||
"The TensorFlow estimator is providing a simple way of launching a TensorFlow training job on a compute target. It will automatically provide a docker image that has TensorFlow installed -- if additional pip or conda packages are required, their names can be passed in via the `pip_packages` and `conda_packages` arguments and they will be included in the resulting docker."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"est = TensorFlow(source_directory=script_folder, \n",
|
||||
" compute_target=compute_target,\n",
|
||||
" entry_script='tf_mnist.py', \n",
|
||||
" use_gpu=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Intelligent hyperparameter tuning\n",
|
||||
"We have trained the model with one set of hyperparameters, now let's how we can do hyperparameter tuning by launching multiple runs on the cluster. First let's define the parameter space using random sampling.\n",
|
||||
"\n",
|
||||
"In this example we will use random sampling to try different configuration sets of hyperparameters to maximize our primary metric, the best validation accuracy (`validation_acc`)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ps = RandomParameterSampling(\n",
|
||||
" {\n",
|
||||
" '--batch-size': choice(25, 50, 100),\n",
|
||||
" '--first-layer-neurons': choice(10, 50, 200, 300, 500),\n",
|
||||
" '--second-layer-neurons': choice(10, 50, 200, 500),\n",
|
||||
" '--learning-rate': loguniform(-6, -1)\n",
|
||||
" }\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we will define an early termnination policy. The `BanditPolicy` basically states to check the job every 2 iterations. If the primary metric (defined later) falls outside of the top 10% range, Azure ML terminate the job. This saves us from continuing to explore hyperparameters that don't show promise of helping reach our target metric.\n",
|
||||
"\n",
|
||||
"Refer [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-tune-hyperparameters#specify-an-early-termination-policy) for more information on the BanditPolicy and other policies available."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"early_termination_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we are ready to configure a run configuration object, and specify the primary metric `validation_acc` that's recorded in your training runs. If you go back to visit the training script, you will notice that this value is being logged after every epoch (a full batch set). We also want to tell the service that we are looking to maximizing this value. We also set the number of samples to 20, and maximal concurrent job to 4, which is the same as the number of nodes in our computer cluster."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"hd_config = HyperDriveRunConfig(estimator=est, \n",
|
||||
" hyperparameter_sampling=ps,\n",
|
||||
" policy=early_termination_policy,\n",
|
||||
" primary_metric_name='validation_acc', \n",
|
||||
" primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, \n",
|
||||
" max_total_runs=1,\n",
|
||||
" max_concurrent_runs=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Add HyperDrive as a step of pipeline\n",
|
||||
"\n",
|
||||
"Let's setup a data reference for inputs of hyperdrive step."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"data_folder = DataReference(\n",
|
||||
" datastore=ds,\n",
|
||||
" data_reference_name=\"mnist_data\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### HyperDriveStep\n",
|
||||
"HyperDriveStep can be used to run HyperDrive job as a step in pipeline.\n",
|
||||
"- **name:** Name of the step\n",
|
||||
"- **hyperdrive_run_config:** A HyperDriveRunConfig that defines the configuration for this HyperDrive run\n",
|
||||
"- **estimator_entry_script_arguments:** List of command-line arguments for estimator entry script\n",
|
||||
"- **inputs:** List of input port bindings\n",
|
||||
"- **outputs:** List of output port bindings\n",
|
||||
"- **metrics_output:** Optional value specifying the location to store HyperDrive run metrics as a JSON file\n",
|
||||
"- **allow_reuse:** whether to allow reuse\n",
|
||||
"- **version:** version\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"hd_step = HyperDriveStep(\n",
|
||||
" name=\"hyperdrive_module\",\n",
|
||||
" hyperdrive_run_config=hd_config,\n",
|
||||
" estimator_entry_script_arguments=['--data-folder', data_folder],\n",
|
||||
" inputs=[data_folder])"
|
||||
]
|
||||
},
|
||||
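The cell above wires up only the required arguments. As a hedged sketch (not part of the original notebook), the optional `metrics_output` port from the parameter list could be bound to a `PipelineData` object so the HyperDrive metrics land as a JSON file on the workspace's default datastore; the variable names below are illustrative.

```python
from azureml.pipeline.core import PipelineData

# illustrative name; stored on the workspace default datastore
metrics_data = PipelineData(name="metrics_data",
                            datastore=ws.get_default_datastore())

hd_step_with_metrics = HyperDriveStep(
    name="hyperdrive_module",
    hyperdrive_run_config=hd_config,
    estimator_entry_script_arguments=['--data-folder', data_folder],
    inputs=[data_folder],
    metrics_output=metrics_data)  # optional: HyperDrive metrics as JSON
```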
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Build the experiment"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"pipeline = Pipeline(workspace=ws, steps=[hd_step])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Submit the experiment "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"pipeline_run = Experiment(ws, 'Hyperdrive_Test').submit(pipeline)\n",
|
||||
"pipeline_run.wait_for_completion()"
|
||||
]
|
||||
},
|
||||
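Once the pipeline has completed, the HyperDrive child run can be inspected. A hedged sketch, assuming the step name `hyperdrive_module` used above and the `HyperDriveStepRun` wrapper from `azureml.pipeline.steps`:

```python
from azureml.pipeline.steps import HyperDriveStepRun

# locate the step run by the name given to the HyperDriveStep above
step_run = pipeline_run.find_step_run("hyperdrive_module")[0]
hd_step_run = HyperDriveStepRun(step_run=step_run)

# best child run according to the primary metric (validation_acc)
best_run = hd_step_run.get_best_run_by_primary_metric()
print("Best run:", best_run.id)
print(best_run.get_metrics())
```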
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### View Run Details"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.widgets import RunDetails\n",
|
||||
"RunDetails(pipeline_run).show()"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "sonnyp"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.7"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -306,11 +306,11 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.authentication import AzureCliAuthentication\n",
|
||||
"from azureml.core.authentication import InteractiveLoginAuthentication\n",
|
||||
"import requests\n",
|
||||
"\n",
|
||||
"cli_auth = AzureCliAuthentication()\n",
|
||||
"aad_token = cli_auth.get_authentication_header()\n",
|
||||
"auth = InteractiveLoginAuthentication()\n",
|
||||
"aad_token = auth.get_authentication_header()\n",
|
||||
"\n",
|
||||
"rest_endpoint1 = published_pipeline1.endpoint\n",
|
||||
"\n",
|
||||
|
||||
@@ -0,0 +1,404 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
|
||||
"Licensed under the MIT License."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# How to Setup a Schedule for a Published Pipeline\n",
|
||||
"In this notebook, we will show you how you can run an already published pipeline on a schedule."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Prerequisites and AML Basics\n",
|
||||
"Make sure you go through the configuration Notebook located at https://github.com/Azure/MachineLearningNotebooks first if you haven't. This sets you up with a working config file that has information on your workspace, subscription id, etc.\n",
|
||||
"\n",
|
||||
"### Initialization Steps"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml.core\n",
|
||||
"from azureml.core import Workspace\n",
|
||||
"\n",
|
||||
"# Check core SDK version number\n",
|
||||
"print(\"SDK version:\", azureml.core.VERSION)\n",
|
||||
"\n",
|
||||
"ws = Workspace.from_config()\n",
|
||||
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Compute Targets\n",
|
||||
"#### Retrieve an already attached Azure Machine Learning Compute"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core import Run, Experiment, Datastore\n",
|
||||
"\n",
|
||||
"from azureml.widgets import RunDetails\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.compute import AmlCompute, ComputeTarget\n",
|
||||
"aml_compute_target = \"aml-compute\"\n",
|
||||
"try:\n",
|
||||
" aml_compute = AmlCompute(ws, aml_compute_target)\n",
|
||||
" print(\"Found existing compute target: {}\".format(aml_compute_target))\n",
|
||||
"except:\n",
|
||||
" print(\"Creating new compute target: {}\".format(aml_compute_target))\n",
|
||||
" \n",
|
||||
" provisioning_config = AmlCompute.provisioning_configuration(vm_size = \"STANDARD_D2_V2\",\n",
|
||||
" min_nodes = 1, \n",
|
||||
" max_nodes = 4) \n",
|
||||
" aml_compute = ComputeTarget.create(ws, aml_compute_target, provisioning_config)\n",
|
||||
" aml_compute.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Build and Publish Pipeline\n",
|
||||
"Build a simple pipeline, publish it and add a schedule to run it."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Define a pipeline step\n",
|
||||
"Define a single step pipeline for demonstration purpose."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.pipeline.steps import PythonScriptStep\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# project folder\n",
|
||||
"project_folder = 'scripts'\n",
|
||||
"\n",
|
||||
"trainStep = PythonScriptStep(\n",
|
||||
" name=\"Training_Step\",\n",
|
||||
" script_name=\"train.py\", \n",
|
||||
" compute_target=aml_compute_target, \n",
|
||||
" source_directory=project_folder\n",
|
||||
")\n",
|
||||
"print(\"TrainStep created\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Build the pipeline"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.pipeline.core import Pipeline\n",
|
||||
"\n",
|
||||
"pipeline1 = Pipeline(workspace=ws, steps=[trainStep])\n",
|
||||
"print (\"Pipeline is built\")\n",
|
||||
"\n",
|
||||
"pipeline1.validate()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Publish the pipeline"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from datetime import datetime\n",
|
||||
"\n",
|
||||
"timenow = datetime.now().strftime('%m-%d-%Y-%H-%M')\n",
|
||||
"\n",
|
||||
"pipeline_name = timenow + \"-Pipeline\"\n",
|
||||
"print(pipeline_name)\n",
|
||||
"\n",
|
||||
"published_pipeline1 = pipeline1.publish(\n",
|
||||
" name=pipeline_name, \n",
|
||||
" description=pipeline_name)\n",
|
||||
"print(\"Newly published pipeline id: {}\".format(published_pipeline1.id))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Schedule Operations\n",
|
||||
"Schedule operations require id of a published pipeline. You can get all published pipelines and do Schedule operations on them, or if you already know the id of the published pipeline, you can use it directly as well.\n",
|
||||
"### Get published pipeline ID"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.pipeline.core import PublishedPipeline\n",
|
||||
"\n",
|
||||
"# You could retrieve all pipelines that are published, or \n",
|
||||
"# just get the published pipeline object that you have the ID for.\n",
|
||||
"\n",
|
||||
"# Get all published pipeline objects in the workspace\n",
|
||||
"all_pub_pipelines = PublishedPipeline.get_all(ws)\n",
|
||||
"\n",
|
||||
"# We will iterate through the list of published pipelines and \n",
|
||||
"# use the last ID in the list for Schelue operations: \n",
|
||||
"print(\"Published pipelines found in the workspace:\")\n",
|
||||
"for pub_pipeline in all_pub_pipelines:\n",
|
||||
" print(pub_pipeline.id)\n",
|
||||
" pub_pipeline_id = pub_pipeline.id\n",
|
||||
"\n",
|
||||
"print(\"Published pipeline id to be used for Schedule operations: {}\".format(pub_pipeline_id))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create a schedule for the pipeline"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.pipeline.core.schedule import ScheduleRecurrence, Schedule\n",
|
||||
"\n",
|
||||
"recurrence = ScheduleRecurrence(frequency=\"Day\", interval=2, hours=[22], minutes=[30]) # Runs every other day at 10:30pm\n",
|
||||
"\n",
|
||||
"schedule = Schedule.create(workspace=ws, name=\"My_Schedule\",\n",
|
||||
" pipeline_id=pub_pipeline_id, \n",
|
||||
" experiment_name='Schedule_Run',\n",
|
||||
" recurrence=recurrence,\n",
|
||||
" wait_for_provisioning=True,\n",
|
||||
" description=\"Schedule Run\")\n",
|
||||
"\n",
|
||||
"# You may want to make sure that the schedule is provisioned properly\n",
|
||||
"# before making any further changes to the schedule\n",
|
||||
"\n",
|
||||
"print(\"Created schedule with id: {}\".format(schedule.id))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Note: Set the `wait_for_provisioning` flag to False if you do not want to wait for the call to provision the schedule in the backend."
|
||||
]
|
||||
},
|
||||
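For reference, `ScheduleRecurrence` supports other recurrence shapes as well. A small hedged sketch (illustrative values; keyword names from `azureml.pipeline.core.schedule`):

```python
# runs every Monday at 9:00am
weekly = ScheduleRecurrence(frequency="Week", interval=1,
                            week_days=["Monday"], hours=[9], minutes=[0])

# runs every 6 hours
hourly = ScheduleRecurrence(frequency="Hour", interval=6)
```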
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Get all schedules for a given pipeline\n",
|
||||
"Once you have the published pipeline ID, then you can get all schedules for that pipeline."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"schedules = Schedule.get_all(ws, pipeline_id=pub_pipeline_id)\n",
|
||||
"\n",
|
||||
"# We will iterate through the list of schedules and \n",
|
||||
"# use the last ID in the list for further operations: \n",
|
||||
"print(\"Found these schedules for the pipeline id {}:\".format(pub_pipeline_id))\n",
|
||||
"for schedule in schedules: \n",
|
||||
" print(schedule.id)\n",
|
||||
" schedule_id = schedule.id\n",
|
||||
"\n",
|
||||
"print(\"Schedule id to be used for schedule operations: {}\".format(schedule_id))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Get all schedules in your workspace\n",
|
||||
"You can also iterate through all schedules in your workspace if needed."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Use active_only=False to get all schedules including disabled schedules\n",
|
||||
"schedules = Schedule.get_all(ws, active_only=True) \n",
|
||||
"print(\"Your workspace has the following schedules set up:\")\n",
|
||||
"for schedule in schedules:\n",
|
||||
" print(\"{} (Published pipeline: {}\".format(schedule.id, schedule.pipeline_id))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Get the schedule"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"fetched_schedule = Schedule.get(ws, schedule_id)\n",
|
||||
"print(\"Using schedule with id: {}\".format(fetched_schedule.id))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Disable the schedule"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Set the wait_for_provisioning flag to False if you do not want to wait \n",
|
||||
"# for the call to provision the schedule in the backend.\n",
|
||||
"fetched_schedule.disable(wait_for_provisioning=True)\n",
|
||||
"fetched_schedule = Schedule.get(ws, schedule_id)\n",
|
||||
"print(\"Disabled schedule {}. New status is: {}\".format(fetched_schedule.id, fetched_schedule.status))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Reactivate the schedule"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Set the wait_for_provisioning flag to False if you do not want to wait \n",
|
||||
"# for the call to provision the schedule in the backend.\n",
|
||||
"fetched_schedule.activate(wait_for_provisioning=True)\n",
|
||||
"fetched_schedule = Schedule.get(ws, schedule_id)\n",
|
||||
"print(\"Activated schedule {}. New status is: {}\".format(fetched_schedule.id, fetched_schedule.status))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Change reccurence of the schedule"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Set the wait_for_provisioning flag to False if you do not want to wait \n",
|
||||
"# for the call to provision the schedule in the backend.\n",
|
||||
"recurrence = ScheduleRecurrence(frequency=\"Hour\", interval=2) # Runs every two hours\n",
|
||||
"\n",
|
||||
"fetched_schedule = Schedule.get(ws, schedule_id)\n",
|
||||
"\n",
|
||||
"fetched_schedule.update(name=\"My_Updated_Schedule\", \n",
|
||||
" description=\"Updated_Schedule_Run\", \n",
|
||||
" status='Active', \n",
|
||||
" wait_for_provisioning=True,\n",
|
||||
" recurrence=recurrence)\n",
|
||||
"\n",
|
||||
"fetched_schedule = Schedule.get_schedule(ws, fetched_schedule.id)\n",
|
||||
"\n",
|
||||
"print(\"Updated schedule:\", fetched_schedule.id, \n",
|
||||
" \"\\nNew name:\", fetched_schedule.name,\n",
|
||||
" \"\\nNew frequency:\", fetched_schedule.recurrence.frequency,\n",
|
||||
" \"\\nNew status:\", fetched_schedule.status)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"authors": [
|
||||
{
|
||||
"name": "diray"
|
||||
}
|
||||
],
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6",
|
||||
"language": "python",
|
||||
"name": "python36"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.7"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -13,7 +13,8 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# AML Pipeline with AdlaStep\n",
|
||||
"This notebook is used to demonstrate the use of AdlaStep in AML Pipeline."
|
||||
"\n",
|
||||
"This notebook is used to demonstrate the use of AdlaStep in AML Pipelines. [AdlaStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.adla_step.adlastep?view=azure-ml-py) is used to run U-SQL scripts using Azure Data Lake Analytics service. "
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -30,13 +31,16 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from msrest.exceptions import HttpOperationError\n",
|
||||
"\n",
|
||||
"import azureml.core\n",
|
||||
"from azureml.exceptions import ComputeTargetException\n",
|
||||
"from azureml.core import Workspace, Experiment\n",
|
||||
"from azureml.pipeline.core import Pipeline, PipelineData\n",
|
||||
"from azureml.pipeline.steps import AdlaStep\n",
|
||||
"from azureml.core.compute import ComputeTarget, AdlaCompute\n",
|
||||
"from azureml.core.datastore import Datastore\n",
|
||||
"from azureml.data.data_reference import DataReference\n",
|
||||
"from azureml.pipeline.core import Pipeline, PipelineData\n",
|
||||
"from azureml.pipeline.steps import AdlaStep\n",
|
||||
"\n",
|
||||
"# Check core SDK version number\n",
|
||||
"print(\"SDK version:\", azureml.core.VERSION)"
|
||||
@@ -65,22 +69,57 @@
|
||||
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Attach ADLA account to workspace\n",
|
||||
"\n",
|
||||
"To submit jobs to Azure Data Lake Analytics service, you must first attach your ADLA account to the workspace. You'll need to provide the account name and resource group of ADLA account to complete this part."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"script_folder = '.'\n",
|
||||
"experiment_name = \"adla_101_experiment\"\n",
|
||||
"ws._initialize_folder(experiment_name=experiment_name, directory=script_folder)"
|
||||
"adla_compute_name = 'testadl' # Name to associate with new compute in workspace\n",
|
||||
"\n",
|
||||
"# ADLA account details needed to attach as compute to workspace\n",
|
||||
"adla_account_name = \"<adla_account_name>\" # Name of the Azure Data Lake Analytics account\n",
|
||||
"adla_resource_group = \"<adla_resource_group>\" # Name of the resource group which contains this account\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" # check if already attached\n",
|
||||
" adla_compute = AdlaCompute(ws, adla_compute_name)\n",
|
||||
"except ComputeTargetException:\n",
|
||||
" print('attaching adla compute...')\n",
|
||||
" attach_config = AdlaCompute.attach_configuration(resource_group=adla_resource_group, account_name=adla_account_name)\n",
|
||||
" adla_compute = ComputeTarget.attach(ws, adla_compute_name, attach_config)\n",
|
||||
" adla_compute.wait_for_completion()\n",
|
||||
"\n",
|
||||
"print(\"Using ADLA compute:{}\".format(adla_compute.cluster_resource_id))\n",
|
||||
"print(\"Provisioning state:{}\".format(adla_compute.provisioning_state))\n",
|
||||
"print(\"Provisioning errors:{}\".format(adla_compute.provisioning_errors))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Register Datastore"
|
||||
"## Register Data Lake Storage as Datastore\n",
|
||||
"\n",
|
||||
"To register Data Lake Storage as Datastore in workspace, you'll need account information like account name, resource group and subscription Id. \n",
|
||||
"\n",
|
||||
"> AdlaStep can only work with data stored in the **default** Data Lake Storage of the Data Lake Analytics account provided above. If the data you need to work with is in a non-default storage, you can use a DataTransferStep to copy the data before training. You can find the default storage by opening your Data Lake Analytics account in Azure portal and then navigating to 'Data sources' item under Settings in the left pane.\n",
|
||||
"\n",
|
||||
"### Grant Azure AD application access to Data Lake Storage\n",
|
||||
"\n",
|
||||
"You'll also need to provide an Active Directory application which can access Data Lake Storage. [This document](https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory) contains step-by-step instructions on how to create an AAD application and assign to Data Lake Storage. Couple of important notes when assigning permissions to AAD app:\n",
|
||||
"\n",
|
||||
"- Access should be provided at root folder level.\n",
|
||||
"- In 'Assign permissions' pane, select Read, Write, and Execute permissions for 'This folder and all children'. Add as 'An access permission entry and a default permission entry' to make sure application can access any new files created in the future."
|
||||
]
|
||||
},
|
||||
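As noted above, data sitting in a non-default store has to be staged into the default Data Lake Storage before the U-SQL step can read it. A hedged sketch of such a copy using `DataTransferStep`, assuming a Data Factory compute (`data_factory_compute`) is attached to the workspace and `adls_datastore` is the store registered in the next cell; the source datastore name and paths are placeholders:

```python
from azureml.pipeline.steps import DataTransferStep

copy_step = DataTransferStep(
    name="copy_to_default_adls",
    source_data_reference=DataReference(
        datastore=Datastore(ws, "MyBlobDatastore"),    # placeholder source store
        data_reference_name="blob_input",
        path_on_datastore="raw/testdata.txt"),         # placeholder path
    destination_data_reference=DataReference(
        datastore=adls_datastore,                      # default ADLS datastore
        data_reference_name="adls_copy",
        path_on_datastore="adla_sample/testdata.txt"),
    compute_target=data_factory_compute)               # an attached DataFactoryCompute
```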
{
|
||||
@@ -89,15 +128,15 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from msrest.exceptions import HttpOperationError\n",
|
||||
"datastore_name = 'MyAdlsDatastore' # Name to associate with data store in workspace\n",
|
||||
"\n",
|
||||
"datastore_name='MyAdlsDatastore'\n",
|
||||
"subscription_id=os.getenv(\"ADL_SUBSCRIPTION_62\", \"<my-subscription-id>\") # subscription id of ADLS account\n",
|
||||
"resource_group=os.getenv(\"ADL_RESOURCE_GROUP_62\", \"<my-resource-group>\") # resource group of ADLS account\n",
|
||||
"store_name=os.getenv(\"ADL_STORENAME_62\", \"<my-datastore-name>\") # ADLS account name\n",
|
||||
"tenant_id=os.getenv(\"ADL_TENANT_62\", \"<my-tenant-id>\") # tenant id of service principal\n",
|
||||
"client_id=os.getenv(\"ADL_CLIENTID_62\", \"<my-client-id>\") # client id of service principal\n",
|
||||
"client_secret=os.getenv(\"ADL_CLIENT_62_SECRET\", \"<my-client-secret>\") # the secret of service principal\n",
|
||||
"# ADLS storage account details needed to register as a Datastore\n",
|
||||
"subscription_id = os.getenv(\"ADL_SUBSCRIPTION_62\", \"<my-subscription-id>\") # subscription id of ADLS account\n",
|
||||
"resource_group = os.getenv(\"ADL_RESOURCE_GROUP_62\", \"<my-resource-group>\") # resource group of ADLS account\n",
|
||||
"store_name = os.getenv(\"ADL_STORENAME_62\", \"<my-datastore-name>\") # ADLS account name\n",
|
||||
"tenant_id = os.getenv(\"ADL_TENANT_62\", \"<my-tenant-id>\") # tenant id of service principal\n",
|
||||
"client_id = os.getenv(\"ADL_CLIENTID_62\", \"<my-client-id>\") # client id of service principal\n",
|
||||
"client_secret = os.getenv(\"ADL_CLIENT_62_SECRET\", \"<my-client-secret>\") # the secret of service principal\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" adls_datastore = Datastore.get(ws, datastore_name)\n",
|
||||
@@ -112,16 +151,16 @@
|
||||
" tenant_id=tenant_id, # tenant id of service principal\n",
|
||||
" client_id=client_id, # client id of service principal\n",
|
||||
" client_secret=client_secret) # the secret of service principal\n",
|
||||
" print(\"registered datastore with name: %s\" % datastore_name)\n"
|
||||
" print(\"registered datastore with name: %s\" % datastore_name)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Create DataReferences and PipelineData\n",
|
||||
"## Setup inputs and outputs\n",
|
||||
"\n",
|
||||
"In the code cell below, replace datastorename with your default datastore name. Copy the file `testdata.txt` (located in the pipeline folder that this notebook is in) to the path on the datastore."
|
||||
"For purpose of this demo, we're going to execute a simple U-SQL script that reads a CSV file and writes portion of content to a new text file. First, let's create our sample input which contains 3 columns: employee Id, name and department Id."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -130,26 +169,51 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"datastorename = \"MyAdlsDatastore\"\n",
|
||||
"# create a folder to store files for our job\n",
|
||||
"sample_folder = \"adla_sample\"\n",
|
||||
"\n",
|
||||
"adls_datastore = Datastore(workspace=ws, name=datastorename)\n",
|
||||
"script_input = DataReference(\n",
|
||||
"if not os.path.isdir(sample_folder):\n",
|
||||
" os.mkdir(sample_folder)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%%writefile $sample_folder/sample_input.csv\n",
|
||||
"1, Noah, 100\n",
|
||||
"3, Liam, 100\n",
|
||||
"4, Emma, 100\n",
|
||||
"5, Jacob, 100\n",
|
||||
"7, Jennie, 100"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Upload this file to Data Lake Storage at location `adla_sample/sample_input.csv` and create a DataReference to refer to this file."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sample_input = DataReference(\n",
|
||||
" datastore=adls_datastore,\n",
|
||||
" data_reference_name=\"script_input\",\n",
|
||||
" path_on_datastore=\"testdata/testdata.txt\")\n",
|
||||
"\n",
|
||||
"script_output = PipelineData(\"script_output\", datastore=adls_datastore)\n",
|
||||
"\n",
|
||||
"print(\"Created Pipeline Data\")"
|
||||
" data_reference_name=\"employee_data\",\n",
|
||||
" path_on_datastore=\"adla_sample/sample_input.csv\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Setup Data Lake Account\n",
|
||||
"\n",
|
||||
"ADLA can only use data that is located in the default data store associated with that ADLA account. Through Azure portal, check the name of the default data store corresponding to the ADLA account you are using below. Replace the value associated with `adla_compute_name` in the code cell below accordingly."
|
||||
"Create PipelineData object to store output produced by AdlaStep."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -158,35 +222,23 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"adla_compute_name = 'testadl' # Replace this with your default compute\n",
|
||||
"\n",
|
||||
"from azureml.core.compute import ComputeTarget, AdlaCompute\n",
|
||||
"\n",
|
||||
"def get_or_create_adla_compute(workspace, compute_name):\n",
|
||||
" try:\n",
|
||||
" return AdlaCompute(workspace, compute_name)\n",
|
||||
" except ComputeTargetException as e:\n",
|
||||
" if 'ComputeTargetNotFound' in e.message:\n",
|
||||
" print('adla compute not found, creating...')\n",
|
||||
" provisioning_config = AdlaCompute.provisioning_configuration()\n",
|
||||
" new_adla_compute = ComputeTarget.create(workspace, compute_name, provisioning_config)\n",
|
||||
" new_adla_compute.wait_for_completion()\n",
|
||||
" return new_adla_compute\n",
|
||||
" else:\n",
|
||||
" raise e\n",
|
||||
" \n",
|
||||
"adla_compute = get_or_create_adla_compute(ws, adla_compute_name)\n",
|
||||
"\n",
|
||||
"# CLI:\n",
|
||||
"# Create: az ml computetarget setup adla -n <name>\n",
|
||||
"# BYOC: az ml computetarget attach adla -n <name> -i <resource-id>"
|
||||
"sample_output = PipelineData(\"sample_output\", datastore=adls_datastore)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once the above code cell completes, run the below to check your ADLA compute status:"
|
||||
"## Write your U-SQL script\n",
|
||||
"\n",
|
||||
"Now let's write a U-Sql script that reads above CSV file and writes the name column to a new file.\n",
|
||||
"\n",
|
||||
"Instead of hard-coding paths in your script, you can use `@@name@@` syntax to refer to inputs, outputs, and parameters.\n",
|
||||
"\n",
|
||||
"- If `name` is the name of an input or output port binding, any occurrences of `@@name@@` in the script are replaced with actual data path of corresponding port binding.\n",
|
||||
"- If `name` matches any key in the `params` dictionary, any occurrences of `@@name@@` will be replaced with corresponding value in the dictionary.\n",
|
||||
"\n",
|
||||
"Note the use of @@ syntax in the below script. Before submitting the job to Data Lake Analytics service, `@@emplyee_data@@` will be replaced with actual path of `sample_input.csv` in Data Lake Storage. Similarly, `@@sample_output@@` will be replaced with a path in Data Lake Storage which will be used to store intermediate output produced by the step."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -195,58 +247,43 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"ADLA compute state:{}\".format(adla_compute.provisioning_state))\n",
|
||||
"print(\"ADLA compute state:{}\".format(adla_compute.provisioning_errors))\n",
|
||||
"print(\"Using ADLA compute:{}\".format(adla_compute.cluster_resource_id))"
|
||||
"%%writefile $sample_folder/sample_script.usql\n",
|
||||
"\n",
|
||||
"// Read employee information from csv file\n",
|
||||
"@employees = \n",
|
||||
" EXTRACT EmpId int, EmpName string, DeptId int\n",
|
||||
" FROM \"@@employee_data@@\"\n",
|
||||
" USING Extractors.Csv();\n",
|
||||
"\n",
|
||||
"// Export employee names to text file\n",
|
||||
"OUTPUT\n",
|
||||
"(\n",
|
||||
" SELECT EmpName\n",
|
||||
" FROM @employees\n",
|
||||
")\n",
|
||||
"TO \"@@sample_output@@\"\n",
|
||||
"USING Outputters.Text();"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Create an AdlaStep"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**AdlaStep** is used to run U-SQL script using Azure Data Lake Analytics.\n",
|
||||
"## Create an AdlaStep\n",
|
||||
"\n",
|
||||
"**[AdlaStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.adla_step.adlastep?view=azure-ml-py)** is used to run U-SQL script using Azure Data Lake Analytics.\n",
|
||||
"\n",
|
||||
"- **name:** Name of module\n",
|
||||
"- **script_name:** name of U-SQL script\n",
|
||||
"- **script_name:** name of U-SQL script file\n",
|
||||
"- **inputs:** List of input port bindings\n",
|
||||
"- **outputs:** List of output port bindings\n",
|
||||
"- **adla_compute:** the ADLA compute to use for this job\n",
|
||||
"- **compute_target:** the ADLA compute to use for this job\n",
|
||||
"- **params:** Dictionary of name-value pairs to pass to U-SQL job *(optional)*\n",
|
||||
"- **degree_of_parallelism:** the degree of parallelism to use for this job *(optional)*\n",
|
||||
"- **priority:** the priority value to use for the current job *(optional)*\n",
|
||||
"- **runtime_version:** the runtime version of the Data Lake Analytics engine *(optional)*\n",
|
||||
"- **root_folder:** folder that contains the script, assemblies etc. *(optional)*\n",
|
||||
"- **hash_paths:** list of paths to hash to detect a change (script file is always hashed) *(optional)*\n",
|
||||
"\n",
|
||||
"### Remarks\n",
|
||||
"\n",
|
||||
"You can use `@@name@@` syntax in your script to refer to inputs, outputs, and params.\n",
|
||||
"\n",
|
||||
"* if `name` is the name of an input or output port binding, any occurences of `@@name@@` in the script\n",
|
||||
"are replaced with actual data path of corresponding port binding.\n",
|
||||
"* if `name` matches any key in `params` dict, any occurences of `@@name@@` will be replaced with\n",
|
||||
"corresponding value in dict.\n",
|
||||
"\n",
|
||||
"#### Sample script\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"@resourcereader =\n",
|
||||
" EXTRACT query string\n",
|
||||
" FROM \"@@script_input@@\"\n",
|
||||
" USING Extractors.Csv();\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"OUTPUT @resourcereader\n",
|
||||
"TO \"@@script_output@@\"\n",
|
||||
"USING Outputters.Csv();\n",
|
||||
"```"
|
||||
"- **source_directory:** folder that contains the script, assemblies etc. *(optional)*\n",
|
||||
"- **hash_paths:** list of paths to hash to detect a change (script file is always hashed) *(optional)*"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -256,10 +293,11 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"adla_step = AdlaStep(\n",
|
||||
" name='adla_script_step',\n",
|
||||
" script_name='test_adla_script.usql',\n",
|
||||
" inputs=[script_input],\n",
|
||||
" outputs=[script_output],\n",
|
||||
" name='extract_employee_names',\n",
|
||||
" script_name='sample_script.usql',\n",
|
||||
" source_directory=sample_folder,\n",
|
||||
" inputs=[sample_input],\n",
|
||||
" outputs=[sample_output],\n",
|
||||
" compute_target=adla_compute)"
|
||||
]
|
||||
},
|
||||
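The step above uses only input and output port bindings. A hedged sketch of the optional arguments, including a `params` entry that would replace any `@@min_dept_id@@` occurrence in the script before submission; the names and values are illustrative:

```python
adla_step_with_params = AdlaStep(
    name='extract_employee_names_filtered',
    script_name='sample_script.usql',
    source_directory=sample_folder,
    inputs=[sample_input],
    outputs=[sample_output],
    params={'min_dept_id': '100'},  # substituted for @@min_dept_id@@ in the script
    degree_of_parallelism=2,        # optional: parallelism for the U-SQL job
    priority=100,                   # optional: job priority
    compute_target=adla_compute)
```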
@@ -276,13 +314,9 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"pipeline = Pipeline(\n",
|
||||
" description=\"adla_102\",\n",
|
||||
" workspace=ws, \n",
|
||||
" steps=[adla_step],\n",
|
||||
" default_source_directory=script_folder)\n",
|
||||
"pipeline = Pipeline(workspace=ws, steps=[adla_step])\n",
|
||||
"\n",
|
||||
"pipeline_run = Experiment(ws, experiment_name).submit(pipeline)\n",
|
||||
"pipeline_run = Experiment(ws, 'adla_sample').submit(pipeline)\n",
|
||||
"pipeline_run.wait_for_completion()"
|
||||
]
|
||||
},
|
||||
@@ -302,39 +336,6 @@
|
||||
"from azureml.widgets import RunDetails\n",
|
||||
"RunDetails(pipeline_run).show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Examine the run\n",
|
||||
"You can cycle through the node_run objects and examine job logs, stdout, and stderr of each of the steps."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"step_runs = pipeline_run.get_children()\n",
|
||||
"for step_run in step_runs:\n",
|
||||
" status = step_run.get_status()\n",
|
||||
" print('node', step_run.name, 'status:', status)\n",
|
||||
" if status == \"Failed\":\n",
|
||||
" joblog = step_run.get_job_log()\n",
|
||||
" print('job log:', joblog)\n",
|
||||
" stdout_log = step_run.get_stdout_log()\n",
|
||||
" print('stdout log:', stdout_log)\n",
|
||||
" stderr_log = step_run.get_stderr_log()\n",
|
||||
" print('stderr log:', stderr_log)\n",
|
||||
" with open(\"logs-\" + step_run.name + \".txt\", \"w\") as f:\n",
|
||||
" f.write(joblog)\n",
|
||||
" print(\"Job log written to logs-\"+ step_run.name + \".txt\")\n",
|
||||
" if status == \"Finished\":\n",
|
||||
" stdout_log = step_run.get_stdout_log()\n",
|
||||
" print('stdout log:', stdout_log)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
|
||||
@@ -176,7 +176,7 @@
|
||||
"### Type of Data Access\n",
|
||||
"Databricks allows to interact with Azure Blob and ADLS in two ways.\n",
|
||||
"- **Direct Access**: Databricks allows you to interact with Azure Blob or ADLS URIs directly. The input or output URIs will be mapped to a Databricks widget param in the Databricks notebook.\n",
|
||||
"- **Mouting**: You will be supplied with additional parameters and secrets that will enable you to mount your ADLS or Azure Blob input or output location in your Databricks notebook."
|
||||
"- **Mounting**: You will be supplied with additional parameters and secrets that will enable you to mount your ADLS or Azure Blob input or output location in your Databricks notebook."
|
||||
]
|
||||
},
|
||||
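A hedged sketch of the mounting approach: the snippet below would run inside the Databricks notebook itself (`dbutils` only exists there), and the container, storage account, and secret scope names are placeholders:

```python
# mount an Azure Blob container into DBFS for the lifetime of the cluster
dbutils.fs.mount(
    source="wasbs://mycontainer@mystorageaccount.blob.core.windows.net",
    mount_point="/mnt/input_data",
    extra_configs={
        "fs.azure.account.key.mystorageaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-key")})
```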
{
|
||||
@@ -348,6 +348,7 @@
|
||||
"\n",
|
||||
"## Use runconfig to specify library dependencies\n",
|
||||
"You can use a runconfig to specify the library dependencies for your cluster in Databricks. The runconfig will contain a databricks section as follows:\n",
|
||||
"\n",
|
||||
"```yaml\n",
|
||||
"environment:\n",
|
||||
"# Databricks details\n",
|
||||
@@ -365,14 +366,21 @@
|
||||
" repo: ''\n",
|
||||
"# List of RCran libraries\n",
|
||||
" rcranLibraries:\n",
|
||||
" - package: ada\n",
|
||||
" -\n",
|
||||
"# Coordinates.\n",
|
||||
" package: ada\n",
|
||||
"# Repo\n",
|
||||
" repo: http://cran.us.r-project.org\n",
|
||||
"# List of JAR libraries\n",
|
||||
" jarLibraries:\n",
|
||||
" - library: dbfs:/mnt/libraries/library.jar\n",
|
||||
" -\n",
|
||||
"# Coordinates.\n",
|
||||
" library: dbfs:/mnt/libraries/library.jar\n",
|
||||
"# List of Egg libraries\n",
|
||||
" eggLibraries:\n",
|
||||
" - library: dbfs:/mnt/libraries/library.egg\n",
|
||||
" -\n",
|
||||
"# Coordinates.\n",
|
||||
" library: dbfs:/mnt/libraries/library.egg\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"You can then create a RunConfiguration object using this file and pass it as the runconfig parameter to DatabricksStep.\n",
|
||||
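A minimal sketch of that wiring, assuming the YAML above is saved as a runconfig file the SDK can locate (the file name and variables below are placeholders; `RunConfiguration.load` is from `azureml.core.runconfig`):

```python
from azureml.core.runconfig import RunConfiguration

# loads <path>/.azureml/databricks_libraries.runconfig (placeholder name)
runconfig = RunConfiguration.load(path=".", name="databricks_libraries")

dbNbWithLibrariesStep = DatabricksStep(
    name="DBNotebookWithLibraries",
    notebook_path=notebook_path,            # defined as in the demo below
    run_name="DB_Notebook_with_libraries",
    compute_target=databricks_compute,
    runconfig=runconfig)
```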
@@ -409,7 +417,7 @@
|
||||
" notebook_params={'myparam': 'testparam'},\n",
|
||||
" run_name='DB_Notebook_demo',\n",
|
||||
" compute_target=databricks_compute,\n",
|
||||
" allow_reuse=False\n",
|
||||
" allow_reuse=True\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
@@ -453,14 +461,16 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 2. Running a Python script that is already added in DBFS\n",
|
||||
"To run a Python script that is already uploaded to DBFS, follow the instructions below. You will first upload the Python script to DBFS using the [CLI](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html).\n",
|
||||
"### 2. Running a Python script from DBFS\n",
|
||||
"This shows how to run a Python script in DBFS. \n",
|
||||
"\n",
|
||||
"The commented out code in the below cell assumes that you have uploaded `train-db-dbfs.py` to the root folder in DBFS. You can upload `train-db-dbfs.py` to the root folder in DBFS using this commandline so you can use `python_script_path = \"dbfs:/train-db-dbfs.py\"`:\n",
|
||||
"To complete this, you will need to first upload the Python script in your local machine to DBFS using the [CLI](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html). The CLI command is given below:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"dbfs cp ./train-db-dbfs.py dbfs:/train-db-dbfs.py\n",
|
||||
"```"
|
||||
"```\n",
|
||||
"\n",
|
||||
"The code in the below cell assumes that you have completed the previous step of uploading the script `train-db-dbfs.py` to the root folder in DBFS."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -469,7 +479,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"python_script_path = \"dbfs:/train-db-dbfs.py\"\n",
|
||||
"python_script_path = os.getenv(\"DATABRICKS_PYTHON_SCRIPT_PATH\", \"<my-databricks-python-script-path>\") # Databricks python script path\n",
|
||||
"\n",
|
||||
"dbPythonInDbfsStep = DatabricksStep(\n",
|
||||
" name=\"DBPythonInDBFS\",\n",
|
||||
@@ -479,7 +489,7 @@
|
||||
" python_script_params={'--input_data'},\n",
|
||||
" run_name='DB_Python_demo',\n",
|
||||
" compute_target=databricks_compute,\n",
|
||||
" allow_reuse=False\n",
|
||||
" allow_reuse=True\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
@@ -548,7 +558,7 @@
|
||||
" source_directory=source_directory,\n",
|
||||
" run_name='DB_Python_Local_demo',\n",
|
||||
" compute_target=databricks_compute,\n",
|
||||
" allow_reuse=False\n",
|
||||
" allow_reuse=True\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
@@ -609,7 +619,7 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"main_jar_class_name = \"com.microsoft.aeva.Main\"\n",
|
||||
"jar_library_dbfs_path = \"dbfs:/train-db-dbfs.jar\"\n",
|
||||
"jar_library_dbfs_path = os.getenv(\"DATABRICKS_JAR_LIB_PATH\", \"<my-databricks-jar-lib-path>\") # Databricks jar library path\n",
|
||||
"\n",
|
||||
"dbJarInDbfsStep = DatabricksStep(\n",
|
||||
" name=\"DBJarInDBFS\",\n",
|
||||
@@ -620,7 +630,7 @@
|
||||
" run_name='DB_JAR_demo',\n",
|
||||
" jar_libraries=[JarLibrary(jar_library_dbfs_path)],\n",
|
||||
" compute_target=databricks_compute,\n",
|
||||
" allow_reuse=False\n",
|
||||
" allow_reuse=True\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
|
||||
@@ -0,0 +1 @@
|
||||
Test1
|
||||
@@ -0,0 +1,106 @@
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
import numpy as np
|
||||
import argparse
|
||||
import os
|
||||
import tensorflow as tf
|
||||
|
||||
from azureml.core import Run
|
||||
from utils import load_data
|
||||
|
||||
print("TensorFlow version:", tf.VERSION)
|
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
|
||||
parser.add_argument('--batch-size', type=int, dest='batch_size', default=50, help='mini batch size for training')
|
||||
parser.add_argument('--first-layer-neurons', type=int, dest='n_hidden_1', default=100,
|
||||
help='# of neurons in the first layer')
|
||||
parser.add_argument('--second-layer-neurons', type=int, dest='n_hidden_2', default=100,
|
||||
help='# of neurons in the second layer')
|
||||
parser.add_argument('--learning-rate', type=float, dest='learning_rate', default=0.01, help='learning rate')
|
||||
args = parser.parse_args()
|
||||
|
||||
data_folder = os.path.join(args.data_folder, 'mnist')
|
||||
|
||||
print('training dataset is stored here:', data_folder)
|
||||
|
||||
X_train = load_data(os.path.join(data_folder, 'train-images.gz'), False) / 255.0
|
||||
X_test = load_data(os.path.join(data_folder, 'test-images.gz'), False) / 255.0
|
||||
|
||||
y_train = load_data(os.path.join(data_folder, 'train-labels.gz'), True).reshape(-1)
|
||||
y_test = load_data(os.path.join(data_folder, 'test-labels.gz'), True).reshape(-1)
|
||||
|
||||
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape, sep='\n')
|
||||
training_set_size = X_train.shape[0]
|
||||
|
||||
n_inputs = 28 * 28
|
||||
n_h1 = args.n_hidden_1
|
||||
n_h2 = args.n_hidden_2
|
||||
n_outputs = 10
|
||||
learning_rate = args.learning_rate
|
||||
n_epochs = 50
|
||||
batch_size = args.batch_size
|
||||
|
||||
with tf.name_scope('network'):
|
||||
# construct the DNN
|
||||
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
|
||||
y = tf.placeholder(tf.int64, shape=(None), name='y')
|
||||
h1 = tf.layers.dense(X, n_h1, activation=tf.nn.relu, name='h1')
|
||||
h2 = tf.layers.dense(h1, n_h2, activation=tf.nn.relu, name='h2')
|
||||
output = tf.layers.dense(h2, n_outputs, name='output')
|
||||
|
||||
with tf.name_scope('train'):
|
||||
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=output)
|
||||
loss = tf.reduce_mean(cross_entropy, name='loss')
|
||||
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
|
||||
train_op = optimizer.minimize(loss)
|
||||
|
||||
with tf.name_scope('eval'):
|
||||
correct = tf.nn.in_top_k(output, y, 1)
|
||||
acc_op = tf.reduce_mean(tf.cast(correct, tf.float32))
|
||||
|
||||
init = tf.global_variables_initializer()
|
||||
saver = tf.train.Saver()
|
||||
|
||||
# start an Azure ML run
|
||||
run = Run.get_context()
|
||||
|
||||
with tf.Session() as sess:
|
||||
init.run()
|
||||
for epoch in range(n_epochs):
|
||||
|
||||
# randomly shuffle training set
|
||||
indices = np.random.permutation(training_set_size)
|
||||
X_train = X_train[indices]
|
||||
y_train = y_train[indices]
|
||||
|
||||
# batch index
|
||||
b_start = 0
|
||||
b_end = b_start + batch_size
|
||||
for _ in range(training_set_size // batch_size):
|
||||
# get a batch
|
||||
X_batch, y_batch = X_train[b_start: b_end], y_train[b_start: b_end]
|
||||
|
||||
# update batch index for the next batch
|
||||
b_start = b_start + batch_size
|
||||
b_end = min(b_start + batch_size, training_set_size)
|
||||
|
||||
# train
|
||||
sess.run(train_op, feed_dict={X: X_batch, y: y_batch})
|
||||
# evaluate training set
|
||||
acc_train = acc_op.eval(feed_dict={X: X_batch, y: y_batch})
|
||||
# evaluate validation set
|
||||
acc_val = acc_op.eval(feed_dict={X: X_test, y: y_test})
|
||||
|
||||
# log accuracies
|
||||
run.log('training_acc', np.float(acc_train))
|
||||
run.log('validation_acc', np.float(acc_val))
|
||||
print(epoch, '-- Training accuracy:', acc_train, '\b Validation accuracy:', acc_val)
|
||||
y_hat = np.argmax(output.eval(feed_dict={X: X_test}), axis=1)
|
||||
|
||||
run.log('final_acc', np.float(acc_val))
|
||||
|
||||
os.makedirs('./outputs/model', exist_ok=True)
|
||||
# files saved in the "./outputs" folder are automatically uploaded into run history
|
||||
saver.save(sess, './outputs/model/mnist-tf.model')
|
||||
@@ -0,0 +1,27 @@
|
||||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License.
|
||||
|
||||
import gzip
|
||||
import numpy as np
|
||||
import struct
|
||||
|
||||
|
||||
# load compressed MNIST gz files and return numpy arrays
|
||||
def load_data(filename, label=False):
|
||||
with gzip.open(filename) as gz:
|
||||
struct.unpack('I', gz.read(4))
|
||||
n_items = struct.unpack('>I', gz.read(4))
|
||||
if not label:
|
||||
n_rows = struct.unpack('>I', gz.read(4))[0]
|
||||
n_cols = struct.unpack('>I', gz.read(4))[0]
|
||||
res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
|
||||
res = res.reshape(n_items[0], n_rows * n_cols)
|
||||
else:
|
||||
res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
|
||||
res = res.reshape(n_items[0], 1)
|
||||
return res
|
||||
|
||||
|
||||
# one-hot encode a 1-D array
|
||||
def one_hot_encode(array, num_of_classes):
|
||||
return np.eye(num_of_classes)[array.reshape(-1)]
|
||||
@@ -489,11 +489,11 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "from azureml.core.authentication import AzureCliAuthentication\n",
+    "from azureml.core.authentication import InteractiveLoginAuthentication\n",
     "import requests\n",
     "\n",
-    "cli_auth = AzureCliAuthentication()\n",
-    "aad_token = cli_auth.get_authentication_header()"
+    "auth = InteractiveLoginAuthentication()\n",
+    "aad_token = auth.get_authentication_header()\n"
    ]
   },
   {
@@ -465,11 +465,11 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "from azureml.core.authentication import AzureCliAuthentication\n",
+    "from azureml.core.authentication import InteractiveLoginAuthentication\n",
     "import requests\n",
     "\n",
-    "cli_auth = AzureCliAuthentication()\n",
-    "aad_token = cli_auth.get_authentication_header()"
+    "auth = InteractiveLoginAuthentication()\n",
+    "aad_token = auth.get_authentication_header()\n"
    ]
   },
   {
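In both hunks above, `get_authentication_header()` returns a dictionary of the form `{'Authorization': 'Bearer <token>'}`, which is why the cells also import `requests`: the header can be passed directly to an HTTP call against an Azure ML service endpoint. A hedged sketch (the URI is a placeholder, not a real endpoint):

```python
import requests

# 'aad_token' is the header dict returned by auth.get_authentication_header()
response = requests.get('https://<my-aml-service-endpoint>', headers=aad_token)
print(response.status_code)
```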
@@ -79,6 +79,31 @@
     "check that you used the correct login and entered the correct subscription ID."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In some cases, you may see a version of the error message containing the text: ```All the subscriptions that you have access to = []```\n",
+    "\n",
+    "In such a case, you may have to specify the tenant ID of the Azure Active Directory you're using. An example would be accessing a subscription as a guest to a tenant that is not your default. You specify the tenant by explicitly instantiating _InteractiveLoginAuthentication_ with the tenant ID as an argument ([see instructions on how to obtain the tenant ID](#get-tenant-id))."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from azureml.core.authentication import InteractiveLoginAuthentication\n",
+    "\n",
+    "interactive_auth = InteractiveLoginAuthentication(tenant_id=\"my-tenant-id\")\n",
+    "\n",
+    "ws = Workspace(subscription_id=\"my-subscription-id\",\n",
+    "               resource_group=\"my-ml-rg\",\n",
+    "               workspace_name=\"my-ml-workspace\",\n",
+    "               auth=interactive_auth)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -114,9 +139,7 @@
    "source": [
     "### Service Principal Authentication\n",
     "\n",
-    "When setting up a machine learning workflow as an automated process, we recommed using Service Principal Authentication.\n",
-    "\n",
-    "This approach decouples the authentication from any specific user login, and allows a managed access control.\n",
+    "When setting up a machine learning workflow as an automated process, we recommend using Service Principal Authentication. This approach decouples the authentication from any specific user login, and allows managed access control.\n",
     "\n",
     "Note that you must have administrator privileges over the Azure subscription to complete these steps.\n",
     "\n",
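For reference, a minimal sketch of Service Principal Authentication with the azureml-sdk; the tenant ID, application ID, and secret are placeholders from an Azure AD app registration, and parameter names may vary slightly across SDK versions:

```python
from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

# all three values are placeholders obtained from the Azure AD app registration
sp_auth = ServicePrincipalAuthentication(tenant_id="my-tenant-id",
                                         service_principal_id="my-application-id",
                                         service_principal_password="my-application-secret")

ws = Workspace(subscription_id="my-subscription-id",
               resource_group="my-ml-rg",
               workspace_name="my-ml-workspace",
               auth=sp_auth)
```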
@@ -142,6 +165,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "<a id =\"get-tenant-id\"></a>\n",
+    "\n",
     "Also, you need to obtain the tenant ID of your Azure subscription. Go back to **Azure Active Directory**, select **Properties** and copy _Directory ID_.\n",
     "\n",
     ""
@@ -298,7 +298,7 @@
     " process_count_per_node=1,\n",
     " distributed_backend='mpi',\n",
     " pip_packages=['cntk-gpu==2.6'],\n",
-    " custom_docker_base_image='microsoft/mmlspark:gpu-0.12',\n",
+    " custom_docker_image='microsoft/mmlspark:gpu-0.12',\n",
     " use_gpu=True)"
    ]
   },
@@ -306,7 +306,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We would like to train our model using a [pre-built Docker container](https://hub.docker.com/r/microsoft/mmlspark/). To do so, specify the name of the docker image to the argument `custom_docker_base_image`. You can only provide images available in public docker repositories such as Docker Hub using this argument. To use an image from a private docker repository, use the constructor's `environment_definition` parameter instead. Finally, we provide the `cntk` package to `pip_packages` to install CNTK 2.6 on our custom image.\n",
+    "We would like to train our model using a [pre-built Docker container](https://hub.docker.com/r/microsoft/mmlspark/). To do so, specify the name of the docker image to the argument `custom_docker_image`. Finally, we provide the `cntk` package to `pip_packages` to install CNTK 2.6 on our custom image.\n",
     "\n",
     "The above code specifies that we will run our training script on `2` nodes, with one worker per node. In order to run distributed CNTK, which uses MPI, you must provide the argument `distributed_backend='mpi'`."
    ]
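To put the renamed argument in context, here is a sketch of the full estimator as the notebook presumably constructs it after this change; `cluster` and the entry script name are assumptions, while the image, pip package, and MPI settings come straight from the hunks above:

```python
from azureml.train.estimator import Estimator

# 'cluster' and 'cntk_distr_mnist.py' are assumed names; the rest mirrors the diff
est = Estimator(source_directory='.',
                compute_target=cluster,
                entry_script='cntk_distr_mnist.py',
                node_count=2,
                process_count_per_node=1,
                distributed_backend='mpi',
                pip_packages=['cntk-gpu==2.6'],
                custom_docker_image='microsoft/mmlspark:gpu-0.12',
                use_gpu=True)
```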
@@ -291,29 +291,27 @@
    "outputs": [],
    "source": [
     "# use a custom Docker image\n",
-    "from azureml.core import RunConfiguration\n",
-    "\n",
-    "rc = RunConfiguration()\n",
-    "rc.environment.docker.enabled = True\n",
+    "from azureml.core.runconfig import ContainerRegistry\n",
     "\n",
     "# this is an image available in Docker Hub\n",
-    "rc.environment.docker.base_image = 'continuumio/miniconda3'\n",
+    "image_name = 'continuumio/miniconda3'\n",
     "\n",
     "# you can also point to an image in a private ACR\n",
-    "#rc.environment.docker.base_image = \"mycustomimage:1.0\"\n",
-    "#rc.environment.docker.base_image_registry.address = \"myregistry.azurecr.io\"\n",
-    "#rc.environment.docker.base_image_registry.username = \"username\"\n",
-    "#rc.environment.docker.base_image_registry.password = \"password\"\n",
+    "image_registry_details = ContainerRegistry()\n",
+    "image_registry_details.address = \"myregistry.azurecr.io\"\n",
+    "image_registry_details.username = \"username\"\n",
+    "image_registry_details.password = \"password\"\n",
     "\n",
     "# don't let the system build a new conda environment\n",
-    "rc.environment.python.user_managed_dependencies = True\n",
-    "# point to an existing python environment instead\n",
-    "rc.environment.python.interpreter_path = '/opt/conda/bin/python'\n",
+    "user_managed_dependencies = True\n",
     "\n",
     "# submit to a local Docker container. if you don't have Docker engine running locally, you can set compute_target to cpu_cluster.\n",
     "est = Estimator(source_directory='.', compute_target='local', \n",
     "                entry_script='dummy_train.py',\n",
-    "                environment_definition=rc.environment)\n",
+    "                custom_docker_image=image_name,\n",
+    "                image_registry_details=image_registry_details,\n",
+    "                user_managed=user_managed_dependencies\n",
+    "               )\n",
     "\n",
     "run = exp.submit(est)\n",
     "RunDetails(run).show()"
150  how-to-use-azureml/training/train-in-spark/iris.csv  Normal file
@@ -0,0 +1,150 @@
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
278  how-to-use-azureml/training/train-in-spark/train-in-spark.ipynb  Normal file
@@ -0,0 +1,278 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
    "\n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 05. Train in Spark\n",
    "* Create Workspace\n",
    "* Create Experiment\n",
    "* Copy relevant files to the script folder\n",
    "* Configure and Run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "Make sure you go through the [configuration notebook](../../../configuration.ipynb) first if you haven't."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check core SDK version number\n",
    "import azureml.core\n",
    "\n",
    "print(\"SDK version:\", azureml.core.VERSION)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialize Workspace\n",
    "\n",
    "Initialize a workspace object from persisted configuration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core import Workspace\n",
    "\n",
    "ws = Workspace.from_config()\n",
    "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Experiment\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment_name = 'train-on-spark'\n",
    "\n",
    "from azureml.core import Experiment\n",
    "exp = Experiment(workspace=ws, name=experiment_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## View `train-spark.py`\n",
    "\n",
    "For convenience, we created a training script for you. It is printed below as text, but you can also run `%pfile ./train-spark.py` in a cell to show the file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('train-spark.py', 'r') as training_script:\n",
    "    print(training_script.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configure & Run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note:** You can use Docker-based execution to run the Spark job on your local computer or a remote VM. Please see the `train-in-remote-vm` notebook for an example of how to configure and run in Docker mode in a VM. Make sure you choose a Docker image that has Spark installed, such as `azureml.core.runconfig.DEFAULT_MMLSPARK_CPU_IMAGE`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Attach an HDI cluster\n",
    "Here we will use an actual Spark cluster, HDInsight for Spark, to run this job. To use an HDI compute target:\n",
    " 1. Create an HDInsight Spark cluster in Azure. Here are some [quick instructions](https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql). Make sure you use the Ubuntu flavor, NOT CentOS.\n",
    " 2. Enter the IP address, username, and password below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core.compute import ComputeTarget, HDInsightCompute\n",
    "from azureml.exceptions import ComputeTargetException\n",
    "import os\n",
    "\n",
    "try:\n",
    "    # if you want to connect using an SSH key instead of username/password, you can provide the parameters private_key_file and private_key_passphrase\n",
    "    attach_config = HDInsightCompute.attach_configuration(address=os.environ.get('hdiservername', '<my_hdi_cluster_name>-ssh.azurehdinsight.net'),\n",
    "                                                          ssh_port=22,\n",
    "                                                          username=os.environ.get('hdiusername', '<ssh_username>'),\n",
    "                                                          password=os.environ.get('hdipassword', '<my_password>'))\n",
    "    hdi_compute = ComputeTarget.attach(workspace=ws,\n",
    "                                       name='myhdi',\n",
    "                                       attach_configuration=attach_config)\n",
    "\n",
    "except ComputeTargetException as e:\n",
    "    print(\"Caught = {}\".format(e.message))\n",
    "\n",
    "hdi_compute.wait_for_completion(show_output=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configure the HDI run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Configure an execution using the HDInsight cluster with a conda environment that has `numpy`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core.runconfig import RunConfiguration\n",
    "from azureml.core.conda_dependencies import CondaDependencies\n",
    "\n",
    "# use the pyspark framework\n",
    "hdi_run_config = RunConfiguration(framework=\"pyspark\")\n",
    "\n",
    "# set compute target to the HDI cluster\n",
    "hdi_run_config.target = hdi_compute.name\n",
    "\n",
    "# specify a CondaDependencies object to ask the system to install numpy\n",
    "cd = CondaDependencies()\n",
    "cd.add_conda_package('numpy')\n",
    "hdi_run_config.environment.python.conda_dependencies = cd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Submit the script to HDI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core import ScriptRunConfig\n",
    "\n",
    "script_run_config = ScriptRunConfig(source_directory='.',\n",
    "                                    script='train-spark.py',\n",
    "                                    run_config=hdi_run_config)\n",
    "run = exp.submit(config=script_run_config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Monitor the run using a Jupyter widget."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.widgets import RunDetails\n",
    "RunDetails(run).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the run has successfully finished, you can check the metrics logged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get all metrics logged in the run\n",
    "metrics = run.get_metrics()\n",
    "print(metrics)"
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "aashishb"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "python36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
97  how-to-use-azureml/training/train-in-spark/train-spark.py  Normal file
@@ -0,0 +1,97 @@
# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license.

import numpy as np
import pyspark
import os
import urllib
import sys

from pyspark.sql.functions import *
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.ml.feature import *
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

from azureml.core.run import Run

# initialize logger
run = Run.get_context()

# start Spark session
spark = pyspark.sql.SparkSession.builder.appName('Iris').getOrCreate()

# print runtime versions
print('****************')
print('Python version: {}'.format(sys.version))
print('Spark version: {}'.format(spark.version))
print('****************')

# load iris.csv into Spark dataframe
schema = StructType([
    StructField("sepal-length", DoubleType()),
    StructField("sepal-width", DoubleType()),
    StructField("petal-length", DoubleType()),
    StructField("petal-width", DoubleType()),
    StructField("class", StringType())
])

data = spark.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .schema(schema) \
    .load("iris.csv")

print("First 10 rows of Iris dataset:")
data.show(10)

# vectorize all numerical columns into a single feature column
feature_cols = data.columns[:-1]
assembler = pyspark.ml.feature.VectorAssembler(
    inputCols=feature_cols, outputCol='features')
data = assembler.transform(data)

# convert text labels into indices
data = data.select(['features', 'class'])
label_indexer = pyspark.ml.feature.StringIndexer(
    inputCol='class', outputCol='label').fit(data)
data = label_indexer.transform(data)

# only select the features and label column
data = data.select(['features', 'label'])
print("Ready for machine learning")
data.show(10)

# change the regularization rate and you will likely get a different accuracy.
reg = 0.01
# load regularization rate from argument if present
if len(sys.argv) > 1:
    reg = float(sys.argv[1])

# log regularization rate
run.log("Regularization Rate", reg)

# use Logistic Regression to train on the training set
train, test = data.randomSplit([0.70, 0.30])
lr = pyspark.ml.classification.LogisticRegression(regParam=reg)
model = lr.fit(train)

# predict on the test set
prediction = model.transform(test)
print("Prediction")
prediction.show(10)

# evaluate the accuracy of the model using the test set
evaluator = pyspark.ml.evaluation.MulticlassClassificationEvaluator(
    metricName='accuracy')
accuracy = evaluator.evaluate(prediction)

print()
print('#####################################')
print('Regularization rate is {}'.format(reg))
print("Accuracy is {}".format(accuracy))
print('#####################################')
print()

# log accuracy
run.log('Accuracy', accuracy)
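Since the script reads an optional regularization rate from `sys.argv[1]`, a run can override the default `0.01` by passing script arguments at submission time. A sketch reusing the notebook's `ScriptRunConfig` (the value `0.1` is only an example):

```python
from azureml.core import ScriptRunConfig

# pass a custom regularization rate as the first positional script argument
script_run_config = ScriptRunConfig(source_directory='.',
                                    script='train-spark.py',
                                    arguments=['0.1'],
                                    run_config=hdi_run_config)
run = exp.submit(config=script_run_config)
```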
@@ -81,7 +81,7 @@
    "from azureml.core import Experiment, Workspace\n",
    "\n",
    "# Check core SDK version number\n",
-   "print(\"This notebook was created using version 1.0.10 of the Azure ML SDK\")\n",
+   "print(\"This notebook was created using version 1.0.2 of the Azure ML SDK\")\n",
    "print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")\n",
    "print(\"\")\n",
    "\n",
@@ -15,6 +15,6 @@ As a pre-requisite, run the [configuration Notebook](../configuration.ipynb) not
 
 ### Regression
 * [Part 1](regression-part1-data-prep.ipynb): Prepare the data using Azure Machine Learning Data Prep SDK.
-* [Part 2](regression-part1-automated-ml.ipynb): Train a model using Automated Machine Learning.
+* [Part 2](regression-part2-automated-ml.ipynb): Train a model using Automated Machine Learning.
 
 Also find quickstarts and how-tos on the [official documentation site for Azure Machine Learning service](https://docs.microsoft.com/en-us/azure/machine-learning/service/).