Compare commits


30 Commits

Author SHA1 Message Date
amlrelsa-ms
883e4a4c59 update samples from Release-92 as a part of SDK release 2021-03-10 01:48:54 +00:00
Harneet Virk
e90826b331 Merge pull request #1384 from yunjie-hub/master
Add synapse sample notebooks
2021-03-09 12:40:33 -08:00
yunjie-hub
ac04172f6d Add files via upload 2021-03-09 12:38:23 -08:00
Harneet Virk
8c0000beb4 Merge pull request #1382 from Azure/release_update/Release-91
update samples from Release-91 as a part of  SDK release
2021-03-08 21:43:10 -08:00
amlrelsa-ms
35287ab0d8 update samples from Release-91 as a part of SDK release 2021-03-09 05:36:08 +00:00
Harneet Virk
3fe4f8b038 Merge pull request #1375 from Azure/release_update/Release-90
update samples from Release-90 as a part of  SDK release
2021-03-01 09:15:14 -08:00
amlrelsa-ms
1722678469 update samples from Release-90 as a part of SDK release 2021-03-01 17:13:25 +00:00
Harneet Virk
17da7e8706 Merge pull request #1364 from Azure/release_update/Release-89
update samples from Release-89 as a part of  SDK release
2021-02-23 17:27:27 -08:00
amlrelsa-ms
d2e7213ff3 update samples from Release-89 as a part of SDK release 2021-02-24 01:26:17 +00:00
mx-iao
882cb76e8a Merge pull request #1361 from Azure/minxia/distr-pytorch
Update distributed pytorch example
2021-02-23 12:07:20 -08:00
mx-iao
37f37a46c1 Delete pytorch_mnist.py 2021-02-23 11:19:39 -08:00
mx-iao
0cd1412421 Delete distributed-pytorch-with-nccl-gloo.ipynb 2021-02-23 11:19:33 -08:00
mx-iao
c3ae9f00f6 Add files via upload 2021-02-23 11:19:02 -08:00
mx-iao
11b02c650c Rename how-to-use-azureml/ml-frameworks/pytorch/distributed-pytorch-with-distributeddataparallel.ipynb to how-to-use-azureml/ml-frameworks/pytorch/distributed-pytorch-with-distributeddataparallel/distributed-pytorch-with-distributeddataparallel.ipynb 2021-02-23 11:18:43 -08:00
mx-iao
606048c71f Add files via upload 2021-02-23 11:18:10 -08:00
Harneet Virk
cb1c354d44 Merge pull request #1353 from Azure/release_update/Release-88
update samples from Release-88 as a part of  SDK release 1.23.0
2021-02-22 11:49:02 -08:00
amlrelsa-ms
c868fff5a2 update samples from Release-88 as a part of SDK release 2021-02-22 19:23:04 +00:00
Harneet Virk
bc4e6611c4 Merge pull request #1342 from Azure/release_update/Release-87
update samples from Release-87 as a part of  SDK release
2021-02-16 18:43:49 -08:00
amlrelsa-ms
0a58881b70 update samples from Release-87 as a part of SDK release 2021-02-17 02:13:51 +00:00
Harneet Virk
2544e85c5f Merge pull request #1333 from Azure/release_update/Release-85
SDK release 1.22.0
2021-02-10 07:59:22 -08:00
amlrelsa-ms
7fe27501d1 update samples from Release-85 as a part of SDK release 2021-02-10 15:27:28 +00:00
Harneet Virk
624c46e7f9 Merge pull request #1321 from Azure/release_update/Release-84
update samples from Release-84 as a part of  SDK release
2021-02-05 19:10:29 -08:00
amlrelsa-ms
40fbadd85c update samples from Release-84 as a part of SDK release 2021-02-06 03:09:22 +00:00
Harneet Virk
0c1fc25542 Merge pull request #1317 from Azure/release_update/Release-83
update samples from Release-83 as a part of  SDK release
2021-02-03 14:31:31 -08:00
amlrelsa-ms
e8e1357229 update samples from Release-83 as a part of SDK release 2021-02-03 05:22:32 +00:00
Harneet Virk
ad44f8fa2b Merge pull request #1313 from zronaghi/contrib-rapids
Update RAPIDS README
2021-01-29 10:33:47 -08:00
Zahra Ronaghi
ee63e759f0 Update RAPIDS README 2021-01-28 22:19:27 -06:00
Harneet Virk
b81d97ebbf Merge pull request #1303 from Azure/release_update/Release-82
update samples from Release-82 as a part of  SDK release 1.21.0
2021-01-25 11:09:12 -08:00
amlrelsa-ms
249fb6bbb5 update samples from Release-82 as a part of SDK release 2021-01-25 19:03:14 +00:00
Harneet Virk
cda1f3e4cf Merge pull request #1289 from Azure/release_update/Release-81
update samples from Release-81 as a part of  SDK release
2021-01-11 12:52:48 -07:00
104 changed files with 2311 additions and 9869 deletions

View File

@@ -103,7 +103,7 @@
"source": [
"import azureml.core\n",
"\n",
"print(\"This notebook was created using version 1.20.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.24.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},

View File

@@ -21,8 +21,8 @@ dependencies:
- pip:
# Required packages for AzureML execution, history, and data preparation.
- azureml-widgets~=1.20.0
- azureml-widgets~=1.24.0
- pytorch-transformers==1.0.0
- spacy==2.1.8
- https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz
- -r https://automlcesdkdataresources.blob.core.windows.net/validated-requirements/1.20.0/validated_win32_requirements.txt [--no-deps]
- -r https://automlcesdkdataresources.blob.core.windows.net/validated-requirements/1.24.0/validated_win32_requirements.txt [--no-deps]

View File

@@ -21,9 +21,8 @@ dependencies:
- pip:
# Required packages for AzureML execution, history, and data preparation.
- azureml-widgets~=1.20.0
- azureml-widgets~=1.24.0
- pytorch-transformers==1.0.0
- spacy==2.1.8
- https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz
- -r https://automlcesdkdataresources.blob.core.windows.net/validated-requirements/1.20.0/validated_linux_requirements.txt [--no-deps]
- -r https://automlcesdkdataresources.blob.core.windows.net/validated-requirements/1.24.0/validated_linux_requirements.txt [--no-deps]

View File

@@ -22,8 +22,8 @@ dependencies:
- pip:
# Required packages for AzureML execution, history, and data preparation.
- azureml-widgets~=1.20.0
- azureml-widgets~=1.24.0
- pytorch-transformers==1.0.0
- spacy==2.1.8
- https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz
- -r https://automlcesdkdataresources.blob.core.windows.net/validated-requirements/1.20.0/validated_darwin_requirements.txt [--no-deps]
- https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz
- -r https://automlcesdkdataresources.blob.core.windows.net/validated-requirements/1.24.0/validated_darwin_requirements.txt [--no-deps]

View File

@@ -105,7 +105,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.20.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.24.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},

View File

@@ -93,7 +93,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.20.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.24.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},

View File

@@ -96,7 +96,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.20.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.24.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},

View File

@@ -81,7 +81,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.20.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.24.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},

View File

@@ -5,7 +5,7 @@ set options=%3
set PIP_NO_WARN_SCRIPT_LOCATION=0
IF "%conda_env_name%"=="" SET conda_env_name="azure_automl_experimental"
IF "%automl_env_file%"=="" SET automl_env_file="automl_env.yml"
IF "%automl_env_file%"=="" SET automl_env_file="automl_thin_client_env.yml"
IF NOT EXIST %automl_env_file% GOTO YmlMissing

View File

@@ -12,7 +12,7 @@ fi
if [ "$AUTOML_ENV_FILE" == "" ]
then
AUTOML_ENV_FILE="automl_env.yml"
AUTOML_ENV_FILE="automl_thin_client_env.yml"
fi
if [ ! -f $AUTOML_ENV_FILE ]; then

View File

@@ -12,7 +12,7 @@ fi
if [ "$AUTOML_ENV_FILE" == "" ]
then
AUTOML_ENV_FILE="automl_env.yml"
AUTOML_ENV_FILE="automl_thin_client_env_mac.yml"
fi
if [ ! -f $AUTOML_ENV_FILE ]; then

View File

@@ -5,16 +5,14 @@ dependencies:
- pip<=19.3.1
- python>=3.5.2,<3.8
- nb_conda
- matplotlib==2.1.0
- numpy~=1.18.0
- cython
- urllib3<1.24
- scikit-learn==0.22.1
- pandas==0.25.1
- PyJWT < 2.0.0
- numpy==1.18.5
- pip:
# Required packages for AzureML execution, history, and data preparation.
- azureml-defaults
- azureml-sdk
- azureml-widgets
- azureml-explain-model
- pandas

View File

@@ -6,16 +6,14 @@ dependencies:
- nomkl
- python>=3.5.2,<3.8
- nb_conda
- matplotlib==2.1.0
- numpy~=1.18.0
- cython
- urllib3<1.24
- scikit-learn==0.22.1
- pandas==0.25.1
- PyJWT < 2.0.0
- numpy==1.18.5
- pip:
# Required packages for AzureML execution, history, and data preparation.
- azureml-defaults
- azureml-sdk
- azureml-widgets
- azureml-explain-model
- pandas

View File

@@ -67,11 +67,8 @@
"source": [
"import logging\n",
"\n",
"from matplotlib import pyplot as plt\n",
"import json\n",
"import numpy as np\n",
"import pandas as pd\n",
" \n",
"\n",
"\n",
"import azureml.core\n",
"from azureml.core.experiment import Experiment\n",
@@ -93,7 +90,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.20.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.24.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
@@ -116,9 +113,7 @@
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Run History Name'] = experiment_name\n",
"pd.set_option('display.max_colwidth', -1)\n",
"outputDf = pd.DataFrame(data = output, index = [''])\n",
"outputDf.T"
"output"
]
},
{
@@ -199,7 +194,6 @@
"|**n_cross_validations**|Number of cross validation splits.|\n",
"|**training_data**|(sparse) array-like, shape = [n_samples, n_features]|\n",
"|**label_column_name**|(sparse) array-like, shape = [n_samples, ], targets values.|\n",
"|**scenario**|We need to set this parameter to 'Latest' to enable some experimental features. This parameter should not be set outside of this experimental notebook.|\n",
"\n",
"**_You can find more information about primary metrics_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#primary-metric)"
]
@@ -228,7 +222,6 @@
" compute_target = compute_target,\n",
" training_data = train_data,\n",
" label_column_name = label,\n",
" scenario='Latest',\n",
" **automl_settings\n",
" )"
]
@@ -276,34 +269,13 @@
"## Results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Widget for Monitoring Runs\n",
"\n",
"The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n",
"\n",
"**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(remote_run).show() "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run.wait_for_completion()"
"remote_run.wait_for_completion(show_output=True)"
]
},
{
@@ -368,18 +340,12 @@
"metadata": {},
"outputs": [],
"source": [
"# preview the first 3 rows of the dataset\n",
"\n",
"test_data = test_data.to_pandas_dataframe()\n",
"y_test = test_data['ERP'].fillna(0)\n",
"test_data = test_data.drop('ERP', 1)\n",
"test_data = test_data.fillna(0)\n",
"y_test = test_data.keep_columns('ERP')\n",
"test_data = test_data.drop_columns('ERP')\n",
"\n",
"\n",
"train_data = train_data.to_pandas_dataframe()\n",
"y_train = train_data['ERP'].fillna(0)\n",
"train_data = train_data.drop('ERP', 1)\n",
"train_data = train_data.fillna(0)\n"
"y_train = train_data.keep_columns('ERP')\n",
"train_data = train_data.drop_columns('ERP')\n"
]
},
{
@@ -397,7 +363,16 @@
"outputs": [],
"source": [
"from azureml.train.automl.model_proxy import ModelProxy\n",
"best_model_proxy = ModelProxy(best_run)"
"best_model_proxy = ModelProxy(best_run)\n",
"y_pred_train = best_model_proxy.predict(train_data)\n",
"y_pred_test = best_model_proxy.predict(test_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exploring results"
]
},
{
@@ -406,60 +381,15 @@
"metadata": {},
"outputs": [],
"source": [
"y_pred_train = best_model_proxy.predict(train_data).to_pandas_dataframe().values.flatten()\n",
"y_pred_train = y_pred_train.to_pandas_dataframe().values.flatten()\n",
"y_train = y_train.to_pandas_dataframe().values.flatten()\n",
"y_residual_train = y_train - y_pred_train\n",
"\n",
"y_pred_test = best_model_proxy.predict(test_data).to_pandas_dataframe().values.flatten()\n",
"y_residual_test = y_test - y_pred_test"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"\n",
"# Set up a multi-plot chart.\n",
"f, (a0, a1) = plt.subplots(1, 2, gridspec_kw = {'width_ratios':[1, 1], 'wspace':0, 'hspace': 0})\n",
"f.suptitle('Regression Residual Values', fontsize = 18)\n",
"f.set_figheight(6)\n",
"f.set_figwidth(16)\n",
"\n",
"# Plot residual values of training set.\n",
"a0.axis([0, 360, -100, 100])\n",
"a0.plot(y_residual_train, 'bo', alpha = 0.5)\n",
"a0.plot([-10,360],[0,0], 'r-', lw = 3)\n",
"a0.text(16,170,'RMSE = {0:.2f}'.format(np.sqrt(mean_squared_error(y_train, y_pred_train))), fontsize = 12)\n",
"a0.text(16,140,'R2 score = {0:.2f}'.format(r2_score(y_train, y_pred_train)),fontsize = 12)\n",
"a0.set_xlabel('Training samples', fontsize = 12)\n",
"a0.set_ylabel('Residual Values', fontsize = 12)\n",
"\n",
"# Plot residual values of test set.\n",
"a1.axis([0, 90, -100, 100])\n",
"a1.plot(y_residual_test, 'bo', alpha = 0.5)\n",
"a1.plot([-10,360],[0,0], 'r-', lw = 3)\n",
"a1.text(5,170,'RMSE = {0:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred_test))), fontsize = 12)\n",
"a1.text(5,140,'R2 score = {0:.2f}'.format(r2_score(y_test, y_pred_test)),fontsize = 12)\n",
"a1.set_xlabel('Test samples', fontsize = 12)\n",
"a1.set_yticklabels([])\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"test_pred = plt.scatter(y_test, y_pred_test, color='')\n",
"test_test = plt.scatter(y_test, y_test, color='g')\n",
"plt.legend((test_pred, test_test), ('prediction', 'truth'), loc='upper left', fontsize=8)\n",
"plt.show()"
"y_pred_test = y_pred_test.to_pandas_dataframe().values.flatten()\n",
"y_test = y_test.to_pandas_dataframe().values.flatten()\n",
"y_residual_test = y_test - y_pred_test\n",
"print(y_residual_train)\n",
"print(y_residual_test)"
]
},
{

View File

@@ -113,7 +113,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.20.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.24.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
@@ -218,6 +218,8 @@
"\n",
"**Time series identifier columns** are identified by values of the columns listed `time_series_id_column_names`, for example \"store\" and \"item\" if your data has multiple time series of sales, one series for each combination of store and item sold.\n",
"\n",
"**Forecast frequency (freq)** This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.\n",
"\n",
"This dataset has only one time series. Please see the [orange juice notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-orange-juice-sales) for an example of a multi-time series dataset."
]
},

View File

@@ -87,7 +87,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.20.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.24.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
@@ -205,6 +205,10 @@
"outputs": [],
"source": [
"dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'dataset/bike-no.csv')]).with_timestamp_columns(fine_grain_timestamp=time_column_name) \n",
"\n",
"# Drop the columns 'casual' and 'registered' as these columns are a breakdown of the total and therefore a leak.\n",
"dataset = dataset.drop_columns(columns=['casual', 'registered'])\n",
"\n",
"dataset.take(5).to_pandas_dataframe().reset_index(drop=True)"
]
},
@@ -251,7 +255,7 @@
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
"|**country_or_region_for_holidays**|The country/region used to generate holiday features. These should be ISO 3166 two-letter country/region codes (i.e. 'US', 'GB').|\n",
"|**target_lags**|The target_lags specifies how far back we will construct the lags of the target variable.|\n",
"|**drop_column_names**|Name(s) of columns to drop prior to modeling|"
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
]
},
{
@@ -314,8 +318,7 @@
" time_column_name=time_column_name,\n",
" forecast_horizon=forecast_horizon,\n",
" country_or_region_for_holidays='US', # set country_or_region will trigger holiday featurizer\n",
" target_lags='auto', # use heuristic based lag setting \n",
" drop_column_names=['casual', 'registered'] # these columns are a breakdown of the total and therefore a leak\n",
" target_lags='auto' # use heuristic based lag setting \n",
")\n",
"\n",
"automl_config = AutoMLConfig(task='forecasting', \n",

View File

@@ -97,7 +97,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.20.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.24.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
@@ -301,7 +301,8 @@
"|Property|Description|\n",
"|-|-|\n",
"|**time_column_name**|The name of your time column.|\n",
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|"
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
]
},
{

View File

@@ -94,7 +94,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.20.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.24.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
@@ -302,7 +302,8 @@
"* Set early termination to True, so the iterations through the models will stop when no improvements in accuracy score will be made.\n",
"* Set limitations on the length of experiment run to 15 minutes.\n",
"* Finally, we set the task to be forecasting.\n",
"* We apply the lag lead operator to the target value i.e. we use the previous values as a predictor for the future ones."
"* We apply the lag lead operator to the target value i.e. we use the previous values as a predictor for the future ones.\n",
"* [Optional] Forecast frequency parameter (freq) represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
]
},
{

View File

@@ -82,7 +82,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.20.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.24.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
@@ -169,6 +169,10 @@
"source": [
"time_column_name = 'WeekStarting'\n",
"data = pd.read_csv(\"dominicks_OJ.csv\", parse_dates=[time_column_name])\n",
"\n",
"# Drop the columns 'logQuantity' as it is a leaky feature.\n",
"data.drop('logQuantity', axis=1, inplace=True)\n",
"\n",
"data.head()"
]
},
@@ -343,7 +347,6 @@
"outputs": [],
"source": [
"featurization_config = FeaturizationConfig()\n",
"featurization_config.drop_columns = ['logQuantity'] # 'logQuantity' is a leaky feature, so we remove it.\n",
"# Force the CPWVOL5 feature to be numeric type.\n",
"featurization_config.add_column_purpose('CPWVOL5', 'Numeric')\n",
"# Fill missing values in the target column, Quantity, with zeros.\n",
@@ -366,7 +369,8 @@
"|-|-|\n",
"|**time_column_name**|The name of your time column.|\n",
"|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n",
"|**time_series_id_column_names**|The column names used to uniquely identify the time series in data that has multiple rows with the same timestamp. If the time series identifiers are not defined, the data set is assumed to be one time series.|"
"|**time_series_id_column_names**|The column names used to uniquely identify the time series in data that has multiple rows with the same timestamp. If the time series identifiers are not defined, the data set is assumed to be one time series.|\n",
"|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information."
]
},
{

View File

@@ -96,7 +96,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.20.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.24.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},

View File

@@ -96,7 +96,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.20.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.24.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},

View File

@@ -92,7 +92,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.20.0 of the Azure ML SDK\")\n",
"print(\"This notebook was created using version 1.24.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
@@ -375,18 +375,12 @@
"metadata": {},
"outputs": [],
"source": [
"# preview the first 3 rows of the dataset\n",
"\n",
"test_data = test_data.to_pandas_dataframe()\n",
"y_test = test_data['ERP'].fillna(0)\n",
"test_data = test_data.drop('ERP', 1)\n",
"test_data = test_data.fillna(0)\n",
"y_test = test_data.keep_columns('ERP').to_pandas_dataframe()\n",
"test_data = test_data.drop_columns('ERP').to_pandas_dataframe()\n",
"\n",
"\n",
"train_data = train_data.to_pandas_dataframe()\n",
"y_train = train_data['ERP'].fillna(0)\n",
"train_data = train_data.drop('ERP', 1)\n",
"train_data = train_data.fillna(0)\n"
"y_train = train_data.keep_columns('ERP').to_pandas_dataframe()\n",
"train_data = train_data.drop_columns('ERP').to_pandas_dataframe()\n"
]
},
{
@@ -396,10 +390,10 @@
"outputs": [],
"source": [
"y_pred_train = fitted_model.predict(train_data)\n",
"y_residual_train = y_train - y_pred_train\n",
"y_residual_train = y_train.values - y_pred_train\n",
"\n",
"y_pred_test = fitted_model.predict(test_data)\n",
"y_residual_test = y_test - y_pred_test"
"y_residual_test = y_test.values - y_pred_test"
]
},
{

View File

@@ -0,0 +1,84 @@
Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It gives you the freedom to query data on your terms, using either serverless or dedicated resources at scale. Azure Synapse brings these worlds together with a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs. A core offering within Azure Synapse Analytics is serverless Apache Spark pools enhanced for big data workloads.
The Synapse integration in Azure ML is for customers who want to use Apache Spark in Azure Synapse Analytics to prepare data at scale in Azure ML before training their ML models. It allows customers to work on their end-to-end ML lifecycle, including large-scale data preparation, model training, and deployment, within the Azure ML workspace, without having to use suboptimal tools for machine learning or switch between multiple tools for data preparation and model training. The ability to perform all ML tasks within Azure ML reduces the time customers need to iterate on a machine learning project, which typically includes multiple rounds of data preparation and training.
The public preview provides the following capabilities:
- Link Azure Synapse Analytics workspace to Azure Machine Learning workspace (via ARM, UI or SDK)
- Attach Apache Spark pools powered by Azure Synapse Analytics as Azure Machine Learning compute targets (via ARM, UI or SDK)
- Launch Apache Spark sessions in notebooks and perform interactive data exploration and preparation. This interactive experience leverages Apache Spark magic, and customers have session-level Conda support to install packages.
- Productionize ML pipelines by leveraging Apache Spark pools to pre-process big data
# Using Synapse in Azure Machine Learning
## Create Synapse resources
Follow these documents to create a Synapse workspace; resource-setup.sh is also available to create the resources:
- Create from [Portal](https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-workspace)
- Create from [CLI](https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-workspace-cli)
Follow these documents to create a Synapse Spark pool:
- Create from [Portal](https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-apache-spark-pool-portal)
- Create from [CLI](https://docs.microsoft.com/en-us/cli/azure/ext/synapse/synapse/spark/pool?view=azure-cli-latest)
## Link Synapse Workspace
Make sure you are an owner of the Synapse workspace so that you can link it to the AML workspace.
You can run resource-setup.py to link the Synapse workspace and attach the compute:
```python
from azureml.core import Workspace
from azureml.core import LinkedService, SynapseWorkspaceLinkedServiceConfiguration

ws = Workspace.from_config()

# Configuration pointing at the Synapse workspace to link.
synapse_link_config = SynapseWorkspaceLinkedServiceConfiguration(
    subscription_id="<subscription id>",
    resource_group="<resource group>",
    name="<synapse workspace name>"
)

# Register the link between the AML workspace and the Synapse workspace.
linked_service = LinkedService.register(
    workspace=ws,
    name='<link name>',
    linked_service_config=synapse_link_config)
```
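If the workspace has already been linked, you can retrieve the existing link instead of registering it again. A minimal sketch, assuming the `LinkedService.get` and `LinkedService.list` class methods of the SDK:
```python
from azureml.core import LinkedService

# List the linked services registered on the AML workspace (assumed API).
for service in LinkedService.list(ws):
    print(service.name)

# Retrieve a specific link by the name it was registered under (assumed API).
linked_service = LinkedService.get(ws, '<link name>')
```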
## Attach the Synapse Spark pool as AzureML compute
```python
from azureml.core.compute import SynapseCompute, ComputeTarget

spark_pool_name = "<spark pool name>"
attached_synapse_name = "<attached compute name>"

attach_config = SynapseCompute.attach_configuration(
    linked_service,
    type="SynapseSpark",
    pool_name=spark_pool_name)

synapse_compute = ComputeTarget.attach(
    workspace=ws,
    name=attached_synapse_name,
    attach_configuration=attach_config)

synapse_compute.wait_for_completion()
```
## Set up permissions
Grant the Spark admin role to the system-assigned identity of the linked service so that users can submit experiment or pipeline runs from the AML workspace to the Synapse Spark pool.
Grant the Spark admin role to the specific user so that the user can start Spark sessions against the Synapse Spark pool.
You can get the system-assigned identity information by running:
```python
print(linked_service.system_assigned_identity_principal_id)
```
- Launch Synapse Studio for the Synapse workspace and grant the linked service MSI the "Synapse Apache Spark administrator" role.
- In the Azure portal, grant the linked service MSI the "Storage Blob Data Contributor" role on the primary ADLS Gen2 account of the Synapse workspace to use the library management feature.
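With permissions in place, the attached pool can pre-process big data as part of a pipeline. Below is a minimal sketch, assuming the `SynapseSparkStep` class from `azureml.pipeline.steps`; `dataprep.py` is a hypothetical PySpark script, and the driver/executor sizes are illustrative only:
```python
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import SynapseSparkStep  # assumed available in recent SDK versions

# Run a PySpark script on the attached Synapse Spark pool as a pipeline step.
step = SynapseSparkStep(
    name="dataprep",
    file="dataprep.py",              # hypothetical PySpark script in source_directory
    source_directory=".",
    compute_target=synapse_compute,  # the compute attached above
    driver_memory="4g",
    driver_cores=2,
    executor_memory="4g",
    executor_cores=2,
    num_executors=2)

pipeline = Pipeline(workspace=ws, steps=[step])
pipeline_run = Experiment(ws, "synapse-dataprep-sample").submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)
```
Running Spark data preparation as a pipeline step keeps large-scale wrangling and downstream training in a single AML pipeline, which is the end-to-end workflow this integration is meant to enable.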

View File

@@ -77,7 +77,7 @@
"source": [
"## Create trained model\n",
"\n",
"For this example, we will train a small model on scikit-learn's [diabetes dataset](https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset). "
"For this example, we will train a small model on scikit-learn's [diabetes dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset). "
]
},
{
@@ -382,13 +382,111 @@
"source": [
"## Update Service\n",
"\n",
"If you want to change your model(s), Conda dependencies, or deployment configuration, call `update()` to rebuild the Docker image.\n",
"\n",
"```python\n",
"local_service.update(models=[SomeOtherModelObject],\n",
"If you want to change your model(s), Conda dependencies or deployment configuration, call `update()` to rebuild the Docker image.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_service.update(models=[model],\n",
" inference_config=inference_config,\n",
" deployment_config=local_config)\n",
"```"
" deployment_config=deployment_config)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy model to AKS cluster based on the LocalWebservice's configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This is a one time setup for AKS Cluster. You can reuse this cluster for multiple deployments after it has been created. If you delete the cluster or the resource group that contains it, then you would have to recreate it.\n",
"from azureml.core.compute import AksCompute, ComputeTarget\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# Choose a name for your AKS cluster\n",
"aks_name = 'my-aks-9' \n",
"\n",
"# Verify the cluster does not exist already\n",
"try:\n",
" aks_target = ComputeTarget(workspace=ws, name=aks_name)\n",
" print('Found existing cluster, use it.')\n",
"except ComputeTargetException:\n",
" # Use the default configuration (can also provide parameters to customize)\n",
" prov_config = AksCompute.provisioning_configuration()\n",
"\n",
" # Create the cluster\n",
" aks_target = ComputeTarget.create(workspace = ws, \n",
" name = aks_name, \n",
" provisioning_configuration = prov_config)\n",
"\n",
"if aks_target.get_status() != \"Succeeded\":\n",
" aks_target.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.webservice import AksWebservice\n",
"# Set the web service configuration (using default here)\n",
"aks_config = AksWebservice.deploy_configuration()\n",
"\n",
"# # Enable token auth and disable (key) auth on the webservice\n",
"# aks_config = AksWebservice.deploy_configuration(token_auth_enabled=True, auth_enabled=False)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"aks_service_name ='aks-service-1'\n",
"\n",
"aks_service = local_service.deploy_to_cloud(name=aks_service_name,\n",
" deployment_config=aks_config,\n",
" deployment_target=aks_target)\n",
"\n",
"aks_service.wait_for_deployment(show_output = True)\n",
"print(aks_service.state)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Test aks service\n",
"\n",
"sample_input = json.dumps({\n",
" 'data': dataset_x[0:2].tolist()\n",
"})\n",
"\n",
"aks_service.run(sample_input)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Delete the service if not needed.\n",
"aks_service.delete()"
]
},
{

View File

@@ -94,6 +94,17 @@ def main():
os.makedirs(output_dir, exist_ok=True)
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
# Use Azure Open Datasets for MNIST dataset
datasets.MNIST.resources = [
("https://azureopendatastorage.azurefd.net/mnist/train-images-idx3-ubyte.gz",
"f68b3c2dcbeaaa9fbdd348bbdeb94873"),
("https://azureopendatastorage.azurefd.net/mnist/train-labels-idx1-ubyte.gz",
"d53e105ee54ea40749a09fcbcd1e9432"),
("https://azureopendatastorage.azurefd.net/mnist/t10k-images-idx3-ubyte.gz",
"9fb629c4189551a2d022fa330f9573f3"),
("https://azureopendatastorage.azurefd.net/mnist/t10k-labels-idx1-ubyte.gz",
"ec29112dd5afa0611ce80d1b7f02629c")
]
train_loader = torch.utils.data.DataLoader(
datasets.MNIST('data', train=True, download=True,
transform=transforms.Compose([transforms.ToTensor(),

View File

@@ -70,16 +70,16 @@
"\n",
"import urllib.request\n",
"\n",
"onnx_model_url = \"https://www.cntk.ai/OnnxModels/emotion_ferplus/opset_7/emotion_ferplus.tar.gz\"\n",
"onnx_model_url = \"https://github.com/onnx/models/blob/master/vision/body_analysis/emotion_ferplus/model/emotion-ferplus-7.tar.gz?raw=true\"\n",
"\n",
"urllib.request.urlretrieve(onnx_model_url, filename=\"emotion_ferplus.tar.gz\")\n",
"urllib.request.urlretrieve(onnx_model_url, filename=\"emotion-ferplus-7.tar.gz\")\n",
"\n",
"# the ! magic command tells our jupyter notebook kernel to run the following line of \n",
"# code from the command line instead of the notebook kernel\n",
"\n",
"# We use tar and xvcf to unzip the files we just retrieved from the ONNX model zoo\n",
"\n",
"!tar xvzf emotion_ferplus.tar.gz"
"!tar xvzf emotion-ferplus-7.tar.gz"
]
},
{
@@ -570,7 +570,7 @@
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize = (16, 6), frameon=False)\n",
"plt.figure(figsize = (16, 6))\n",
"plt.subplot(1, 8, 1)\n",
"\n",
"plt.text(x = 0, y = -30, s = \"True Label: \", fontsize = 13, color = 'black')\n",

View File

@@ -70,9 +70,9 @@
"\n",
"import urllib.request\n",
"\n",
"onnx_model_url = \"https://www.cntk.ai/OnnxModels/mnist/opset_7/mnist.tar.gz\"\n",
"onnx_model_url = \"https://github.com/onnx/models/blob/master/vision/classification/mnist/model/mnist-7.tar.gz?raw=true\"\n",
"\n",
"urllib.request.urlretrieve(onnx_model_url, filename=\"mnist.tar.gz\")"
"urllib.request.urlretrieve(onnx_model_url, filename=\"mnist-7.tar.gz\")"
]
},
{
@@ -86,7 +86,7 @@
"\n",
"# We use tar and xvcf to unzip the files we just retrieved from the ONNX model zoo\n",
"\n",
"!tar xvzf mnist.tar.gz"
"!tar xvzf mnist-7.tar.gz"
]
},
{
@@ -521,7 +521,7 @@
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize = (16, 6), frameon=False)\n",
"plt.figure(figsize = (16, 6))\n",
"plt.subplot(1, 8, 1)\n",
"\n",
"plt.text(x = 0, y = -30, s = \"True Label: \", fontsize = 13, color = 'black')\n",
@@ -684,18 +684,7 @@
"\n",
"A convolution layer is a set of filters. Each filter is defined by a weight (**W**) matrix, and bias ($b$).\n",
"\n",
"![](https://www.cntk.ai/jup/cntk103d_filterset_v2.png)\n",
"\n",
"These filters are scanned across the image performing the dot product between the weights and corresponding input value ($x$). The bias value is added to the output of the dot product and the resulting sum is optionally mapped through an activation function. This process is illustrated in the following animation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url=\"https://www.cntk.ai/jup/cntk103d_conv2d_final.gif\", width= 200)"
"These filters are scanned across the image performing the dot product between the weights and corresponding input value ($x$). The bias value is added to the output of the dot product and the resulting sum is optionally mapped through an activation function."
]
},
{
@@ -707,24 +696,6 @@
"The MNIST model from the ONNX Model Zoo uses maxpooling to update the weights in its convolutions, summarized by the graphic below. You can see the entire workflow of our pre-trained model in the following image, with our input images and our output probabilities of each of our 10 labels. If you're interested in exploring the logic behind creating a Deep Learning model further, please look at the [training tutorial for our ONNX MNIST Convolutional Neural Network](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_103D_MNIST_ConvolutionalNeuralNetwork.ipynb). "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Max-Pooling for Convolutional Neural Nets\n",
"\n",
"![](http://www.cntk.ai/jup/c103d_max_pooling.gif)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Pre-Trained Model Architecture\n",
"\n",
"![](http://www.cntk.ai/jup/conv103d_mnist-conv-mp.png)"
]
},
{
"cell_type": "code",
"execution_count": null,

View File

@@ -259,7 +259,7 @@
"run_config.environment.docker.enabled = True\n",
"\n",
"azureml_pip_packages = [\n",
" 'azureml-defaults', 'azureml-contrib-interpret', 'azureml-telemetry', 'azureml-interpret'\n",
" 'azureml-defaults', 'azureml-telemetry', 'azureml-interpret'\n",
"]\n",
"\n",
"# Note: this is to pin the scikit-learn and pandas versions to be same as notebook.\n",

View File

@@ -3,9 +3,11 @@ dependencies:
- pip:
- azureml-sdk
- azureml-interpret
- interpret-community[visualization]
- flask
- flask-cors
- gevent>=1.3.6
- jinja2
- ipython
- matplotlib
- azureml-contrib-interpret
- sklearn-pandas<2.0.0
- azureml-dataset-runtime
- ipywidgets

View File

@@ -57,7 +57,7 @@
"Problem: IBM employee attrition classification with scikit-learn (run model explainer locally and upload explanation to the Azure Machine Learning Run History)\n",
"\n",
"1. Train a SVM classification model using Scikit-learn\n",
"2. Run 'explain_model' with AML Run History, which leverages run history service to store and manage the explanation data\n",
"2. Run 'explain-model-sample' with AML Run History, which leverages run history service to store and manage the explanation data\n",
"---\n",
"\n",
"Setup: If you are using Jupyter notebooks, the extensions should be installed automatically with the package.\n",
@@ -226,36 +226,6 @@
" ('classifier', SVC(C=1.0, probability=True))])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'''\n",
"# Uncomment below if sklearn-pandas is not installed\n",
"#!pip install sklearn-pandas\n",
"from sklearn_pandas import DataFrameMapper\n",
"\n",
"# Impute, standardize the numeric features and one-hot encode the categorical features. \n",
"\n",
"\n",
"numeric_transformations = [([f], Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])) for f in numerical]\n",
"\n",
"categorical_transformations = [([f], OneHotEncoder(handle_unknown='ignore', sparse=False)) for f in categorical]\n",
"\n",
"transformations = numeric_transformations + categorical_transformations\n",
"\n",
"# Append classifier to preprocessing pipeline.\n",
"# Now we have a full prediction pipeline.\n",
"clf = Pipeline(steps=[('preprocessor', transformations),\n",
" ('classifier', SVC(C=1.0, probability=True))]) \n",
"\n",
"\n",
"\n",
"'''"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -475,7 +445,7 @@
"metadata": {},
"outputs": [],
"source": [
"experiment_name = 'explain_model'\n",
"experiment_name = 'explain-model-sample'\n",
"experiment = Experiment(ws, experiment_name)\n",
"run = experiment.start_logging()\n",
"client = ExplanationClient.from_run(run)"

View File

@@ -3,7 +3,10 @@ dependencies:
- pip:
- azureml-sdk
- azureml-interpret
- interpret-community[visualization]
- flask
- flask-cors
- gevent>=1.3.6
- jinja2
- ipython
- matplotlib
- azureml-contrib-interpret
- ipywidgets

View File

@@ -166,12 +166,12 @@
"source": [
"from sklearn.model_selection import train_test_split\n",
"import joblib\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn_pandas import DataFrameMapper\n",
"\n",
"from interpret.ext.blackbox import TabularExplainer\n",
"\n",
@@ -201,17 +201,23 @@
"# Store the numerical columns in a list numerical\n",
"numerical = attritionXData.columns.difference(categorical)\n",
"\n",
"numeric_transformations = [([f], Pipeline(steps=[\n",
"# We create the preprocessing pipelines for both numeric and categorical data.\n",
"numeric_transformer = Pipeline(steps=[\n",
" ('imputer', SimpleImputer(strategy='median')),\n",
" ('scaler', StandardScaler())])) for f in numerical]\n",
" ('scaler', StandardScaler())])\n",
"\n",
"categorical_transformations = [([f], OneHotEncoder(handle_unknown='ignore', sparse=False)) for f in categorical]\n",
"categorical_transformer = Pipeline(steps=[\n",
" ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),\n",
" ('onehot', OneHotEncoder(handle_unknown='ignore'))])\n",
"\n",
"transformations = numeric_transformations + categorical_transformations\n",
"transformations = ColumnTransformer(\n",
" transformers=[\n",
" ('num', numeric_transformer, numerical),\n",
" ('cat', categorical_transformer, categorical)])\n",
"\n",
"# Append classifier to preprocessing pipeline.\n",
"# Now we have a full prediction pipeline.\n",
"clf = Pipeline(steps=[('preprocessor', DataFrameMapper(transformations)),\n",
"clf = Pipeline(steps=[('preprocessor', transformations),\n",
" ('classifier', RandomForestClassifier())])\n",
"\n",
"# Split data into train and test\n",
@@ -323,7 +329,7 @@
"\n",
"# azureml-defaults is required to host the model as a web service.\n",
"azureml_pip_packages = [\n",
" 'azureml-defaults', 'azureml-contrib-interpret', 'azureml-core', 'azureml-telemetry',\n",
" 'azureml-defaults', 'azureml-core', 'azureml-telemetry',\n",
" 'azureml-interpret'\n",
"]\n",
" \n",
@@ -350,7 +356,7 @@
"# the submitted job is run in. Note the remote environment(s) needs to be similar to the local\n",
"# environment, otherwise if a model is trained or deployed in a different environment this can\n",
"# cause errors. Please take extra care when specifying your dependencies in a production environment.\n",
"myenv = CondaDependencies.create(pip_packages=['sklearn-pandas', 'pyyaml', sklearn_dep, pandas_dep] + azureml_pip_packages,\n",
"myenv = CondaDependencies.create(pip_packages=['pyyaml', sklearn_dep, pandas_dep] + azureml_pip_packages,\n",
" pin_sdk_version=False)\n",
"\n",
"with open(\"myenv.yml\",\"w\") as f:\n",

View File

@@ -3,8 +3,10 @@ dependencies:
- pip:
- azureml-sdk
- azureml-interpret
- interpret-community[visualization]
- flask
- flask-cors
- gevent>=1.3.6
- jinja2
- ipython
- matplotlib
- azureml-contrib-interpret
- sklearn-pandas<2.0.0
- ipywidgets

View File

@@ -267,7 +267,7 @@
"run_config.environment.python.user_managed_dependencies = False\n",
"\n",
"azureml_pip_packages = [\n",
" 'azureml-defaults', 'azureml-contrib-interpret', 'azureml-telemetry', 'azureml-interpret'\n",
" 'azureml-defaults', 'azureml-telemetry', 'azureml-interpret'\n",
"]\n",
" \n",
"\n",
@@ -294,7 +294,7 @@
"# the submitted job is run in. Note the remote environment(s) needs to be similar to the local\n",
"# environment, otherwise if a model is trained or deployed in a different environment this can\n",
"# cause errors. Please take extra care when specifying your dependencies in a production environment.\n",
"azureml_pip_packages.extend(['sklearn-pandas', 'pyyaml', sklearn_dep, pandas_dep])\n",
"azureml_pip_packages.extend(['pyyaml', sklearn_dep, pandas_dep])\n",
"run_config.environment.python.conda_dependencies = CondaDependencies.create(pip_packages=azureml_pip_packages)\n",
"# Now submit a run on AmlCompute\n",
"from azureml.core.script_run_config import ScriptRunConfig\n",
@@ -431,7 +431,7 @@
"\n",
"# WARNING: to install this, g++ needs to be available on the Docker image and is not by default (look at the next cell)\n",
"azureml_pip_packages = [\n",
" 'azureml-defaults', 'azureml-contrib-interpret', 'azureml-core', 'azureml-telemetry',\n",
" 'azureml-defaults', 'azureml-core', 'azureml-telemetry',\n",
" 'azureml-interpret'\n",
"]\n",
" \n",
@@ -458,7 +458,7 @@
"# the submitted job is run in. Note the remote environment(s) needs to be similar to the local\n",
"# environment, otherwise if a model is trained or deployed in a different environment this can\n",
"# cause errors. Please take extra care when specifying your dependencies in a production environment.\n",
"azureml_pip_packages.extend(['sklearn-pandas', 'pyyaml', sklearn_dep, pandas_dep])\n",
"azureml_pip_packages.extend(['pyyaml', sklearn_dep, pandas_dep])\n",
"myenv = CondaDependencies.create(pip_packages=azureml_pip_packages)\n",
"\n",
"with open(\"myenv.yml\",\"w\") as f:\n",

View File

@@ -3,10 +3,12 @@ dependencies:
- pip:
- azureml-sdk
- azureml-interpret
- interpret-community[visualization]
- flask
- flask-cors
- gevent>=1.3.6
- jinja2
- ipython
- matplotlib
- azureml-contrib-interpret
- sklearn-pandas<2.0.0
- azureml-dataset-runtime
- azureml-core
- ipywidgets

View File

@@ -5,13 +5,13 @@
import os
import pandas as pd
import zipfile
from sklearn.model_selection import train_test_split
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn_pandas import DataFrameMapper
from azureml.core.run import Run
from interpret.ext.blackbox import TabularExplainer
@@ -57,16 +57,22 @@ for col, value in attritionXData.iteritems():
# store the numerical columns
numerical = attritionXData.columns.difference(categorical)
numeric_transformations = [([f], Pipeline(steps=[
# We create the preprocessing pipelines for both numeric and categorical data.
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])) for f in numerical]
('scaler', StandardScaler())])
categorical_transformations = [([f], OneHotEncoder(handle_unknown='ignore', sparse=False)) for f in categorical]
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
transformations = numeric_transformations + categorical_transformations
transformations = ColumnTransformer(
transformers=[
('num', numeric_transformer, numerical),
('cat', categorical_transformer, categorical)])
# append classifier to preprocessing pipeline
clf = Pipeline(steps=[('preprocessor', DataFrameMapper(transformations)),
clf = Pipeline(steps=[('preprocessor', transformations),
('classifier', LogisticRegression(solver='lbfgs'))])
# get the run this was submitted from to interact with run history

View File

@@ -9,7 +9,7 @@ These notebooks below are designed to go in sequence.
4. [aml-pipelines-data-transfer.ipynb](https://aka.ms/pl-data-trans): This notebook shows how you transfer data between supported datastores.
5. [aml-pipelines-use-databricks-as-compute-target.ipynb](https://aka.ms/pl-databricks): This notebooks shows how you can use Pipelines to send your compute payload to Azure Databricks.
6. [aml-pipelines-use-adla-as-compute-target.ipynb](https://aka.ms/pl-adla): This notebook shows how you can use Azure Data Lake Analytics (ADLA) as a compute target.
7. [aml-pipelines-how-to-use-estimatorstep.ipynb](https://aka.ms/pl-estimator): This notebook shows how to use the EstimatorStep.
7. [aml-pipelines-with-commandstep.ipynb](aml-pipelines-with-commandstep.ipynb): This notebook shows how to use the CommandStep.
8. [aml-pipelines-parameter-tuning-with-hyperdrive.ipynb](https://aka.ms/pl-hyperdrive): HyperDriveStep in Pipelines shows how you can do hyper parameter tuning using Pipelines.
9. [aml-pipelines-how-to-use-azurebatch-to-run-a-windows-executable.ipynb](https://aka.ms/pl-azbatch): AzureBatchStep can be used to run your custom code in AzureBatch cluster.
10. [aml-pipelines-setup-schedule-for-a-published-pipeline.ipynb](https://aka.ms/pl-schedule): Once you publish a Pipeline, you can schedule it to trigger based on an interval or on data change in a defined datastore.
@@ -19,5 +19,6 @@ These notebooks below are designed to go in sequence.
14. [aml-pipelines-how-to-use-pipeline-drafts.ipynb](http://aka.ms/pl-pl-draft): This notebook shows how to use Pipeline Drafts. Pipeline Drafts are mutable pipelines which can be used to submit runs and create Published Pipelines.
15. [aml-pipelines-hot-to-use-modulestep.ipynb](https://aka.ms/pl-modulestep): This notebook shows how to define Module, ModuleVersion and how to use them in an AML Pipeline using ModuleStep.
16. [aml-pipelines-with-notebook-runner-step.ipynb](https://aka.ms/pl-nbrstep): This notebook shows how you can run another notebook as a step in Azure Machine Learning Pipeline.
17. [aml-pipelines-with-commandstep-r.ipynb](aml-pipelines-with-commandstep-r.ipynb): This notebook shows how to use CommandStep to run R scripts.
![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/README.png)

View File

@@ -22,6 +22,8 @@
"# Azure Machine Learning Pipeline with DataTransferStep\n",
"This notebook is used to demonstrate the use of DataTransferStep in an Azure Machine Learning Pipeline.\n",
"\n",
"> **Note:** In Azure Machine Learning, you can write output data directly to Azure Blob Storage, Azure Data Lake Storage Gen 1, Azure Data Lake Storage Gen 2, Azure FileShare without going through extra DataTransferStep. Learn how to use [OutputFileDatasetConfig](https://docs.microsoft.com/python/api/azureml-core/azureml.data.output_dataset_config.outputfiledatasetconfig?view=azure-ml-py) to achieve that with sample notebooks [here](https://aka.ms/pipeline-with-dataset).**\n",
"\n",
"In certain cases, you will need to transfer data from one data location to another. For example, your data may be in Azure SQL Database and you may want to move it to Azure Data Lake storage. Or, your data is in an ADLS account and you want to make it available in the Blob storage. The built-in **DataTransferStep** class helps you transfer data in these situations.\n",
"\n",
"The below examples show how to move data between different storage types supported in Azure Machine Learning.\n",

View File

@@ -341,7 +341,7 @@
"outputs": [],
"source": [
"pipeline = Pipeline(workspace=ws, steps=[step])\n",
"pipeline_run = Experiment(ws, 'azurebatch_experiment').submit(pipeline)"
"pipeline_run = Experiment(ws, 'azurebatch_sample').submit(pipeline)"
]
},
{

View File

@@ -130,7 +130,7 @@
"\n",
"pipeline_draft = PipelineDraft.create(ws, name=\"TestPipelineDraft\",\n",
" description=\"draft description\",\n",
" experiment_name=\"helloworld\",\n",
" experiment_name=\"pipeline_draft_sample\",\n",
" pipeline=pipeline,\n",
" continue_on_step_failure=True,\n",
" tags={'dev': 'true'},\n",

View File

@@ -121,12 +121,17 @@
"metadata": {},
"outputs": [],
"source": [
"os.makedirs('./data/mnist', exist_ok=True)\n",
"data_folder = os.path.join(os.getcwd(), 'data/mnist')\n",
"os.makedirs(data_folder, exist_ok=True)\n",
"\n",
"urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz', filename = './data/mnist/train-images.gz')\n",
"urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz', filename = './data/mnist/train-labels.gz')\n",
"urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', filename = './data/mnist/test-images.gz')\n",
"urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', filename = './data/mnist/test-labels.gz')"
"urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz',\n",
" filename=os.path.join(data_folder, 'train-images-idx3-ubyte.gz'))\n",
"urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/train-labels-idx1-ubyte.gz',\n",
" filename=os.path.join(data_folder, 'train-labels-idx1-ubyte.gz'))\n",
"urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/t10k-images-idx3-ubyte.gz',\n",
" filename=os.path.join(data_folder, 't10k-images-idx3-ubyte.gz'))\n",
"urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/t10k-labels-idx1-ubyte.gz',\n",
" filename=os.path.join(data_folder, 't10k-labels-idx1-ubyte.gz'))"
]
},
{
@@ -146,11 +151,11 @@
"from utils import load_data\n",
"\n",
"# note we also shrink the intensity values (X) from 0-255 to 0-1. This helps the neural network converge faster.\n",
"X_train = load_data('./data/mnist/train-images.gz', False) / 255.0\n",
"y_train = load_data('./data/mnist/train-labels.gz', True).reshape(-1)\n",
"X_train = load_data(os.path.join(data_folder, 'train-images-idx3-ubyte.gz'), False) / np.float32(255.0)\n",
"X_test = load_data(os.path.join(data_folder, 't10k-images-idx3-ubyte.gz'), False) / np.float32(255.0)\n",
"y_train = load_data(os.path.join(data_folder, 'train-labels-idx1-ubyte.gz'), True).reshape(-1)\n",
"y_test = load_data(os.path.join(data_folder, 't10k-labels-idx1-ubyte.gz'), True).reshape(-1)\n",
"\n",
"X_test = load_data('./data/mnist/test-images.gz', False) / 255.0\n",
"y_test = load_data('./data/mnist/test-labels.gz', True).reshape(-1)\n",
"\n",
"count = 0\n",
"sample_size = 30\n",

View File

@@ -41,14 +41,14 @@
"source": [
"import azureml.core\n",
"from azureml.core import Workspace, Datastore, Experiment, Dataset\n",
"from azureml.data import OutputFileDatasetConfig\n",
"from azureml.core.compute import AmlCompute\n",
"from azureml.core.compute import ComputeTarget\n",
"\n",
"# Check core SDK version number\n",
"print(\"SDK version:\", azureml.core.VERSION)\n",
"\n",
"from azureml.data.data_reference import DataReference\n",
"from azureml.pipeline.core import Pipeline, PipelineData\n",
"from azureml.pipeline.core import Pipeline\n",
"from azureml.pipeline.steps import PythonScriptStep\n",
"from azureml.pipeline.core.graph import PipelineParameter\n",
"\n",
@@ -140,9 +140,9 @@
"metadata": {},
"outputs": [],
"source": [
"# Define intermediate data using PipelineData\n",
"processed_data1 = PipelineData(\"processed_data1\",datastore=def_blob_store)\n",
"print(\"PipelineData object created\")"
"# Define intermediate data using OutputFileDatasetConfig\n",
"processed_data1 = OutputFileDatasetConfig(name=\"processed_data1\")\n",
"print(\"Output dataset object created\")"
]
},
{
@@ -170,9 +170,7 @@
"\n",
"trainStep = PythonScriptStep(\n",
" script_name=\"train.py\", \n",
" arguments=[\"--input_data\", blob_input_data, \"--output_train\", processed_data1],\n",
" inputs=[blob_input_data],\n",
" outputs=[processed_data1],\n",
" arguments=[\"--input_data\", blob_input_data.as_mount(), \"--output_train\", processed_data1],\n",
" compute_target=aml_compute, \n",
" source_directory=source_directory\n",
")\n",
@@ -195,16 +193,14 @@
"metadata": {},
"outputs": [],
"source": [
"# extractStep to use the intermediate data produced by step4\n",
"# extractStep to use the intermediate data produced by trainStep\n",
"# This step also produces an output processed_data2\n",
"processed_data2 = PipelineData(\"processed_data2\", datastore=def_blob_store)\n",
"processed_data2 = OutputFileDatasetConfig(name=\"processed_data2\")\n",
"source_directory = \"publish_run_extract\"\n",
"\n",
"extractStep = PythonScriptStep(\n",
" script_name=\"extract.py\",\n",
" arguments=[\"--input_extract\", processed_data1, \"--output_extract\", processed_data2],\n",
" inputs=[processed_data1],\n",
" outputs=[processed_data2],\n",
" arguments=[\"--input_extract\", processed_data1.as_input(), \"--output_extract\", processed_data2],\n",
" compute_target=aml_compute, \n",
" source_directory=source_directory)\n",
"print(\"extractStep created\")"
@@ -256,15 +252,17 @@
"metadata": {},
"outputs": [],
"source": [
"# Now define step6 that takes two inputs (both intermediate data), and produce an output\n",
"processed_data3 = PipelineData(\"processed_data3\", datastore=def_blob_store)\n",
"# Now define compareStep that takes two inputs (both intermediate data), and produce an output\n",
"processed_data3 = OutputFileDatasetConfig(name=\"processed_data3\")\n",
"\n",
"# You can register the output as dataset after job completion\n",
"processed_data3 = processed_data3.register_on_complete(\"compare_result\")\n",
"\n",
"source_directory = \"publish_run_compare\"\n",
"\n",
"compareStep = PythonScriptStep(\n",
" script_name=\"compare.py\",\n",
" arguments=[\"--compare_data1\", processed_data1, \"--compare_data2\", processed_data2, \"--output_compare\", processed_data3, \"--pipeline_param\", pipeline_param],\n",
" inputs=[processed_data1, processed_data2],\n",
" outputs=[processed_data3], \n",
" arguments=[\"--compare_data1\", processed_data1.as_input(), \"--compare_data2\", processed_data2.as_input(), \"--output_compare\", processed_data3, \"--pipeline_param\", pipeline_param], \n",
" compute_target=aml_compute, \n",
" source_directory=source_directory)\n",
"print(\"compareStep created\")"
@@ -327,7 +325,7 @@
"outputs": [],
"source": [
"# submit a pipeline run\n",
"pipeline_run1 = Experiment(ws, 'Pipeline_experiment').submit(pipeline1)\n",
"pipeline_run1 = Experiment(ws, 'Pipeline_experiment_sample').submit(pipeline1)\n",
"# publish a pipeline from the submitted pipeline run\n",
"published_pipeline2 = pipeline_run1.publish_pipeline(name=\"My_New_Pipeline2\", description=\"My Published Pipeline Description\", version=\"0.1\", continue_on_step_failure=True)\n",
"published_pipeline2"

View File

@@ -259,7 +259,7 @@
"\n",
"schedule = Schedule.create(workspace=ws, name=\"My_Schedule\",\n",
" pipeline_id=pub_pipeline_id, \n",
" experiment_name='Schedule_Run',\n",
" experiment_name='Schedule-run-sample',\n",
" recurrence=recurrence,\n",
" wait_for_provisioning=True,\n",
" description=\"Schedule Run\")\n",
@@ -445,7 +445,7 @@
"\n",
"schedule = Schedule.create(workspace=ws, name=\"My_Schedule\",\n",
" pipeline_id=pub_pipeline_id, \n",
" experiment_name='Schedule_Run',\n",
" experiment_name='Schedule-run-sample',\n",
" datastore=datastore,\n",
" wait_for_provisioning=True,\n",
" description=\"Schedule Run\")\n",
@@ -516,7 +516,7 @@
"\n",
"schedule = Schedule.create_for_pipeline_endpoint(workspace=ws, name=\"My_Endpoint_Schedule\",\n",
" pipeline_endpoint_id=published_pipeline_endpoint_id,\n",
" experiment_name='Schedule_Run',\n",
" experiment_name='Schedule-run-sample',\n",
" recurrence=recurrence, description=\"Schedule_Run\",\n",
" wait_for_provisioning=True)\n",
"\n",

View File

@@ -553,7 +553,7 @@
"outputs": [],
"source": [
"from azureml.core import Experiment\n",
"pipeline_run = Experiment(ws, name=\"submit_from_endpoint\").submit(pipeline_endpoint_by_name, tags={'endpoint_tag': \"1\"}, pipeline_version=\"0\")"
"pipeline_run = Experiment(ws, name=\"submit_endpoint_sample\").submit(pipeline_endpoint_by_name, tags={'endpoint_tag': \"1\"}, pipeline_version=\"0\")"
]
}
],

View File

@@ -101,7 +101,7 @@
"metadata": {},
"source": [
"## Create an Azure ML experiment\n",
"Let's create an experiment named \"automlstep-classification\" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.\n",
"Let's create an experiment named \"automlstep-sample\" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.\n",
"\n",
"The best practice is to use separate folders for scripts and its dependent files for each step and specify that folder as the `source_directory` for the step. This helps reduce the size of the snapshot created for the step (only the specific folder is snapshotted). Since changes in any files in the `source_directory` would trigger a re-upload of the snapshot, this helps keep the reuse of the step when there are no changes in the `source_directory` of the step."
]
@@ -113,7 +113,7 @@
"outputs": [],
"source": [
"# Choose a name for the run history container in the workspace.\n",
"experiment_name = 'automlstep-classification'\n",
"experiment_name = 'automlstep-sample'\n",
"project_folder = './project'\n",
"\n",
"experiment = Experiment(ws, experiment_name)\n",

View File

@@ -0,0 +1,343 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-how-to-use-estimatorstep.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to use CommandStep in Azure ML Pipelines\n",
"\n",
"This notebook shows how to use the CommandStep with Azure Machine Learning Pipelines for running R scripts in a pipeline.\n",
"\n",
"The example shows training a model in R to predict probability of fatality for vehicle crashes.\n",
"\n",
"\n",
"## Prerequisite:\n",
"* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning\n",
"* If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](https://aka.ms/pl-config) to:\n",
" * install the Azure ML SDK\n",
" * create a workspace and its configuration file (`config.json`)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's get started. First let's import some Python libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.core\n",
"# check core SDK version number\n",
"print(\"Azure ML SDK Version: \", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize workspace\n",
"Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Workspace\n",
"ws = Workspace.from_config()\n",
"print('Workspace name: ' + ws.name, \n",
" 'Azure region: ' + ws.location, \n",
" 'Subscription id: ' + ws.subscription_id, \n",
" 'Resource group: ' + ws.resource_group, sep = '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create or Attach existing AmlCompute\n",
"You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, you create `AmlCompute` as your training compute resource."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we could not find the cluster with the given name, then we will create a new cluster here. We will create an `AmlCompute` cluster of `STANDARD_D2_V2` CPU VMs. This process is broken down into 3 steps:\n",
"1. create the configuration (this step is local and only takes a second)\n",
"2. create the cluster (this step will take about **20 seconds**)\n",
"3. provision the VMs to bring the cluster to the initial size (of 1 in this case). This step will take about **3-5 minutes** and is providing only sparse output in the process. Please make sure to wait until the call returns before moving to the next cell"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# choose a name for your cluster\n",
"cluster_name = \"cpu-cluster\"\n",
"\n",
"try:\n",
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
" print('Found existing compute target')\n",
"except ComputeTargetException:\n",
" print('Creating a new compute target...')\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)\n",
"\n",
" # create the cluster\n",
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
"\n",
" # can poll for a minimum number of nodes and for a specific timeout. \n",
" # if no min node count is provided it uses the scale settings for the cluster\n",
" compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
"\n",
"# use get_status() to get a detailed status for the current cluster. \n",
"print(compute_target.get_status().serialize())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you have created the compute target, let's see what the workspace's `compute_targets` property returns. You should now see one entry named 'cpu-cluster' of type `AmlCompute`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a CommandStep\n",
"CommandStep adds a step to run a command in a Pipeline. For the full set of configurable options see the CommandStep [reference docs](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.commandstep?view=azure-ml-py).\n",
"\n",
"- **name:** Name of the step\n",
"- **runconfig:** ScriptRunConfig object. You can configure a ScriptRunConfig object as you would for a standalone non-pipeline run and pass it in to this parameter. If using this option, you do not have to specify the `command`, `source_directory`, `compute_target` parameters of the CommandStep constructor as they are already defined in your ScriptRunConfig.\n",
"- **runconfig_pipeline_params:** Override runconfig properties at runtime using key-value pairs each with name of the runconfig property and PipelineParameter for that property\n",
"- **command:** The command to run or path of the executable/script relative to `source_directory`. It is required unless the `runconfig` parameter is specified. It can be specified with string arguments in a single string or with input/output/PipelineParameter in a list.\n",
"- **source_directory:** A folder containing the script and other resources used in the step.\n",
"- **compute_target:** Compute target to use \n",
"- **allow_reuse:** Whether the step should reuse previous results when run with the same settings/inputs. If this is false, a new run will always be generated for this step during pipeline execution.\n",
"- **version:** Optional version tag to denote a change in functionality for the step\n",
"\n",
"> The best practice is to use separate folders for scripts and its dependent files for each step and specify that folder as the `source_directory` for the step. This helps reduce the size of the snapshot created for the step (only the specific folder is snapshotted). Since changes in any files in the `source_directory` would trigger a re-upload of the snapshot, this helps keep the reuse of the step when there are no changes in the `source_directory` of the step."
]
},
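{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hypothetical sketch (the names here are illustrative, and the actual command for this notebook is configured below), the two forms of `command` look like this:\n",
"```Python\n",
"# single string, with the arguments baked into the command\n",
"command = 'Rscript accidents.R --output_folder outputs'\n",
"\n",
"# list form, mixing strings with an input dataset\n",
"command = ['Rscript accidents.R --data_folder', dataset.as_mount(), '--output_folder outputs']\n",
"```"
]
},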
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure environment\n",
"\n",
"Configure the environment for the train step. In this example we will create an environment from the Dockerfile we have included.\n",
"\n",
"> Azure ML currently requires Python as an implicit dependency, so Python must installed in your image even if your training script does not have this dependency."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Environment\n",
"import os\n",
"\n",
"src_dir = 'commandstep_r'\n",
"\n",
"env = Environment.from_dockerfile(name='r_env', dockerfile=os.path.join(src_dir, 'Dockerfile'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure input training dataset\n",
"\n",
"This tutorial uses data from the US National Highway Traffic Safety Administration. This dataset includes data from over 25,000 car crashes in the US, with variables you can use to predict the likelihood of a fatality. We have included an Rdata file that includes the accidents data for analysis.\n",
"\n",
"Here we use the workspace's default datastore to upload the training data file (**accidents.Rd**); in practice you can use any datastore you want."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datastore = ws.get_default_datastore()\n",
"data_ref = datastore.upload_files(files=[os.path.join(src_dir, 'accidents.Rd')], target_path='accidentdata')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now create a FileDataset from the data, which will be used as an input to the train step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Dataset\n",
"dataset = Dataset.File.from_files(datastore.path('accidentdata'))\n",
"dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now create a ScriptRunConfig that configures the training run. Note that in the `command` we include the input dataset for the training data.\n",
"\n",
"> For detailed guidance on how to move data in pipelines for input and output data, see the documentation [Moving data into and between ML pipelines](https://docs.microsoft.com/azure/machine-learning/how-to-move-data-in-out-of-pipelines)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import ScriptRunConfig\n",
"\n",
"train_config = ScriptRunConfig(source_directory=src_dir,\n",
" command=['Rscript accidents.R --data_folder', dataset.as_mount(), '--output_folder outputs'],\n",
" compute_target=compute_target,\n",
" environment=env)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now create a CommandStep and pass in the ScriptRunConfig object to the `runconfig` parameter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.steps import CommandStep\n",
"\n",
"train = CommandStep(name='train', runconfig=train_config)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build and Submit the Pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.core import Pipeline\n",
"from azureml.core import Experiment\n",
"\n",
"pipeline = Pipeline(workspace=ws, steps=[train])\n",
"pipeline_run = Experiment(ws, 'r-commandstep-pipeline').submit(pipeline)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## View Run Details"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(pipeline_run).show()"
]
}
],
"metadata": {
"authors": [
{
"name": "minxia"
}
],
"category": "tutorial",
"compute": [
"AML Compute"
],
"datasets": [
"Custom"
],
"deployment": [
"None"
],
"exclude_from_index": false,
"framework": [
"Azure ML"
],
"friendly_name": "Azure Machine Learning Pipeline with CommandStep for R",
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
},
"order_index": 7,
"star_tag": [
"None"
],
"tags": [
"None"
],
"task": "Demonstrates the use of CommandStep for running R scripts"
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -20,15 +20,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to use EstimatorStep in AML Pipeline\n",
"# How to use CommandStep in Azure ML Pipelines\n",
"\n",
"This notebook shows how to use the EstimatorStep with Azure Machine Learning Pipelines. Estimator is a convenient object in Azure Machine Learning that wraps run configuration information to help simplify the tasks of specifying how a script is executed.\n",
"This notebook shows how to use the CommandStep with Azure Machine Learning Pipelines for running commands in steps. The example shows running distributed TensorFlow training from within a pipeline.\n",
"\n",
"\n",
"## Prerequisite:\n",
"* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning\n",
"* If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](https://aka.ms/pl-config) to:\n",
" * install the AML SDK\n",
" * install the Azure ML SDK\n",
" * create a workspace and its configuration file (`config.json`)"
]
},
@@ -100,75 +100,57 @@
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# choose a name for your cluster\n",
"cluster_name = \"amlcomp\"\n",
"cluster_name = \"gpu-cluster\"\n",
"\n",
"try:\n",
" cpu_cluster = ComputeTarget(workspace=ws, name=cluster_name)\n",
" gpu_cluster = ComputeTarget(workspace=ws, name=cluster_name)\n",
" print('Found existing compute target')\n",
"except ComputeTargetException:\n",
" print('Creating a new compute target...')\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', max_nodes=4)\n",
"\n",
" # create the cluster\n",
" cpu_cluster = ComputeTarget.create(ws, cluster_name, compute_config)\n",
" gpu_cluster = ComputeTarget.create(ws, cluster_name, compute_config)\n",
"\n",
" # can poll for a minimum number of nodes and for a specific timeout. \n",
" # if no min node count is provided it uses the scale settings for the cluster\n",
" cpu_cluster.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
" gpu_cluster.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
"\n",
"# use get_status() to get a detailed status for the current cluster. \n",
"print(cpu_cluster.get_status().serialize())"
"print(gpu_cluster.get_status().serialize())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you have created the compute target, let's see what the workspace's `compute_targets` property returns. You should now see one entry named 'cpu-cluster' of type `AmlCompute`."
"Now that you have created the compute target, let's see what the workspace's `compute_targets` property returns. You should now see one entry named 'gpu-cluster' of type `AmlCompute`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use a simple script\n",
"We have already created a simple \"hello world\" script. This is the script that we will submit through the estimator pattern. It prints a hello-world message, and if Azure ML SDK is installed, it will also logs an array of values ([Fibonacci numbers](https://en.wikipedia.org/wiki/Fibonacci_number))."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build an Estimator object\n",
"Estimator by default will attempt to use Docker-based execution. You can also enable Docker and let estimator pick the default CPU image supplied by Azure ML for execution. You can target an AmlCompute cluster (or any other supported compute target types). You can also customize the conda environment by adding conda and/or pip packages.\n",
"## Create a CommandStep\n",
"CommandStep adds a step to run a command in a Pipeline. For the full set of configurable options see the CommandStep [reference docs](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.commandstep?view=azure-ml-py).\n",
"\n",
"> Note: The arguments to the entry script used in the Estimator object should be specified as *list* using\n",
" 'estimator_entry_script_arguments' parameter when instantiating EstimatorStep. Estimator object's parameter\n",
" 'script_params' accepts a dictionary. However 'estimator_entry_script_arguments' parameter expects arguments as\n",
" a list.\n",
"- **name:** Name of the step\n",
"- **runconfig:** ScriptRunConfig object. You can configure a ScriptRunConfig object as you would for a standalone non-pipeline run and pass it in to this parameter. If using this option, you do not have to specify the `command`, `source_directory`, `compute_target` parameters of the CommandStep constructor as they are already defined in your ScriptRunConfig.\n",
"- **runconfig_pipeline_params:** Override runconfig properties at runtime using key-value pairs each with name of the runconfig property and PipelineParameter for that property\n",
"- **command:** The command to run or path of the executable/script relative to `source_directory`. It is required unless the `runconfig` parameter is specified. It can be specified with string arguments in a single string or with input/output/PipelineParameter in a list.\n",
"- **source_directory:** A folder containing the script and other resources used in the step.\n",
"- **compute_target:** Compute target to use \n",
"- **allow_reuse:** Whether the step should reuse previous results when run with the same settings/inputs. If this is false, a new run will always be generated for this step during pipeline execution.\n",
"- **version:** Optional version tag to denote a change in functionality for the step\n",
"\n",
"> Estimator object initialization involves specifying a list of data input and output.\n",
" In Pipelines, a step can take another step's output as input. So when creating an EstimatorStep.\n",
" \n",
"> The best practice is to use separate folders for scripts and its dependent files for each step and specify that folder as the `source_directory` for the step. This helps reduce the size of the snapshot created for the step (only the specific folder is snapshotted). Since changes in any files in the `source_directory` would trigger a re-upload of the snapshot, this helps keep the reuse of the step when there are no changes in the `source_directory` of the step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"datareference-remarks-sample"
]
},
"outputs": [],
"cell_type": "markdown",
"metadata": {},
"source": [
"from azureml.core import Datastore\n",
"\n",
"def_blob_store = Datastore(ws, \"workspaceblobstore\")\n",
"\n",
"#upload input data to workspaceblobstore\n",
"def_blob_store.upload_files(files=['20news.pkl'], target_path='20newsgroups', overwrite=True)"
"First define the environment that you want to step to run in. This example users a curated TensorFlow environment, but in practice you can configure any environment you want."
]
},
{
@@ -177,46 +159,46 @@
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Dataset\n",
"from azureml.data import OutputFileDatasetConfig\n",
"from azureml.core import Environment\n",
"\n",
"# create dataset to be used as the input to estimator step\n",
"input_data = Dataset.File.from_files(def_blob_store.path('20newsgroups/20news.pkl'))\n",
"\n",
"# OutputFileDatasetConfig by default write output to the default workspaceblobstore\n",
"output = OutputFileDatasetConfig()\n",
"\n",
"source_directory = 'estimator_train'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.train.estimator import Estimator\n",
"\n",
"est = Estimator(source_directory=source_directory, \n",
" compute_target=cpu_cluster, \n",
" entry_script='dummy_train.py', \n",
" conda_packages=['scikit-learn'])"
"tf_env = Environment.get(ws, name='AzureML-TensorFlow-2.3-GPU')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create an EstimatorStep\n",
"[EstimatorStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.estimator_step.estimatorstep?view=azure-ml-py) adds a step to run Estimator in a Pipeline.\n",
"This example will first create a ScriptRunConfig object that configures the training job. Since we are running a distributed job, specify the `distributed_job_config` parameter. If you are just running a single-node job, omit that parameter.\n",
"\n",
"- **name:** Name of the step\n",
"- **estimator:** Estimator object\n",
"- **estimator_entry_script_arguments:** A list of command-line arguments\n",
"- **runconfig_pipeline_params:** Override runconfig properties at runtime using key-value pairs each with name of the runconfig property and PipelineParameter for that property\n",
"- **compute_target:** Compute target to use \n",
"- **allow_reuse:** Whether the step should reuse previous results when run with the same settings/inputs. If this is false, a new run will always be generated for this step during pipeline execution.\n",
"- **version:** Optional version tag to denote a change in functionality for the step"
"> If you have an input dataset you want to use in this step, you can specify that as part of the command. For example, if you have a FileDataset object called `dataset` and a `--data-dir` script argument, you can do the following: `command=['python train.py --epochs 30 --data-dir', dataset.as_mount()]`.\n",
"\n",
"> For detailed guidance on how to move data in pipelines for input and output data, see the documentation [Moving data into and between ML pipelines](https://docs.microsoft.com/azure/machine-learning/how-to-move-data-in-out-of-pipelines)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import ScriptRunConfig\n",
"from azureml.core.runconfig import MpiConfiguration\n",
"\n",
"src_dir = 'commandstep_train'\n",
"distr_config = MpiConfiguration(node_count=2) # you can also specify the process_count_per_node parameter for multi-process-per-node training\n",
"\n",
"src = ScriptRunConfig(source_directory=src_dir,\n",
" command=['python train.py --epochs 30'],\n",
" compute_target=gpu_cluster,\n",
" environment=tf_env,\n",
" distributed_job_config=distr_config)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now create a CommandStep and pass in the ScriptRunConfig object to the `runconfig` parameter."
]
},
{
@@ -229,20 +211,16 @@
},
"outputs": [],
"source": [
"from azureml.pipeline.steps import EstimatorStep\n",
"from azureml.pipeline.steps import CommandStep\n",
"\n",
"est_step = EstimatorStep(name=\"Estimator_Train\", \n",
" estimator=est, \n",
" estimator_entry_script_arguments=[\"--datadir\", input_data.as_mount(), \"--output\", output],\n",
" runconfig_pipeline_params=None, \n",
" compute_target=cpu_cluster)"
"train = CommandStep(name='train-mnist', runconfig=src)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build and Submit the Experiment"
"## Build and Submit the Pipeline"
]
},
{
@@ -253,8 +231,9 @@
"source": [
"from azureml.pipeline.core import Pipeline\n",
"from azureml.core import Experiment\n",
"pipeline = Pipeline(workspace=ws, steps=[est_step])\n",
"pipeline_run = Experiment(ws, 'Estimator_sample').submit(pipeline)"
"\n",
"pipeline = Pipeline(workspace=ws, steps=[train])\n",
"pipeline_run = Experiment(ws, 'train-commandstep-pipeline').submit(pipeline)"
]
},
{
@@ -295,7 +274,7 @@
"framework": [
"Azure ML"
],
"friendly_name": "Azure Machine Learning Pipeline with EstimatorStep",
"friendly_name": "Azure Machine Learning Pipeline with CommandStep",
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
@@ -311,7 +290,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
"version": "3.7.7"
},
"order_index": 7,
"star_tag": [
@@ -320,7 +299,7 @@
"tags": [
"None"
],
"task": "Demonstrates the use of EstimatorStep"
"task": "Demonstrates the use of CommandStep"
},
"nbformat": 4,
"nbformat_minor": 2

View File

@@ -1,4 +1,4 @@
name: aml-pipelines-how-to-use-estimatorstep
name: aml-pipelines-with-commandstep
dependencies:
- pip:
  - azureml-sdk

View File

@@ -428,7 +428,7 @@
"metadata": {},
"outputs": [],
"source": [
"pipeline_run1 = Experiment(ws, 'Data_dependency').submit(pipeline1)\n",
"pipeline_run1 = Experiment(ws, 'Data_dependency_sample').submit(pipeline1)\n",
"print(\"Pipeline is submitted for execution\")"
]
},

View File

@@ -0,0 +1,11 @@
FROM rocker/tidyverse:4.0.0-ubuntu18.04

# Install python
RUN apt-get update -qq && \
    apt-get install -y python3

# Create link for python
RUN ln -f /usr/bin/python3 /usr/bin/python

# Install additional R packages
RUN R -e "install.packages(c('optparse'), repos = 'https://cloud.r-project.org/')"

View File

@@ -0,0 +1,34 @@
#' Copyright(c) Microsoft Corporation.
#' Licensed under the MIT license.
library(optparse)

# parse command-line arguments
options <- list(
  make_option(c("-d", "--data_folder")),
  make_option(c("--output_folder"))
)
opt_parser <- OptionParser(option_list = options)
opt <- parse_args(opt_parser)
paste(opt$data_folder)

# load the accidents data and fit a logistic regression model
accidents <- readRDS(file.path(opt$data_folder, "accidents.Rd"))
summary(accidents)

mod <- glm(dead ~ dvcat + seatbelt + frontal + sex + ageOFocc + yearVeh + airbag + occRole,
           family = binomial, data = accidents)
summary(mod)

# evaluate the fitted model on the training data
predictions <- factor(ifelse(predict(mod) > 0.1, "dead", "alive"))
accuracy <- mean(predictions == accidents$dead)
message(paste("Accuracy on training data:", accuracy))

# make the output directory if it does not exist
output_dir <- opt$output_folder
if (!dir.exists(output_dir)) {
  dir.create(output_dir)
}

# save model
model_path <- file.path(output_dir, "model.rds")
saveRDS(mod, file = model_path)
message("Model saved")

View File

@@ -0,0 +1,8 @@
channels:
- conda-forge
dependencies:
- python=3.7
- pip:
  - azureml-defaults
  - tensorflow-gpu==2.3.0
  - horovod==0.19.5

View File

@@ -0,0 +1,120 @@
# Copyright 2019 Uber Technologies, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Script adapted from: https://github.com/horovod/horovod/blob/master/examples/tensorflow2_keras_mnist.py
# ==============================================================================
import tensorflow as tf
import horovod.tensorflow.keras as hvd
import os
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--learning-rate", "-lr", type=float, default=0.001)
parser.add_argument("--epochs", type=int, default=24)
parser.add_argument("--steps-per-epoch", type=int, default=500)
args = parser.parse_args()

# Horovod: initialize Horovod.
hvd.init()

# Horovod: pin GPU to be used to process local rank (one GPU per process)
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

(mnist_images, mnist_labels), _ = tf.keras.datasets.mnist.load_data(
    path="mnist-%d.npz" % hvd.rank()
)

dataset = tf.data.Dataset.from_tensor_slices(
    (
        tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
        tf.cast(mnist_labels, tf.int64),
    )
)
dataset = dataset.repeat().shuffle(10000).batch(128)

mnist_model = tf.keras.Sequential(
    [
        tf.keras.layers.Conv2D(32, [3, 3], activation="relu"),
        tf.keras.layers.Conv2D(64, [3, 3], activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation="softmax"),
    ]
)

# Horovod: adjust learning rate based on number of GPUs.
scaled_lr = args.learning_rate * hvd.size()
opt = tf.optimizers.Adam(scaled_lr)

# Horovod: add Horovod DistributedOptimizer.
opt = hvd.DistributedOptimizer(opt)

# Horovod: Specify `experimental_run_tf_function=False` to ensure TensorFlow
# uses hvd.DistributedOptimizer() to compute gradients.
mnist_model.compile(
    loss=tf.losses.SparseCategoricalCrossentropy(),
    optimizer=opt,
    metrics=["accuracy"],
    experimental_run_tf_function=False,
)

callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # Horovod: average metrics among workers at the end of every epoch.
    #
    # Note: This callback must be in the list before the ReduceLROnPlateau,
    # TensorBoard or other metrics-based callbacks.
    hvd.callbacks.MetricAverageCallback(),
    # Horovod: using `lr = 1.0 * hvd.size()` from the very beginning leads to worse final
    # accuracy. Scale the learning rate `lr = 1.0` ---> `lr = 1.0 * hvd.size()` during
    # the first three epochs. See https://arxiv.org/abs/1706.02677 for details.
    hvd.callbacks.LearningRateWarmupCallback(
        warmup_epochs=3, initial_lr=scaled_lr, verbose=1
    ),
]

# Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
if hvd.rank() == 0:
    output_dir = "./outputs"
    os.makedirs(output_dir, exist_ok=True)
    callbacks.append(
        tf.keras.callbacks.ModelCheckpoint(
            os.path.join(output_dir, "checkpoint-{epoch}.h5")
        )
    )

# Horovod: write logs on worker 0.
verbose = 1 if hvd.rank() == 0 else 0

# Train the model.
# Horovod: adjust number of steps based on number of GPUs.
mnist_model.fit(
    dataset,
    steps_per_epoch=args.steps_per_epoch // hvd.size(),
    callbacks=callbacks,
    epochs=args.epochs,
    verbose=verbose,
)

View File

@@ -1,30 +0,0 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import argparse
import os

print("*********************************************************")
print("Hello Azure ML!")

parser = argparse.ArgumentParser()
parser.add_argument('--datadir', type=str, help="data directory")
parser.add_argument('--output', type=str, help="output")
args = parser.parse_args()

print("Argument 1: %s" % args.datadir)
print("Argument 2: %s" % args.output)

if not (args.output is None):
    os.makedirs(args.output, exist_ok=True)
    print("%s created" % args.output)

try:
    from azureml.core import Run
    run = Run.get_context()
    print("Log Fibonacci numbers.")
    run.log_list('Fibonacci numbers', [0, 1, 1, 2, 3, 5, 8, 13, 21, 34])
    run.complete()
except:
    print("Warning: you need to install Azure ML SDK in order to log metrics.")
print("*********************************************************")

View File

@@ -22,3 +22,6 @@ print("Argument 4: %s" % args.pipeline_param)
if not (args.output_compare is None):
    os.makedirs(args.output_compare, exist_ok=True)
    print("%s created" % args.output_compare)
with open(os.path.join(args.output_compare, 'compare.txt'), 'w') as fw:
    fw.write('here is the compare result')

View File

@@ -19,3 +19,8 @@ print("Argument 2: %s" % args.output_extract)
if not (args.output_extract is None):
    os.makedirs(args.output_extract, exist_ok=True)
    print("%s created" % args.output_extract)
with open(os.path.join(args.input_extract, '20news.pkl'), 'rb') as f:
    content = f.read()
with open(os.path.join(args.output_extract, '20news.pkl'), 'wb') as fw:
    fw.write(content)

View File

@@ -20,3 +20,8 @@ print("Argument 2: %s" % args.output_train)
if not (args.output_train is None):
    os.makedirs(args.output_train, exist_ok=True)
    print("%s created" % args.output_train)
with open(os.path.join(args.input_data, '20news.pkl'), 'rb') as f:
    content = f.read()
with open(os.path.join(args.output_train, '20news.pkl'), 'wb') as fw:
    fw.write(content)

View File

@@ -28,6 +28,7 @@
" 2. Azure CLI Authentication\n",
" 3. Managed Service Identity (MSI) Authentication\n",
" 4. Service Principal Authentication\n",
" 5. Token Authentication\n",
" \n",
"The interactive authentication is suitable for local experimentation on your own computer. Azure CLI authentication is suitable if you are already using Azure CLI for managing Azure resources, and want to sign in only once. The MSI and Service Principal authentication are suitable for automated workflows, for example as part of Azure Devops build."
]
@@ -319,6 +320,66 @@
"See [Register an application with the Microsoft identity platform](https://docs.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app) quickstart for more details about application registrations. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Token Authentication\n",
"\n",
"When token generation and its refresh needs to be outside on AML SDK, we recommend using Token Authentication. It can be used for getting token for AML or ARM audience. Thus giving more granular control over token generated.\n",
"\n",
"This authentication class requires users to provide method `get_token_for_audience` which will be called to retrieve the token based on the audience passed.\n",
"\n",
"Audience that is passed to `get_token_for_audience` can be ARM or AML. Exact value that will be passed as audience will depend on cloud and type for audience."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.authentication import TokenAuthentication, Audience\n",
"\n",
"# This is a sample method to retrieve token and will be passed to TokenAuthentication\n",
"def get_token_for_audience(audience):\n",
" from adal import AuthenticationContext\n",
" client_id = \"my-client-id\"\n",
" client_secret = \"my-client-secret\"\n",
" tenant_id = \"my-tenant-id\"\n",
" auth_context = AuthenticationContext(\"https://login.microsoftonline.com/{}\".format(tenant_id))\n",
" resp = auth_context.acquire_token_with_client_credentials(audience,client_id,client_secret)\n",
" token = resp[\"accessToken\"]\n",
" return token\n",
"\n",
"\n",
"token_auth = TokenAuthentication(get_token_for_audience=get_token_for_audience)\n",
"\n",
"ws = Workspace(\n",
" subscription_id=\"my-subscription-id\",\n",
" resource_group=\"my-ml-rg\",\n",
" workspace_name=\"my-ml-workspace\",\n",
" auth=token_auth\n",
" )\n",
"\n",
"print(\"Found workspace {} at location {}\".format(ws.name, ws.location))\n",
"\n",
"token_aml_audience = token_auth.get_token(Audience.aml)\n",
"token_arm_audience = token_auth.get_token(Audience.arm)\n",
"\n",
"# Value of audience pass to `get_token_for_audience` can be retrieved as follows:\n",
"# aud_aml_val = token_auth.get_aml_resource_id() # For AML\n",
"# aud_arm_val = token_auth._cloud_type.endpoints.active_directory_resource_id # For ARM\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Token authentication object can be used to retrieve token for either AML or ARM audience,\n",
"which can be used by other clients to authenticate to AML or ARM"
]
},
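{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch: hand the ARM-audience token retrieved above to another HTTP client.\n",
"# This assumes token_arm_audience holds the raw bearer token string.\n",
"import requests\n",
"\n",
"headers = {\"Authorization\": \"Bearer {}\".format(token_arm_audience)}\n",
"resp = requests.get(\"https://management.azure.com/subscriptions?api-version=2020-01-01\",\n",
"                    headers=headers)\n",
"print(resp.status_code)"
]
},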
{
"cell_type": "markdown",
"metadata": {},
@@ -350,7 +411,7 @@
},
"outputs": [],
"source": [
"import os, uuid\n",
"import uuid\n",
"\n",
"local_secret = os.environ.get(\"LOCAL_SECRET\", default = str(uuid.uuid4())) # Use random UUID as a substitute for real secret.\n",
"keyvault = ws.get_default_keyvault()\n",

View File

@@ -4,6 +4,8 @@ import os
import numpy as np
from utils import download_mnist
import chainer
from chainer import backend
from chainer import backends
@@ -17,6 +19,7 @@ from chainer.training import extensions
from chainer.dataset import concat_examples
from chainer.backends.cuda import to_cpu
from azureml.core.run import Run
run = Run.get_context()
@@ -49,7 +52,7 @@ def main():
    args = parser.parse_args()
    # Download the MNIST data if you haven't downloaded it yet
    train, test = datasets.mnist.get_mnist(withlabel=True, ndim=1)
    train, test = download_mnist()
    gpu_id = args.gpu_id
    batchsize = args.batchsize

View File

@@ -2,6 +2,8 @@ import numpy as np
import os
import json
from utils import download_mnist
from chainer import serializers, using_config, Variable, datasets
import chainer.functions as F
import chainer.links as L
@@ -41,7 +43,7 @@ def init():
def run(input_data):
    i = np.array(json.loads(input_data)['data'])
    _, test = datasets.get_mnist()
    _, test = download_mnist()
    x = Variable(np.asarray([test[i][0]]))
    y = model(x)

View File

@@ -217,7 +217,8 @@
"import shutil\n",
"\n",
"shutil.copy('chainer_mnist.py', project_folder)\n",
"shutil.copy('chainer_score.py', project_folder)"
"shutil.copy('chainer_score.py', project_folder)\n",
"shutil.copy('utils.py', project_folder)"
]
},
{
@@ -263,6 +264,7 @@
"- python=3.6.2\n",
"- pip:\n",
" - azureml-defaults\n",
" - azureml-opendatasets\n",
" - chainer==5.1.0\n",
" - cupy-cuda90==5.1.0\n",
" - mpi4py==3.0.0\n",
@@ -557,6 +559,7 @@
"cd.add_conda_package('numpy')\n",
"cd.add_pip_package('chainer==5.1.0')\n",
"cd.add_pip_package(\"azureml-defaults\")\n",
"cd.add_pip_package(\"azureml-opendatasets\")\n",
"cd.save_to_file(base_directory='./', conda_file_path='myenv.yml')\n",
"\n",
"print(cd.serialize_to_string())"
@@ -584,7 +587,8 @@
"\n",
"\n",
"myenv = Environment.from_conda_specification(name=\"myenv\", file_path=\"myenv.yml\")\n",
"inference_config = InferenceConfig(entry_script=\"chainer_score.py\", environment=myenv)\n",
"inference_config = InferenceConfig(entry_script=\"chainer_score.py\", environment=myenv,\n",
" source_directory=project_folder)\n",
"\n",
"aciconfig = AciWebservice.deploy_configuration(cpu_cores=1,\n",
" auth_enabled=True, # this flag generates API keys to secure access\n",
@@ -592,11 +596,11 @@
" tags={'name': 'mnist', 'framework': 'Chainer'},\n",
" description='Chainer DNN with MNIST')\n",
"\n",
"service = Model.deploy(workspace=ws, \n",
" name='chainer-mnist-1', \n",
" models=[model], \n",
" inference_config=inference_config, \n",
" deployment_config=aciconfig)\n",
"service = Model.deploy(workspace=ws,\n",
" name='chainer-mnist-1',\n",
" models=[model],\n",
" inference_config=inference_config,\n",
" deployment_config=aciconfig)\n",
"service.wait_for_deployment(True)\n",
"print(service.state)\n",
"print(service.scoring_uri)"
@@ -685,13 +689,16 @@
" res = res.reshape(n_items[0], 1)\n",
" return res\n",
"\n",
"os.makedirs('./data/mnist', exist_ok=True)\n",
"urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', filename = './data/mnist/test-images.gz')\n",
"urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', filename = './data/mnist/test-labels.gz')\n",
"data_folder = os.path.join(os.getcwd(), 'data/mnist')\n",
"os.makedirs(data_folder, exist_ok=True)\n",
"\n",
"X_test = load_data('./data/mnist/test-images.gz', False)\n",
"y_test = load_data('./data/mnist/test-labels.gz', True).reshape(-1)\n",
"urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/t10k-images-idx3-ubyte.gz',\n",
" filename=os.path.join(data_folder, 't10k-images-idx3-ubyte.gz'))\n",
"urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/t10k-labels-idx1-ubyte.gz',\n",
" filename=os.path.join(data_folder, 't10k-labels-idx1-ubyte.gz'))\n",
"\n",
"X_test = load_data(os.path.join(data_folder, 't10k-images-idx3-ubyte.gz'), False) / np.float32(255.0)\n",
"y_test = load_data(os.path.join(data_folder, 't10k-labels-idx1-ubyte.gz'), True).reshape(-1)\n",
"\n",
"# send a random row from the test set to score\n",
"random_index = np.random.randint(0, len(X_test)-1)\n",

View File

@@ -10,3 +10,4 @@ dependencies:
- gzip
- struct
- requests
- azureml-opendatasets

View File

@@ -0,0 +1,50 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import glob
import gzip
import numpy as np
import os
import struct

from azureml.opendatasets import MNIST
from chainer.datasets import tuple_dataset


# load compressed MNIST gz files and return numpy arrays
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        struct.unpack('I', gz.read(4))  # skip the magic number
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return res


# download MNIST via Azure Open Datasets and return Chainer tuple datasets
def download_mnist():
    data_folder = os.path.join(os.getcwd(), 'data/mnist')
    os.makedirs(data_folder, exist_ok=True)

    mnist_file_dataset = MNIST.get_file_dataset()
    mnist_file_dataset.download(data_folder, overwrite=True)

    # scale pixel intensities to [0, 1] and flatten the label arrays
    X_train = load_data(glob.glob(os.path.join(data_folder, "**/train-images-idx3-ubyte.gz"),
                                  recursive=True)[0], False) / 255.0
    X_test = load_data(glob.glob(os.path.join(data_folder, "**/t10k-images-idx3-ubyte.gz"),
                                 recursive=True)[0], False) / 255.0
    y_train = load_data(glob.glob(os.path.join(data_folder, "**/train-labels-idx1-ubyte.gz"),
                                  recursive=True)[0], True).reshape(-1)
    y_test = load_data(glob.glob(os.path.join(data_folder, "**/t10k-labels-idx1-ubyte.gz"),
                                 recursive=True)[0], True).reshape(-1)

    train = tuple_dataset.TupleDataset(X_train.astype(np.float32), y_train.astype(np.int32))
    test = tuple_dataset.TupleDataset(X_test.astype(np.float32), y_test.astype(np.int32))
    return train, test

View File

@@ -21,7 +21,8 @@
"metadata": {},
"source": [
"# Distributed PyTorch with DistributedDataParallel\n",
"In this tutorial, you will train a PyTorch model on the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset using distributed training with PyTorch's `DistributedDataParallel` module across a GPU cluster. "
"\n",
"In this tutorial, you will train a PyTorch model on the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset using distributed training with PyTorch's `DistributedDataParallel` module across a GPU cluster."
]
},
{
@@ -113,7 +114,7 @@
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# choose a name for your cluster\n",
"cluster_name = \"gpu-cluster\"\n",
"cluster_name = 'gpu-cluster'\n",
"\n",
"try:\n",
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
@@ -139,6 +140,68 @@
"The above code creates GPU compute. If you instead want to create CPU compute, provide a different VM size to the `vm_size` parameter, such as `STANDARD_D2_V2`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare dataset\n",
"\n",
"Prepare the dataset used for training. We will first download and extract the publicly available CIFAR-10 dataset from the cs.toronto.edu website and then create an Azure ML FileDataset to use the data for training."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download and extract CIFAR-10 data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib\n",
"import tarfile\n",
"import os\n",
"\n",
"url = 'https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'\n",
"filename = 'cifar-10-python.tar.gz'\n",
"data_root = 'cifar-10'\n",
"filepath = os.path.join(data_root, filename)\n",
"\n",
"if not os.path.isdir(data_root):\n",
" os.makedirs(data_root, exist_ok=True)\n",
" urllib.request.urlretrieve(url, filepath)\n",
" with tarfile.open(filepath, \"r:gz\") as tar:\n",
" tar.extractall(path=data_root)\n",
" os.remove(filepath) # delete tar.gz file after extraction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Azure ML dataset\n",
"\n",
"The `upload_directory` method will upload the data to a datastore and create a FileDataset from it. In this tutorial we will use the workspace's default datastore."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Dataset\n",
"\n",
"datastore = ws.get_default_datastore()\n",
"dataset = Dataset.File.upload_directory(\n",
" src_dir=data_root, target=(datastore, data_root)\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -161,8 +224,6 @@
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"project_folder = './pytorch-distr'\n",
"os.makedirs(project_folder, exist_ok=True)"
]
@@ -172,26 +233,14 @@
"metadata": {},
"source": [
"### Prepare training script\n",
"Now you will need to create your training script. In this tutorial, the script for distributed training of MNIST is already provided for you at `pytorch_mnist.py`. In practice, you should be able to take any custom PyTorch training script as is and run it with Azure ML without having to modify your code.\n",
"\n",
"However, if you would like to use Azure ML's [metric logging](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#logging) capabilities, you will have to add a small amount of Azure ML logic inside your training script. In this example, at each logging interval, we will log the loss for that minibatch to our Azure ML run.\n",
"\n",
"To do so, in `pytorch_mnist.py`, we will first access the Azure ML `Run` object within the script:\n",
"```Python\n",
"from azureml.core.run import Run\n",
"run = Run.get_context()\n",
"```\n",
"Later within the script, we log the loss metric to our run:\n",
"```Python\n",
"run.log('loss', losses.avg)\n",
"```"
"Now you will need to create your training script. In this tutorial, the script for distributed training on CIFAR-10 is already provided for you at `train.py`. In practice, you should be able to take any custom PyTorch training script as is and run it with Azure ML without having to modify your code."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once your script is ready, copy the training script `pytorch_mnist.py` into the project directory."
"Once your script is ready, copy the training script `train.py` into the project directory."
]
},
{
@@ -202,7 +251,7 @@
"source": [
"import shutil\n",
"\n",
"shutil.copy('pytorch_mnist.py', project_folder)"
"shutil.copy('train.py', project_folder)"
]
},
{
@@ -231,26 +280,7 @@
"source": [
"### Create an environment\n",
"\n",
"Define a conda environment YAML file with your training script dependencies and create an Azure ML environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile conda_dependencies.yml\n",
"\n",
"channels:\n",
"- conda-forge\n",
"dependencies:\n",
"- python=3.6.2\n",
"- pip:\n",
" - azureml-defaults\n",
" - torch==1.6.0\n",
" - torchvision==0.7.0\n",
" - future==0.17.1"
"In this tutorial, we will use one of Azure ML's curated PyTorch environments for training. [Curated environments](https://docs.microsoft.com/azure/machine-learning/how-to-use-environments#use-a-curated-environment) are available in your workspace by default. Specifically, we will use the PyTorch 1.6 GPU curated environment."
]
},
{
@@ -261,24 +291,39 @@
"source": [
"from azureml.core import Environment\n",
"\n",
"pytorch_env = Environment.from_conda_specification(name = 'pytorch-1.6-gpu', file_path = './conda_dependencies.yml')\n",
"\n",
"# Specify a GPU base image\n",
"pytorch_env.docker.enabled = True\n",
"pytorch_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'"
"pytorch_env = Environment.get(ws, name='AzureML-PyTorch-1.6-GPU')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure the training job: torch.distributed with NCCL backend\n",
"### Configure the training job\n",
"\n",
"Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on.\n",
"To launch a distributed PyTorch job on Azure ML, you have two options:\n",
"\n",
"In order to run a distributed PyTorch job with **torch.distributed** using the NCCL backend, create a `PyTorchConfiguration` and pass it to the `distributed_job_config` parameter of the ScriptRunConfig constructor. Specify `communication_backend='Nccl'` in the PyTorchConfiguration. The below code will configure a 2-node distributed job. The NCCL backend is the recommended backend for PyTorch distributed GPU training.\n",
"1. Per-process launch - specify the total # of worker processes (typically one per GPU) you want to run, and\n",
"Azure ML will handle launching each process.\n",
"2. Per-node launch with [torch.distributed.launch](https://pytorch.org/docs/stable/distributed.html#launch-utility) - provide the `torch.distributed.launch` command you want to\n",
"run on each node.\n",
"\n",
"The script arguments refers to the Azure ML-set environment variables `AZ_BATCHAI_PYTORCH_INIT_METHOD` for shared file-system initialization and `AZ_BATCHAI_TASK_INDEX` for the global rank of each worker process."
"For more information, see the [documentation](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-pytorch#distributeddataparallel).\n",
"\n",
"Both options are shown below."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Per-process launch\n",
"\n",
"To use the per-process launch option in which Azure ML will handle launching each of the processes to run your training script,\n",
"\n",
"1. Specify the training script and arguments\n",
"2. Create a `PyTorchConfiguration` and specify `node_count` and `process_count`. The `process_count` is the total number of processes you want to run for the job; this should typically equal the # of GPUs available on each node multiplied by the # of nodes. Since this tutorial uses the `STANDARD_NC6` SKU, which has one GPU, the total process count for a 2-node job is `2`. If you are using a SKU with >1 GPUs, adjust the `process_count` accordingly.\n",
"\n",
"Azure ML will set the `MASTER_ADDR`, `MASTER_PORT`, `NODE_RANK`, `WORLD_SIZE` environment variables on each node, in addition to the process-level `RANK` and `LOCAL_RANK` environment variables, that are needed for distributed PyTorch training."
]
},
{
@@ -290,17 +335,61 @@
"from azureml.core import ScriptRunConfig\n",
"from azureml.core.runconfig import PyTorchConfiguration\n",
"\n",
"args = ['--dist-backend', 'nccl',\n",
" '--dist-url', '$AZ_BATCHAI_PYTORCH_INIT_METHOD',\n",
" '--rank', '$AZ_BATCHAI_TASK_INDEX',\n",
" '--world-size', 2]\n",
"# create distributed config\n",
"distr_config = PyTorchConfiguration(process_count=2, node_count=2)\n",
"\n",
"# create args\n",
"args = [\"--data-dir\", dataset.as_download(), \"--epochs\", 25]\n",
"\n",
"# create job config\n",
"src = ScriptRunConfig(source_directory=project_folder,\n",
" script='pytorch_mnist.py',\n",
" script='train.py',\n",
" arguments=args,\n",
" compute_target=compute_target,\n",
" environment=pytorch_env,\n",
" distributed_job_config=PyTorchConfiguration(communication_backend='Nccl', node_count=2))"
" distributed_job_config=distr_config)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Per-node launch with `torch.distributed.launch`\n",
"\n",
"If you would instead like to use the PyTorch-provided launch utility `torch.distributed.launch` to handle launching the worker processes on each node, you can do so as well. \n",
"\n",
"1. Provide the launch command to the `command` parameter of ScriptRunConfig. For PyTorch jobs Azure ML will set the `MASTER_ADDR`, `MASTER_PORT`, and `NODE_RANK` environment variables on each node, so you can simply just reference those environment variables in your command. If you are using a SKU with >1 GPUs, adjust the `--nproc_per_node` argument accordingly.\n",
"\n",
"2. Create a `PyTorchConfiguration` and specify the `node_count`. You do not need to specify the `process_count`; by default Azure ML will launch one process per node to run the `command` you provided.\n",
"\n",
"Uncomment the code below to configure a job with this method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'''\n",
"from azureml.core import ScriptRunConfig\n",
"from azureml.core.runconfig import PyTorchConfiguration\n",
"\n",
"# create distributed config\n",
"distr_config = PyTorchConfiguration(node_count=2)\n",
"\n",
"# define command\n",
"launch_cmd = [\"python -m torch.distributed.launch --nproc_per_node 1 --nnodes 2 \" \\\n",
" \"--node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT --use_env \" \\\n",
" \"train.py --data-dir\", dataset.as_download(), \"--epochs 25\"]\n",
"\n",
"# create job config\n",
"src = ScriptRunConfig(source_directory=project_folder,\n",
" command=launch_cmd,\n",
" compute_target=compute_target,\n",
" environment=pytorch_env,\n",
" distributed_job_config=distr_config)\n",
"'''"
]
},
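{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration, on the first node of this 2-node job the command above resolves to something like the following (the master address and port shown here are hypothetical placeholders; Azure ML substitutes the real values at runtime):\n",
"\n",
"```\n",
"python -m torch.distributed.launch --nproc_per_node 1 --nnodes 2 --node_rank 0 \\\n",
"    --master_addr 10.0.0.4 --master_port 6105 --use_env train.py --data-dir <mounted-data-path> --epochs 25\n",
"```"
]
},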
{
@@ -308,7 +397,7 @@
"metadata": {},
"source": [
"### Submit job\n",
"Run your experiment by submitting your ScriptRunConfig object. Note that this call is asynchronous."
"Run your experiment by submitting your `ScriptRunConfig` object. Note that this call is asynchronous."
]
},
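{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal submission sketch (this assumes `ws` is the `Workspace` object created earlier in the notebook; the experiment name is illustrative):\n",
"\n",
"```python\n",
"from azureml.core import Experiment\n",
"\n",
"run = Experiment(ws, name=\"pytorch-distr-cifar\").submit(src)\n",
"```"
]
},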
{
@@ -355,50 +444,12 @@
"source": [
"run.wait_for_completion(show_output=True) # this provides a verbose log"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure training job: torch.distributed with Gloo backend\n",
"\n",
"If you would instead like to use the Gloo backend for distributed training, you can do so via the following code. The Gloo backend is recommended for distributed CPU training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import ScriptRunConfig\n",
"from azureml.core.runconfig import PyTorchConfiguration\n",
"\n",
"args = ['--dist-backend', 'gloo',\n",
" '--dist-url', '$AZ_BATCHAI_PYTORCH_INIT_METHOD',\n",
" '--rank', '$AZ_BATCHAI_TASK_INDEX',\n",
" '--world-size', 2]\n",
"\n",
"src = ScriptRunConfig(source_directory=project_folder,\n",
" script='pytorch_mnist.py',\n",
" arguments=args,\n",
" compute_target=compute_target,\n",
" environment=pytorch_env,\n",
" distributed_job_config=PyTorchConfiguration(communication_backend='Gloo', node_count=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you create the ScriptRunConfig, you can follow the submit steps as shown in the previous steps to submit a PyTorch distributed run using the Gloo backend."
]
}
],
"metadata": {
"authors": [
{
"name": "ninhu"
"name": "minxia"
}
],
"category": "training",
@@ -406,7 +457,7 @@
"AML Compute"
],
"datasets": [
"MNIST"
"CIFAR-10"
],
"deployment": [
"None"
@@ -432,12 +483,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.7.7"
},
"tags": [
"None"
],
"task": "Train a model using distributed training via Nccl/Gloo"
"task": "Train a model using distributed training via PyTorch DistributedDataParallel"
},
"nbformat": 4,
"nbformat_minor": 2


@@ -0,0 +1,5 @@
name: distributed-pytorch-with-distributeddataparallel
dependencies:
- pip:
  - azureml-sdk
  - azureml-widgets
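
This spec lists the Azure ML SDK packages the notebook needs. As an aside, a conda specification like this can also be turned into an Azure ML environment; a minimal sketch, assuming the spec is saved as env.yml (both the file name and the environment name below are illustrative):

from azureml.core import Environment

# build an Azure ML Environment object from a conda specification file
pytorch_env = Environment.from_conda_specification(name="pytorch-ddp-env", file_path="env.yml")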


@@ -0,0 +1,238 @@
# Copyright (c) 2017 Facebook, Inc. All rights reserved.
# BSD 3-Clause License
#
# Script adapted from:
# https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
# ==============================================================================
# imports
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import os
import argparse


# define network architecture
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv1 = nn.Conv2d(3, 32, 3)
self.pool = nn.MaxPool2d(2, 2)
self.conv2 = nn.Conv2d(32, 64, 3)
self.conv3 = nn.Conv2d(64, 128, 3)
self.fc1 = nn.Linear(128 * 6 * 6, 120)
self.dropout = nn.Dropout(p=0.2)
self.fc2 = nn.Linear(120, 84)
self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
x = F.relu(self.conv1(x))
x = self.pool(F.relu(self.conv2(x)))
x = self.pool(F.relu(self.conv3(x)))
x = x.view(-1, 128 * 6 * 6)
x = self.dropout(F.relu(self.fc1(x)))
x = F.relu(self.fc2(x))
x = self.fc3(x)
return x


def train(train_loader, model, criterion, optimizer, epoch, device, print_freq, rank):
running_loss = 0.0
for i, data in enumerate(train_loader, 0):
# get the inputs; data is a list of [inputs, labels]
inputs, labels = data[0].to(device), data[1].to(device)
# zero the parameter gradients
optimizer.zero_grad()
# forward + backward + optimize
outputs = model(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
# print statistics
running_loss += loss.item()
if i % print_freq == 0: # print every print_freq mini-batches
print(
"Rank %d: [%d, %5d] loss: %.3f"
% (rank, epoch + 1, i + 1, running_loss / print_freq)
)
running_loss = 0.0


def evaluate(test_loader, model, device):
classes = (
"plane",
"car",
"bird",
"cat",
"deer",
"dog",
"frog",
"horse",
"ship",
"truck",
)
model.eval()
correct = 0
total = 0
class_correct = list(0.0 for i in range(10))
class_total = list(0.0 for i in range(10))
with torch.no_grad():
for data in test_loader:
images, labels = data[0].to(device), data[1].to(device)
outputs = model(images)
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
c = (predicted == labels).squeeze()
            for i in range(labels.size(0)):  # iterate over the whole batch, not a fixed count of 10
label = labels[i]
class_correct[label] += c[i].item()
class_total[label] += 1
# print total test set accuracy
print(
"Accuracy of the network on the 10000 test images: %d %%"
% (100 * correct / total)
)
# print test accuracy for each of the classes
for i in range(10):
print(
"Accuracy of %5s : %2d %%"
% (classes[i], 100 * class_correct[i] / class_total[i])
)


def main(args):
# get PyTorch environment variables
world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
distributed = world_size > 1
# set device
if distributed:
device = torch.device("cuda", local_rank)
else:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# initialize distributed process group using default env:// method
if distributed:
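        # the default env:// method reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment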
torch.distributed.init_process_group(backend="nccl")
# define train and test dataset DataLoaders
transform = transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)
train_set = torchvision.datasets.CIFAR10(
root=args.data_dir, train=True, download=False, transform=transform
)
if distributed:
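        # DistributedSampler partitions the training set so each process works on its own shard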
train_sampler = torch.utils.data.distributed.DistributedSampler(train_set)
else:
train_sampler = None
train_loader = torch.utils.data.DataLoader(
train_set,
batch_size=args.batch_size,
shuffle=(train_sampler is None),
num_workers=args.workers,
sampler=train_sampler,
)
test_set = torchvision.datasets.CIFAR10(
root=args.data_dir, train=False, download=False, transform=transform
)
test_loader = torch.utils.data.DataLoader(
test_set, batch_size=args.batch_size, shuffle=False, num_workers=args.workers
)
model = Net().to(device)
# wrap model with DDP
if distributed:
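        # DDP averages gradients across all processes during backward()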
model = nn.parallel.DistributedDataParallel(
model, device_ids=[local_rank], output_device=local_rank
)
# define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(
model.parameters(), lr=args.learning_rate, momentum=args.momentum
)
# train the model
for epoch in range(args.epochs):
print("Rank %d: Starting epoch %d" % (rank, epoch))
if distributed:
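            # re-seed the sampler so each epoch uses a different shuffling across processes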
train_sampler.set_epoch(epoch)
model.train()
train(
train_loader,
model,
criterion,
optimizer,
epoch,
device,
args.print_freq,
rank,
)
print("Rank %d: Finished Training" % (rank))
if not distributed or rank == 0:
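        # save the checkpoint and run evaluation on rank 0 only, to avoid redundant work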
os.makedirs(args.output_dir, exist_ok=True)
model_path = os.path.join(args.output_dir, "cifar_net.pt")
torch.save(model.state_dict(), model_path)
# evaluate on full test dataset
evaluate(test_loader, model, device)


if __name__ == "__main__":
# setup argparse
parser = argparse.ArgumentParser()
parser.add_argument(
"--data-dir", type=str, help="directory containing CIFAR-10 dataset"
)
parser.add_argument("--epochs", default=10, type=int, help="number of epochs")
parser.add_argument(
"--batch-size",
default=16,
type=int,
help="mini batch size for each gpu/process",
)
parser.add_argument(
"--workers",
default=2,
type=int,
help="number of data loading workers for each gpu/process",
)
parser.add_argument(
"--learning-rate", default=0.001, type=float, help="learning rate"
)
parser.add_argument("--momentum", default=0.9, type=float, help="momentum")
parser.add_argument(
"--output-dir", default="outputs", type=str, help="directory to save model to"
)
parser.add_argument(
"--print-freq",
default=200,
type=int,
help="frequency of printing training statistics",
)
args = parser.parse_args()
main(args)
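
Note that the script reads WORLD_SIZE, RANK, and LOCAL_RANK unconditionally, so they must be set even for a single-process run, and the CIFAR-10 loaders use download=False, so the data must already exist under --data-dir. A minimal local smoke test might look like this (a sketch; the helper script and the ./data path are assumptions, not part of the sample):

# smoke_test.py (hypothetical helper)
import os
import subprocess

# single-process values for the environment variables train.py expects
env = dict(os.environ, WORLD_SIZE="1", RANK="0", LOCAL_RANK="0")
subprocess.run(["python", "train.py", "--data-dir", "./data", "--epochs", "1"],
               env=env, check=True)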


@@ -51,6 +51,17 @@ if args.cuda:
kwargs = {}
# Use Azure Open Datasets for MNIST dataset
datasets.MNIST.resources = [
("https://azureopendatastorage.azurefd.net/mnist/train-images-idx3-ubyte.gz",
"f68b3c2dcbeaaa9fbdd348bbdeb94873"),
("https://azureopendatastorage.azurefd.net/mnist/train-labels-idx1-ubyte.gz",
"d53e105ee54ea40749a09fcbcd1e9432"),
("https://azureopendatastorage.azurefd.net/mnist/t10k-images-idx3-ubyte.gz",
"9fb629c4189551a2d022fa330f9573f3"),
("https://azureopendatastorage.azurefd.net/mnist/t10k-labels-idx1-ubyte.gz",
"ec29112dd5afa0611ce80d1b7f02629c")
]
train_dataset = \
datasets.MNIST('data-%d' % hvd.rank(), train=True, download=True,
transform=transforms.Compose([


@@ -1,209 +0,0 @@
# Copyright (c) 2017, PyTorch contributors
# Modifications copyright (C) Microsoft Corporation
# Licensed under the BSD license
# Adapted from https://github.com/Azure/BatchAI/tree/master/recipes/PyTorch/PyTorch-GPU-Distributed-Gloo
from __future__ import print_function
import argparse
import os
import shutil
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.utils.data
import torch.utils.data.distributed
import torchvision.models as models
from azureml.core.run import Run
# get the Azure ML run object
run = Run.get_context()
# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
parser.add_argument('--batch-size', type=int, default=64, metavar='N',
help='input batch size for training (default: 64)')
parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
help='input batch size for testing (default: 1000)')
parser.add_argument('--epochs', type=int, default=10, metavar='N',
help='number of epochs to train (default: 10)')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
help='learning rate (default: 0.01)')
parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
help='SGD momentum (default: 0.5)')
parser.add_argument('--seed', type=int, default=1, metavar='S',
help='random seed (default: 1)')
parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
help='number of data loading workers (default: 4)')
parser.add_argument('--log-interval', type=int, default=10, metavar='N',
help='how many batches to wait before logging training status')
parser.add_argument('--weight-decay', '--wd', default=1e-4, type=float,
metavar='W', help='weight decay (default: 1e-4)')
parser.add_argument('--world-size', default=1, type=int,
help='number of distributed processes')
parser.add_argument('--dist-url', type=str,
help='url used to set up distributed training')
parser.add_argument('--dist-backend', default='nccl', type=str,
help='distributed backend')
parser.add_argument('--rank', default=-1, type=int,
help='rank of the worker')
best_prec1 = 0
args = parser.parse_args()
args.distributed = args.world_size >= 2
if args.distributed:
dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
world_size=args.world_size, rank=args.rank)
train_dataset = datasets.MNIST('data-%d' % args.rank, train=True, download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
]))
if args.distributed:
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
else:
train_sampler = None
train_loader = torch.utils.data.DataLoader(
train_dataset,
batch_size=args.batch_size, shuffle=(train_sampler is None),
num_workers=args.workers, pin_memory=True, sampler=train_sampler)
test_loader = torch.utils.data.DataLoader(
train_dataset,
batch_size=args.batch_size, shuffle=False,
num_workers=args.workers, pin_memory=True)
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
self.conv2_drop = nn.Dropout2d()
self.fc1 = nn.Linear(320, 50)
self.fc2 = nn.Linear(50, 10)
def forward(self, x):
x = F.relu(F.max_pool2d(self.conv1(x), 2))
x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
x = x.view(-1, 320)
x = F.relu(self.fc1(x))
x = F.dropout(x, training=self.training)
x = self.fc2(x)
return F.log_softmax(x)
model = Net()
if not args.distributed:
model = torch.nn.DataParallel(model).cuda()
else:
model.cuda()
model = torch.nn.parallel.DistributedDataParallel(model)
# define loss function (criterion) and optimizer
criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), args.lr, momentum=args.momentum, weight_decay=args.weight_decay)
def train(epoch):
batch_time = AverageMeter()
data_time = AverageMeter()
losses = AverageMeter()
top1 = AverageMeter()
top5 = AverageMeter()
# switch to train mode
model.train()
end = time.time()
for i, (input, target) in enumerate(train_loader):
# measure data loading time
data_time.update(time.time() - end)
input, target = input.cuda(), target.cuda()
# compute output
try:
output = model(input)
loss = criterion(output, target)
# measure accuracy and record loss
prec1, prec5 = accuracy(output.data, target, topk=(1, 5))
losses.update(loss.item(), input.size(0))
top1.update(prec1[0], input.size(0))
top5.update(prec5[0], input.size(0))
# compute gradient and do SGD step
optimizer.zero_grad()
loss.backward()
optimizer.step()
# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()
if i % 10 == 0:
run.log("loss", losses.avg)
run.log("prec@1", "{0:.3f}".format(top1.avg))
run.log("prec@5", "{0:.3f}".format(top5.avg))
print('Epoch: [{0}][{1}/{2}]\t'
'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
'Data {data_time.val:.3f} ({data_time.avg:.3f})\t'
'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t'
'Prec@5 {top5.val:.3f} ({top5.avg:.3f})'.format(epoch, i, len(train_loader),
batch_time=batch_time, data_time=data_time,
loss=losses, top1=top1, top5=top5))
except:
import sys
print("Unexpected error:", sys.exc_info()[0])
class AverageMeter(object):
"""Computes and stores the average and current value"""
def __init__(self):
self.reset()
def reset(self):
self.val = 0
self.avg = 0
self.sum = 0
self.count = 0
def update(self, val, n=1):
self.val = val
self.sum += val * n
self.count += n
self.avg = self.sum / self.count
def accuracy(output, target, topk=(1,)):
"""Computes the precision@k for the specified values of k"""
maxk = max(topk)
batch_size = target.size(0)
_, pred = output.topk(maxk, 1, True, True)
pred = pred.t()
correct = pred.eq(target.view(1, -1).expand_as(pred))
res = []
for k in topk:
correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
res.append(correct_k.mul_(100.0 / batch_size))
return res
for epoch in range(1, args.epochs + 1):
train(epoch)


@@ -102,6 +102,17 @@ torch.manual_seed(args.seed)
device = torch.device("cuda" if use_cuda else "cpu")
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
# Use Azure Open Datasets for MNIST dataset
datasets.MNIST.resources = [
("https://azureopendatastorage.azurefd.net/mnist/train-images-idx3-ubyte.gz",
"f68b3c2dcbeaaa9fbdd348bbdeb94873"),
("https://azureopendatastorage.azurefd.net/mnist/train-labels-idx1-ubyte.gz",
"d53e105ee54ea40749a09fcbcd1e9432"),
("https://azureopendatastorage.azurefd.net/mnist/t10k-images-idx3-ubyte.gz",
"9fb629c4189551a2d022fa330f9573f3"),
("https://azureopendatastorage.azurefd.net/mnist/t10k-labels-idx1-ubyte.gz",
"ec29112dd5afa0611ce80d1b7f02629c")
]
train_loader = torch.utils.data.DataLoader(
datasets.MNIST('../data', train=True, download=True,
transform=transforms.Compose([


@@ -332,6 +332,18 @@
"import random\n",
"import numpy as np\n",
"\n",
"# Use Azure Open Datasets for MNIST dataset\n",
"datasets.MNIST.resources = [\n",
" (\"https://azureopendatastorage.azurefd.net/mnist/train-images-idx3-ubyte.gz\",\n",
" \"f68b3c2dcbeaaa9fbdd348bbdeb94873\"),\n",
" (\"https://azureopendatastorage.azurefd.net/mnist/train-labels-idx1-ubyte.gz\",\n",
" \"d53e105ee54ea40749a09fcbcd1e9432\"),\n",
" (\"https://azureopendatastorage.azurefd.net/mnist/t10k-images-idx3-ubyte.gz\",\n",
" \"9fb629c4189551a2d022fa330f9573f3\"),\n",
" (\"https://azureopendatastorage.azurefd.net/mnist/t10k-labels-idx1-ubyte.gz\",\n",
" \"ec29112dd5afa0611ce80d1b7f02629c\")\n",
"]\n",
"\n",
"test_data = datasets.MNIST('../data', train=False, transform=transforms.Compose([\n",
" transforms.ToTensor(),\n",
" transforms.Normalize((0.1307,), (0.3081,))]))\n",


@@ -147,7 +147,7 @@
"\n",
"To do this, you first must install the Azure Networking API.\n",
"\n",
"`pip install --upgrade azure-mgmt-network`"
"`pip install --upgrade azure-mgmt-network==12.0.0`"
]
},
{
@@ -157,7 +157,7 @@
"outputs": [],
"source": [
"# If you need to install the Azure Networking SDK, uncomment the following line.\n",
"#!pip install --upgrade azure-mgmt-network"
"#!pip install --upgrade azure-mgmt-network==12.0.0"
]
},
{


@@ -5,3 +5,4 @@ dependencies:
- azureml-contrib-reinforcementlearning
- azureml-widgets
- matplotlib
- azure-mgmt-network==12.0.0


@@ -1,70 +0,0 @@
FROM mcr.microsoft.com/azureml/base:openmpi3.1.2-ubuntu18.04
# Install some basic utilities
RUN apt-get update && apt-get install -y \
curl \
ca-certificates \
sudo \
cpio \
git \
bzip2 \
libx11-6 \
tmux \
htop \
gcc \
xvfb \
python-opengl \
x11-xserver-utils \
ffmpeg \
mesa-utils \
nano \
vim \
rsync \
&& rm -rf /var/lib/apt/lists/*
# Create a working directory
RUN mkdir /app
WORKDIR /app
# Install Minecraft needed libraries
RUN mkdir -p /usr/share/man/man1 && \
sudo apt-get update && \
sudo apt-get install -y \
openjdk-8-jre-headless=8u162-b12-1 \
openjdk-8-jdk-headless=8u162-b12-1 \
openjdk-8-jre=8u162-b12-1 \
openjdk-8-jdk=8u162-b12-1
# Create a Python 3.7 environment
RUN conda install conda-build \
&& conda create -y --name py37 python=3.7.3 \
&& conda clean -ya
ENV CONDA_DEFAULT_ENV=py37
# Install minerl
RUN pip install --upgrade --user minerl
RUN pip install \
pandas \
matplotlib \
numpy \
scipy \
azureml-defaults \
tensorboardX \
tensorflow==1.15rc2 \
tabulate \
dm_tree \
lz4 \
ray==0.8.3 \
ray[rllib]==0.8.3 \
ray[tune]==0.8.3
COPY patch_files/* /root/.local/lib/python3.7/site-packages/minerl/env/Malmo/Minecraft/src/main/java/com/microsoft/Malmo/Client/
# Start minerl to pre-fetch minerl files (saves time when starting minerl during training)
RUN xvfb-run -a -s "-screen 0 1400x900x24" python -c "import gym; import minerl; env = gym.make('MineRLTreechop-v0'); env.close();"
RUN pip install --index-url https://test.pypi.org/simple/ malmo && \
python -c "import malmo.minecraftbootstrap; malmo.minecraftbootstrap.download();"
ENV MALMO_XSD_PATH="/app/MalmoPlatform/Schemas"


@@ -1,939 +0,0 @@
// --------------------------------------------------------------------------------------------------
// Copyright (c) 2016 Microsoft Corporation
//
// Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
// associated documentation files (the "Software"), to deal in the Software without restriction,
// including without limitation the rights to use, copy, modify, merge, publish, distribute,
// sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all copies or
// substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT
// NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
// NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
// DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
// --------------------------------------------------------------------------------------------------
package com.microsoft.Malmo.Client;
import com.microsoft.Malmo.MalmoMod;
import com.microsoft.Malmo.MissionHandlerInterfaces.IWantToQuit;
import com.microsoft.Malmo.Schemas.MissionInit;
import com.microsoft.Malmo.Utils.TCPUtils;
import net.minecraft.profiler.Profiler;
import com.microsoft.Malmo.Utils.TimeHelper;
import net.minecraftforge.common.config.Configuration;
import java.io.*;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.Hashtable;
import com.microsoft.Malmo.Utils.TCPInputPoller;
import java.util.logging.Level;
import java.util.LinkedList;
import java.util.List;
/**
* MalmoEnvServer - service supporting OpenAI gym "environment" for multi-agent Malmo missions.
*/
public class MalmoEnvServer implements IWantToQuit {
private static Profiler profiler = new Profiler();
private static int nsteps = 0;
private static boolean debug = false;
private static String hello = "<MalmoEnv" ;
private class EnvState {
// Mission parameters:
String missionInit = null;
String token = null;
String experimentId = null;
int agentCount = 0;
int reset = 0;
boolean quit = false;
boolean synchronous = false;
Long seed = null;
// OpenAI gym state:
boolean done = false;
double reward = 0.0;
byte[] obs = null;
String info = "";
LinkedList<String> commands = new LinkedList<String>();
}
private static boolean envPolicy = false; // Are we configured by config policy?
    // Synchronize on EnvState
private Lock lock = new ReentrantLock();
private Condition cond = lock.newCondition();
private EnvState envState = new EnvState();
private Hashtable<String, Integer> initTokens = new Hashtable<String, Integer>();
static final long COND_WAIT_SECONDS = 3; // Max wait in seconds before timing out (and replying to RPC).
static final int BYTES_INT = 4;
static final int BYTES_DOUBLE = 8;
private static final Charset utf8 = Charset.forName("UTF-8");
// Service uses a single per-environment client connection - initiated by the remote environment.
private int port;
private TCPInputPoller missionPoller; // Used for command parsing and not actual communication.
private String version;
// AOG: From running experiments, I've found that MineRL can get stuck resetting the
// environment which causes huge delays while we wait for the Python side to time
    // out and restart the Minecraft instance. Minecraft itself is normally in a recoverable
    // state, but the MalmoEnvServer instance will be blocked in a tight spin loop trying
    // to handle a Peek request from the Python client. To unstick things, I've added this
// flag that can be set when we know things are in a bad state to abort the peek request.
// WARNING: THIS IS ONLY TREATING THE SYMPTOM AND NOT THE ROOT CAUSE
// The reason things are getting stuck is because the player is either dying or we're
// receiving a quit request while an episode reset is in progress.
private boolean abortRequest;
public void abort() {
System.out.println("AOG: MalmoEnvServer.abort");
abortRequest = true;
}
/***
* Malmo "Env" service.
* @param port the port the service listens on.
* @param missionPoller for plugging into existing comms handling.
*/
public MalmoEnvServer(String version, int port, TCPInputPoller missionPoller) {
this.version = version;
this.missionPoller = missionPoller;
this.port = port;
        // AOG - Assume we don't want to be aborting in the first place
this.abortRequest = false;
}
/** Initialize malmo env configuration. For now either on or "legacy" AgentHost protocol.*/
static public void update(Configuration configs) {
envPolicy = configs.get(MalmoMod.ENV_CONFIGS, "env", "false").getBoolean();
}
public static boolean isEnv() {
return envPolicy;
}
/**
* Start servicing the MalmoEnv protocol.
* @throws IOException
*/
public void serve() throws IOException {
ServerSocket serverSocket = new ServerSocket(port);
serverSocket.setPerformancePreferences(0,2,1);
while (true) {
try {
final Socket socket = serverSocket.accept();
socket.setTcpNoDelay(true);
Thread thread = new Thread("EnvServerSocketHandler") {
public void run() {
boolean running = false;
try {
checkHello(socket);
while (true) {
DataInputStream din = new DataInputStream(socket.getInputStream());
int hdr = din.readInt();
byte[] data = new byte[hdr];
din.readFully(data);
String command = new String(data, utf8);
if (command.startsWith("<Step")) {
profiler.startSection("root");
long start = System.nanoTime();
step(command, socket, din);
profiler.endSection();
if (nsteps % 100 == 0 && debug){
List<Profiler.Result> dat = profiler.getProfilingData("root");
for(int qq = 0; qq < dat.size(); qq++){
Profiler.Result res = dat.get(qq);
System.out.println(res.profilerName + " " + res.totalUsePercentage + " "+ res.usePercentage);
}
}
} else if (command.startsWith("<Peek")) {
peek(command, socket, din);
} else if (command.startsWith("<Init")) {
init(command, socket);
} else if (command.startsWith("<Find")) {
find(command, socket);
} else if (command.startsWith("<MissionInit")) {
if (missionInit(din, command, socket))
{
running = true;
}
} else if (command.startsWith("<Quit")) {
quit(command, socket);
profiler.profilingEnabled = false;
} else if (command.startsWith("<Exit")) {
exit(command, socket);
profiler.profilingEnabled = false;
} else if (command.startsWith("<Close")) {
close(command, socket);
profiler.profilingEnabled = false;
} else if (command.startsWith("<Status")) {
status(command, socket);
} else if (command.startsWith("<Echo")) {
command = "<Echo>" + command + "</Echo>";
data = command.getBytes(utf8);
hdr = data.length;
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
dout.writeInt(hdr);
dout.write(data, 0, hdr);
dout.flush();
} else {
throw new IOException("Unknown env service command");
}
}
} catch (IOException ioe) {
// ioe.printStackTrace();
TCPUtils.Log(Level.SEVERE, "MalmoEnv socket error: " + ioe + " (can be on disconnect)");
// System.out.println("[ERROR] " + "MalmoEnv socket error: " + ioe + " (can be on disconnect)");
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] MalmoEnv socket error");
try {
if (running) {
TCPUtils.Log(Level.INFO,"Want to quit on disconnect.");
System.out.println("[LOGTOPY] " + "Want to quit on disconnect.");
setWantToQuit();
}
socket.close();
} catch (IOException ioe2) {
}
}
}
};
thread.start();
} catch (IOException ioe) {
TCPUtils.Log(Level.SEVERE, "MalmoEnv service exits on " + ioe);
}
}
}
private void checkHello(Socket socket) throws IOException {
DataInputStream din = new DataInputStream(socket.getInputStream());
int hdr = din.readInt();
if (hdr <= 0 || hdr > hello.length() + 8) // Version number may be somewhat longer in future.
throw new IOException("Invalid MalmoEnv hello header length");
byte[] data = new byte[hdr];
din.readFully(data);
if (!new String(data).startsWith(hello + version))
throw new IOException("MalmoEnv invalid protocol or version - expected " + hello + version);
}
// Handler for <MissionInit> messages.
private boolean missionInit(DataInputStream din, String command, Socket socket) throws IOException {
String ipOriginator = socket.getInetAddress().getHostName();
int hdr;
byte[] data;
hdr = din.readInt();
data = new byte[hdr];
din.readFully(data);
String id = new String(data, utf8);
TCPUtils.Log(Level.INFO,"Mission Init" + id);
String[] token = id.split(":");
String experimentId = token[0];
int role = Integer.parseInt(token[1]);
int reset = Integer.parseInt(token[2]);
int agentCount = Integer.parseInt(token[3]);
Boolean isSynchronous = Boolean.parseBoolean(token[4]);
Long seed = null;
if(token.length > 5)
seed = Long.parseLong(token[5]);
if(isSynchronous && agentCount > 1){
throw new IOException("Synchronous mode currently does not support multiple agents.");
}
port = -1;
boolean allTokensConsumed = true;
boolean started = false;
lock.lock();
try {
if (role == 0) {
String previousToken = experimentId + ":0:" + (reset - 1);
initTokens.remove(previousToken);
String myToken = experimentId + ":0:" + reset;
if (!initTokens.containsKey(myToken)) {
TCPUtils.Log(Level.INFO,"(Pre)Start " + role + " reset " + reset);
started = startUp(command, ipOriginator, experimentId, reset, agentCount, myToken, seed, isSynchronous);
if (started)
initTokens.put(myToken, 0);
} else {
started = true; // Pre-started previously.
}
// Check that all previous tokens have been consumed. If not don't proceed to mission.
allTokensConsumed = areAllTokensConsumed(experimentId, reset, agentCount);
if (!allTokensConsumed) {
try {
cond.await(COND_WAIT_SECONDS, TimeUnit.SECONDS);
} catch (InterruptedException ie) {
}
allTokensConsumed = areAllTokensConsumed(experimentId, reset, agentCount);
}
} else {
TCPUtils.Log(Level.INFO, "Start " + role + " reset " + reset);
started = startUp(command, ipOriginator, experimentId, reset, agentCount, experimentId + ":" + role + ":" + reset, seed, isSynchronous);
}
} finally {
lock.unlock();
}
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
dout.writeInt(BYTES_INT);
dout.writeInt(allTokensConsumed && started ? 1 : 0);
dout.flush();
dout.flush();
return allTokensConsumed && started;
}
private boolean areAllTokensConsumed(String experimentId, int reset, int agentCount) {
boolean allTokensConsumed = true;
for (int i = 1; i < agentCount; i++) {
String tokenForAgent = experimentId + ":" + i + ":" + (reset - 1);
if (initTokens.containsKey(tokenForAgent)) {
TCPUtils.Log(Level.FINE,"Mission init - unconsumed " + tokenForAgent);
allTokensConsumed = false;
}
}
return allTokensConsumed;
}
private boolean startUp(String command, String ipOriginator, String experimentId, int reset, int agentCount, String myToken, Long seed, Boolean isSynchronous) throws IOException {
// Clear out mission state
envState.reward = 0.0;
envState.commands.clear();
envState.obs = null;
envState.info = "";
envState.missionInit = command;
envState.done = false;
envState.quit = false;
envState.token = myToken;
envState.experimentId = experimentId;
envState.agentCount = agentCount;
envState.reset = reset;
envState.synchronous = isSynchronous;
envState.seed = seed;
return startUpMission(command, ipOriginator);
}
private boolean startUpMission(String command, String ipOriginator) throws IOException {
if (missionPoller == null)
return false;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
missionPoller.commandReceived(command, ipOriginator, dos);
dos.flush();
byte[] reply = baos.toByteArray();
ByteArrayInputStream bais = new ByteArrayInputStream(reply);
DataInputStream dis = new DataInputStream(bais);
int hdr = dis.readInt();
byte[] replyBytes = new byte[hdr];
dis.readFully(replyBytes);
String replyStr = new String(replyBytes);
if (replyStr.equals("MALMOOK")) {
TCPUtils.Log(Level.INFO, "MalmoEnvServer Mission starting ...");
return true;
} else if (replyStr.equals("MALMOBUSY")) {
TCPUtils.Log(Level.INFO, "MalmoEnvServer Busy - I want to quit");
this.envState.quit = true;
}
return false;
}
private static final int stepTagLength = "<Step_>".length(); // Step with option code.
private synchronized void stepSync(String command, Socket socket, DataInputStream din) throws IOException
{
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Entering synchronous step.");
nsteps += 1;
profiler.startSection("commandProcessing");
String actions = command.substring(stepTagLength, command.length() - (stepTagLength + 2));
int options = Character.getNumericValue(command.charAt(stepTagLength - 2));
boolean withInfo = options == 0 || options == 2;
// Prepare to write data to the client.
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
double reward = 0.0;
boolean done;
byte[] obs;
String info = "";
boolean sent = false;
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Acquiring lock for synchronous step.");
lock.lock();
try {
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Lock is acquired.");
done = envState.done;
// TODO Handle when the environment is done.
// Process the actions.
if (actions.contains("\n")) {
String[] cmds = actions.split("\\n");
for(String cmd : cmds) {
envState.commands.add(cmd);
}
} else {
if (!actions.isEmpty())
envState.commands.add(actions);
}
sent = true;
profiler.endSection(); //cmd
profiler.startSection("requestTick");
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Received: " + actions);
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Requesting tick.");
// Now wait to run a tick
// If synchronous mode is off then we should see if want to quit is true.
while(!TimeHelper.SyncManager.requestTick() && !done ){Thread.yield();}
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Tick request granted.");
profiler.endSection();
profiler.startSection("waitForTick");
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Waiting for tick.");
// Then wait until the tick is finished
while(!TimeHelper.SyncManager.isTickCompleted() && !done ){ Thread.yield();}
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> TICK DONE. Getting observation.");
profiler.endSection();
profiler.startSection("getObservation");
// After which, get the observations.
obs = getObservation(done);
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Observation received. Getting info.");
profiler.endSection();
profiler.startSection("getInfo");
// Pick up rewards.
reward = envState.reward;
if (withInfo) {
info = envState.info;
// if(info == null)
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> FILLING INFO: NULL");
// else
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> FILLING " + info.toString());
}
done = envState.done;
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> STATUS " + Boolean.toString(done));
envState.info = null;
envState.obs = null;
envState.reward = 0.0;
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Info received..");
profiler.endSection();
} finally {
lock.unlock();
}
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Lock released. Writing observation, info, done.");
profiler.startSection("writeObs");
dout.writeInt(obs.length);
dout.write(obs);
dout.writeInt(BYTES_DOUBLE + 2);
dout.writeDouble(reward);
dout.writeByte(done ? 1 : 0);
dout.writeByte(sent ? 1 : 0);
if (withInfo) {
byte[] infoBytes = info.getBytes(utf8);
dout.writeInt(infoBytes.length);
dout.write(infoBytes);
}
profiler.endSection(); //write obs
profiler.startSection("flush");
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Packets written. Flushing.");
dout.flush();
profiler.endSection(); // flush
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Done with step.");
}
// Handler for <Step_> messages. Single digit option code after _ specifies if turnkey and info are included in message.
private void step(String command, Socket socket, DataInputStream din) throws IOException {
if(envState.synchronous){
stepSync(command, socket, din);
}
else{
System.out.println("[ERROR] Asynchronous stepping is not supported in MineRL.");
}
}
// Handler for <Peek> messages.
private void peek(String command, Socket socket, DataInputStream din) throws IOException {
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
byte[] obs;
boolean done;
String info = "";
        // AOG - As we've only seen issues with the peek request, I've focused my changes on just
// this function. Initially we want to be optimistic and assume we're not going to abort
// the request and my observations of event timings indicate that there is plenty of time
// between the peek request being received and the reset failing, so a race condition is
// unlikely.
abortRequest = false;
lock.lock();
try {
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <PEEK> Waiting for pistol to fire.");
while(!TimeHelper.SyncManager.hasServerFiredPistol() && !abortRequest){
// Now wait to run a tick
while(!TimeHelper.SyncManager.requestTick() && !abortRequest){Thread.yield();}
// Then wait until the tick is finished
while(!TimeHelper.SyncManager.isTickCompleted() && !abortRequest){ Thread.yield();}
Thread.yield();
}
if (abortRequest) {
System.out.println("AOG: Aborting peek request");
// AOG - We detect the lack of observation within our Python wrapper and throw a slightly
                // different exception that bypasses MineRL's automatic clean-up code. If we were to report
                // 'done', MineRL detects this as a runtime error and kills the Minecraft process,
                // triggering a lengthy restart. From testing so far, Minecraft itself is fine and we can
                // retry the reset; it's only the tight loops above that were causing things to stall and
// timeout.
// No observation
dout.writeInt(0);
// No info
dout.writeInt(0);
// Done
dout.writeInt(1);
dout.writeByte(0);
dout.flush();
return;
}
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <PEEK> Pistol fired!.");
// Wait two ticks for the first observation from server to be propagated.
while(!TimeHelper.SyncManager.requestTick() ){Thread.yield();}
// Then wait until the tick is finished
while(!TimeHelper.SyncManager.isTickCompleted()){ Thread.yield();}
while(!TimeHelper.SyncManager.requestTick() ){Thread.yield();}
// Then wait until the tick is finished
while(!TimeHelper.SyncManager.isTickCompleted()){ Thread.yield();}
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <PEEK> Getting observation.");
obs = getObservation(false);
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <PEEK> Observation acquired.");
done = envState.done;
info = envState.info;
} finally {
lock.unlock();
}
dout.writeInt(obs.length);
dout.write(obs);
byte[] infoBytes = info.getBytes(utf8);
dout.writeInt(infoBytes.length);
dout.write(infoBytes);
dout.writeInt(1);
dout.writeByte(done ? 1 : 0);
dout.flush();
}
// Get the current observation. If none and not done wait for a short time.
public byte[] getObservation(boolean done) {
byte[] obs = envState.obs;
if (obs == null){
System.out.println("[ERROR] Video observation is null; please notify the developer.");
}
return obs;
}
// Handler for <Find> messages - used by non-zero roles to discover integrated server port from primary (role 0) service.
private final static int findTagLength = "<Find>".length();
private void find(String command, Socket socket) throws IOException {
Integer port;
lock.lock();
try {
String token = command.substring(findTagLength, command.length() - (findTagLength + 1));
TCPUtils.Log(Level.INFO, "Find token? " + token);
// Purge previous token.
String[] tokenSplits = token.split(":");
String experimentId = tokenSplits[0];
int role = Integer.parseInt(tokenSplits[1]);
int reset = Integer.parseInt(tokenSplits[2]);
String previousToken = experimentId + ":" + role + ":" + (reset - 1);
initTokens.remove(previousToken);
cond.signalAll();
// Check for next token. Wait for a short time if not already produced.
port = initTokens.get(token);
if (port == null) {
try {
cond.await(COND_WAIT_SECONDS, TimeUnit.SECONDS);
} catch (InterruptedException ie) {
}
port = initTokens.get(token);
if (port == null) {
port = 0;
TCPUtils.Log(Level.INFO,"Role " + role + " reset " + reset + " waiting for token.");
}
}
} finally {
lock.unlock();
}
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
dout.writeInt(BYTES_INT);
dout.writeInt(port);
dout.flush();
}
public boolean isSynchronous(){
return envState.synchronous;
}
// Handler for <Init> messages. These reset the service so use with care!
private void init(String command, Socket socket) throws IOException {
lock.lock();
try {
initTokens = new Hashtable<String, Integer>();
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
dout.writeInt(BYTES_INT);
dout.writeInt(1);
dout.flush();
} finally {
lock.unlock();
}
}
// Handler for <Quit> (quit mission) messages.
private void quit(String command, Socket socket) throws IOException {
lock.lock();
try {
if (!envState.done){
envState.quit = true;
}
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <PEEK> Pistol fired!.");
// Wait two ticks for the first observation from server to be propagated.
while(!TimeHelper.SyncManager.requestTick() ){Thread.yield();}
// Then wait until the tick is finished
while(!TimeHelper.SyncManager.isTickCompleted()){ Thread.yield();}
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
dout.writeInt(BYTES_INT);
dout.writeInt(envState.done ? 1 : 0);
dout.flush();
} finally {
lock.unlock();
}
}
private final static int closeTagLength = "<Close>".length();
// Handler for <Close> messages.
private void close(String command, Socket socket) throws IOException {
lock.lock();
try {
String token = command.substring(closeTagLength, command.length() - (closeTagLength + 1));
initTokens.remove(token);
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
dout.writeInt(BYTES_INT);
dout.writeInt(1);
dout.flush();
} finally {
lock.unlock();
}
}
// Handler for <Status> messages.
private void status(String command, Socket socket) throws IOException {
lock.lock();
try {
String status = "{}"; // TODO Possibly have something more interesting to report.
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
byte[] statusBytes = status.getBytes(utf8);
dout.writeInt(statusBytes.length);
dout.write(statusBytes);
dout.flush();
} finally {
lock.unlock();
}
}
    // Handler for <Exit> messages. These "kill the service" temporarily so use with care!
private void exit(String command, Socket socket) throws IOException {
// lock.lock();
try {
// We may exit before we get a chance to reply.
TimeHelper.SyncManager.setSynchronous(false);
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
dout.writeInt(BYTES_INT);
dout.writeInt(1);
dout.flush();
ClientStateMachine.exitJava();
} finally {
// lock.unlock();
}
}
// Malmo client state machine interface methods:
public String getCommand() {
try {
String command = envState.commands.poll();
if (command == null)
return "";
else
return command;
} finally {
}
}
public void endMission() {
// lock.lock();
try {
// AOG - If the mission is ending, we always want to abort requests and they won't
// be able to progress to completion and will stall.
System.out.println("AOG: MalmoEnvServer.endMission");
abort();
envState.done = true;
envState.quit = false;
envState.missionInit = null;
if (envState.token != null) {
initTokens.remove(envState.token);
envState.token = null;
envState.experimentId = null;
envState.agentCount = 0;
envState.reset = 0;
// cond.signalAll();
}
// lock.unlock();
} finally {
}
}
// Record a Malmo "observation" json - as the env info since an environment "obs" is a video frame.
public void observation(String info) {
// Parsing obs as JSON would be slower but less fragile than extracting the turn_key using string search.
// lock.lock();
try {
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <OBSERVATION> Inserting: " + info);
envState.info = info;
// cond.signalAll();
} finally {
// lock.unlock();
}
}
public void addRewards(double rewards) {
// lock.lock();
try {
envState.reward += rewards;
} finally {
// lock.unlock();
}
}
public void addFrame(byte[] frame) {
// lock.lock();
try {
envState.obs = frame; // Replaces current.
// cond.signalAll();
} finally {
// lock.unlock();
}
}
public void notifyIntegrationServerStarted(int integrationServerPort) {
lock.lock();
try {
if (envState.token != null) {
TCPUtils.Log(Level.INFO,"Integration server start up - token: " + envState.token);
addTokens(integrationServerPort, envState.token, envState.experimentId, envState.agentCount, envState.reset);
cond.signalAll();
} else {
TCPUtils.Log(Level.WARNING,"No mission token on integration server start up!");
}
} finally {
lock.unlock();
}
}
private void addTokens(int integratedServerPort, String myToken, String experimentId, int agentCount, int reset) {
initTokens.put(myToken, integratedServerPort);
// Place tokens for other agents to find.
for (int i = 1; i < agentCount; i++) {
String tokenForAgent = experimentId + ":" + i + ":" + reset;
initTokens.put(tokenForAgent, integratedServerPort);
}
}
// IWantToQuit implementation.
@Override
public boolean doIWantToQuit(MissionInit missionInit) {
// lock.lock();
try {
return envState.quit;
} finally {
// lock.unlock();
}
}
public Long getSeed(){
return envState.seed;
}
private void setWantToQuit() {
// lock.lock();
try {
envState.quit = true;
} finally {
if(TimeHelper.SyncManager.isSynchronous()){
                // We want to desynchronize everything.
TimeHelper.SyncManager.setSynchronous(false);
}
// lock.unlock();
}
}
@Override
public void prepare(MissionInit missionInit) {
}
@Override
public void cleanup() {
}
@Override
public String getOutcome() {
return "Env quit";
}
}


@@ -1,78 +0,0 @@
FROM mcr.microsoft.com/azureml/base-gpu:openmpi3.1.2-cuda10.0-cudnn7-ubuntu18.04
# Install some basic utilities
RUN apt-get update && apt-get install -y \
curl \
ca-certificates \
sudo \
cpio \
git \
bzip2 \
libx11-6 \
tmux \
htop \
gcc \
xvfb \
python-opengl \
x11-xserver-utils \
ffmpeg \
mesa-utils \
nano \
vim \
rsync \
&& rm -rf /var/lib/apt/lists/*
# Create a working directory
RUN mkdir /app
WORKDIR /app
# Create a Python 3.7 environment
RUN conda install conda-build \
&& conda create -y --name py37 python=3.7.3 \
&& conda clean -ya
ENV CONDA_DEFAULT_ENV=py37
# Install Minecraft needed libraries
RUN mkdir -p /usr/share/man/man1 && \
sudo apt-get update && \
sudo apt-get install -y \
openjdk-8-jre-headless=8u162-b12-1 \
openjdk-8-jdk-headless=8u162-b12-1 \
openjdk-8-jre=8u162-b12-1 \
openjdk-8-jdk=8u162-b12-1
RUN pip install --upgrade --user minerl
# PyTorch with CUDA 10 installation
RUN conda install -y -c pytorch \
cuda100=1.0 \
magma-cuda100=2.4.0 \
"pytorch=1.1.0=py3.7_cuda10.0.130_cudnn7.5.1_0" \
torchvision=0.3.0 \
&& conda clean -ya
RUN pip install \
pandas \
matplotlib \
numpy \
scipy \
azureml-defaults \
tensorboardX \
tensorflow-gpu==1.15rc2 \
GPUtil \
tabulate \
dm_tree \
lz4 \
ray==0.8.3 \
ray[rllib]==0.8.3 \
ray[tune]==0.8.3
COPY patch_files/* /root/.local/lib/python3.7/site-packages/minerl/env/Malmo/Minecraft/src/main/java/com/microsoft/Malmo/Client/
# Start minerl to pre-fetch minerl files (saves time when starting minerl during training)
RUN xvfb-run -a -s "-screen 0 1400x900x24" python -c "import gym; import minerl; env = gym.make('MineRLTreechop-v0'); env.close();"
RUN pip install --index-url https://test.pypi.org/simple/ malmo && \
python -c "import malmo.minecraftbootstrap; malmo.minecraftbootstrap.download();"
ENV MALMO_XSD_PATH="/app/MalmoPlatform/Schemas"


@@ -1,939 +0,0 @@
// --------------------------------------------------------------------------------------------------
// Copyright (c) 2016 Microsoft Corporation
//
// Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
// associated documentation files (the "Software"), to deal in the Software without restriction,
// including without limitation the rights to use, copy, modify, merge, publish, distribute,
// sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all copies or
// substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT
// NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
// NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
// DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
// --------------------------------------------------------------------------------------------------
package com.microsoft.Malmo.Client;
import com.microsoft.Malmo.MalmoMod;
import com.microsoft.Malmo.MissionHandlerInterfaces.IWantToQuit;
import com.microsoft.Malmo.Schemas.MissionInit;
import com.microsoft.Malmo.Utils.TCPUtils;
import net.minecraft.profiler.Profiler;
import com.microsoft.Malmo.Utils.TimeHelper;
import net.minecraftforge.common.config.Configuration;
import java.io.*;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.Hashtable;
import com.microsoft.Malmo.Utils.TCPInputPoller;
import java.util.logging.Level;
import java.util.LinkedList;
import java.util.List;
/**
* MalmoEnvServer - service supporting OpenAI gym "environment" for multi-agent Malmo missions.
*/
public class MalmoEnvServer implements IWantToQuit {
private static Profiler profiler = new Profiler();
private static int nsteps = 0;
private static boolean debug = false;
private static String hello = "<MalmoEnv" ;
private class EnvState {
// Mission parameters:
String missionInit = null;
String token = null;
String experimentId = null;
int agentCount = 0;
int reset = 0;
boolean quit = false;
boolean synchronous = false;
Long seed = null;
// OpenAI gym state:
boolean done = false;
double reward = 0.0;
byte[] obs = null;
String info = "";
LinkedList<String> commands = new LinkedList<String>();
}
private static boolean envPolicy = false; // Are we configured by config policy?
    // Synchronize on EnvState
private Lock lock = new ReentrantLock();
private Condition cond = lock.newCondition();
private EnvState envState = new EnvState();
private Hashtable<String, Integer> initTokens = new Hashtable<String, Integer>();
static final long COND_WAIT_SECONDS = 3; // Max wait in seconds before timing out (and replying to RPC).
static final int BYTES_INT = 4;
static final int BYTES_DOUBLE = 8;
private static final Charset utf8 = Charset.forName("UTF-8");
// Service uses a single per-environment client connection - initiated by the remote environment.
private int port;
private TCPInputPoller missionPoller; // Used for command parsing and not actual communication.
private String version;
// AOG: From running experiments, I've found that MineRL can get stuck resetting the
// environment which causes huge delays while we wait for the Python side to time
    // out and restart the Minecraft instance. Minecraft itself is normally in a recoverable
    // state, but the MalmoEnvServer instance will be blocked in a tight spin loop trying
    // to handle a Peek request from the Python client. To unstick things, I've added this
// flag that can be set when we know things are in a bad state to abort the peek request.
// WARNING: THIS IS ONLY TREATING THE SYMPTOM AND NOT THE ROOT CAUSE
// The reason things are getting stuck is because the player is either dying or we're
// receiving a quit request while an episode reset is in progress.
private boolean abortRequest;
public void abort() {
System.out.println("AOG: MalmoEnvServer.abort");
abortRequest = true;
}
/***
* Malmo "Env" service.
* @param port the port the service listens on.
* @param missionPoller for plugging into existing comms handling.
*/
public MalmoEnvServer(String version, int port, TCPInputPoller missionPoller) {
this.version = version;
this.missionPoller = missionPoller;
this.port = port;
        // AOG - Assume we don't want to be aborting in the first place
this.abortRequest = false;
}
/** Initialize malmo env configuration. For now either on or "legacy" AgentHost protocol.*/
static public void update(Configuration configs) {
envPolicy = configs.get(MalmoMod.ENV_CONFIGS, "env", "false").getBoolean();
}
public static boolean isEnv() {
return envPolicy;
}
/**
* Start servicing the MalmoEnv protocol.
* @throws IOException
*/
public void serve() throws IOException {
ServerSocket serverSocket = new ServerSocket(port);
serverSocket.setPerformancePreferences(0,2,1);
while (true) {
try {
final Socket socket = serverSocket.accept();
socket.setTcpNoDelay(true);
Thread thread = new Thread("EnvServerSocketHandler") {
public void run() {
boolean running = false;
try {
checkHello(socket);
while (true) {
DataInputStream din = new DataInputStream(socket.getInputStream());
int hdr = din.readInt();
byte[] data = new byte[hdr];
din.readFully(data);
String command = new String(data, utf8);
if (command.startsWith("<Step")) {
profiler.startSection("root");
long start = System.nanoTime();
step(command, socket, din);
profiler.endSection();
if (nsteps % 100 == 0 && debug){
List<Profiler.Result> dat = profiler.getProfilingData("root");
for(int qq = 0; qq < dat.size(); qq++){
Profiler.Result res = dat.get(qq);
System.out.println(res.profilerName + " " + res.totalUsePercentage + " "+ res.usePercentage);
}
}
} else if (command.startsWith("<Peek")) {
peek(command, socket, din);
} else if (command.startsWith("<Init")) {
init(command, socket);
} else if (command.startsWith("<Find")) {
find(command, socket);
} else if (command.startsWith("<MissionInit")) {
if (missionInit(din, command, socket))
{
running = true;
}
} else if (command.startsWith("<Quit")) {
quit(command, socket);
profiler.profilingEnabled = false;
} else if (command.startsWith("<Exit")) {
exit(command, socket);
profiler.profilingEnabled = false;
} else if (command.startsWith("<Close")) {
close(command, socket);
profiler.profilingEnabled = false;
} else if (command.startsWith("<Status")) {
status(command, socket);
} else if (command.startsWith("<Echo")) {
command = "<Echo>" + command + "</Echo>";
data = command.getBytes(utf8);
hdr = data.length;
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
dout.writeInt(hdr);
dout.write(data, 0, hdr);
dout.flush();
} else {
throw new IOException("Unknown env service command");
}
}
} catch (IOException ioe) {
// ioe.printStackTrace();
TCPUtils.Log(Level.SEVERE, "MalmoEnv socket error: " + ioe + " (can be on disconnect)");
// System.out.println("[ERROR] " + "MalmoEnv socket error: " + ioe + " (can be on disconnect)");
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] MalmoEnv socket error");
try {
if (running) {
TCPUtils.Log(Level.INFO,"Want to quit on disconnect.");
System.out.println("[LOGTOPY] " + "Want to quit on disconnect.");
setWantToQuit();
}
socket.close();
} catch (IOException ioe2) {
}
}
}
};
thread.start();
} catch (IOException ioe) {
TCPUtils.Log(Level.SEVERE, "MalmoEnv service exits on " + ioe);
}
}
}
private void checkHello(Socket socket) throws IOException {
DataInputStream din = new DataInputStream(socket.getInputStream());
int hdr = din.readInt();
if (hdr <= 0 || hdr > hello.length() + 8) // Version number may be somewhat longer in future.
throw new IOException("Invalid MalmoEnv hello header length");
byte[] data = new byte[hdr];
din.readFully(data);
if (!new String(data).startsWith(hello + version))
throw new IOException("MalmoEnv invalid protocol or version - expected " + hello + version);
}
// Handler for <MissionInit> messages.
private boolean missionInit(DataInputStream din, String command, Socket socket) throws IOException {
String ipOriginator = socket.getInetAddress().getHostName();
int hdr;
byte[] data;
hdr = din.readInt();
data = new byte[hdr];
din.readFully(data);
String id = new String(data, utf8);
TCPUtils.Log(Level.INFO,"Mission Init" + id);
String[] token = id.split(":");
String experimentId = token[0];
int role = Integer.parseInt(token[1]);
int reset = Integer.parseInt(token[2]);
int agentCount = Integer.parseInt(token[3]);
Boolean isSynchronous = Boolean.parseBoolean(token[4]);
Long seed = null;
if(token.length > 5)
seed = Long.parseLong(token[5]);
if(isSynchronous && agentCount > 1){
throw new IOException("Synchronous mode currently does not support multiple agents.");
}
port = -1;
boolean allTokensConsumed = true;
boolean started = false;
lock.lock();
try {
if (role == 0) {
String previousToken = experimentId + ":0:" + (reset - 1);
initTokens.remove(previousToken);
String myToken = experimentId + ":0:" + reset;
if (!initTokens.containsKey(myToken)) {
TCPUtils.Log(Level.INFO,"(Pre)Start " + role + " reset " + reset);
started = startUp(command, ipOriginator, experimentId, reset, agentCount, myToken, seed, isSynchronous);
if (started)
initTokens.put(myToken, 0);
} else {
started = true; // Pre-started previously.
}
// Check that all previous tokens have been consumed. If not don't proceed to mission.
allTokensConsumed = areAllTokensConsumed(experimentId, reset, agentCount);
if (!allTokensConsumed) {
try {
cond.await(COND_WAIT_SECONDS, TimeUnit.SECONDS);
} catch (InterruptedException ie) {
}
allTokensConsumed = areAllTokensConsumed(experimentId, reset, agentCount);
}
} else {
TCPUtils.Log(Level.INFO, "Start " + role + " reset " + reset);
started = startUp(command, ipOriginator, experimentId, reset, agentCount, experimentId + ":" + role + ":" + reset, seed, isSynchronous);
}
} finally {
lock.unlock();
}
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
dout.writeInt(BYTES_INT);
dout.writeInt(allTokensConsumed && started ? 1 : 0);
dout.flush();
return allTokensConsumed && started;
}
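The id token parsed above is a colon-separated string. Below is a hedged sketch of how a client might frame it; the field order is taken from the parsing code in missionInit(), while the function name and values are illustrative.

import struct

def mission_init_id(experiment_id, role, reset, agent_count,
                    synchronous=True, seed=None):
    # Field order matches the token.split(":") parsing in missionInit().
    parts = [experiment_id, str(role), str(reset), str(agent_count),
             str(synchronous).lower()]
    if seed is not None:
        parts.append(str(seed))  # Optional trailing seed.
    body = ":".join(parts).encode("utf-8")
    # 4-byte big-endian length header, then the UTF-8 body.
    return struct.pack(">i", len(body)) + body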
private boolean areAllTokensConsumed(String experimentId, int reset, int agentCount) {
boolean allTokensConsumed = true;
for (int i = 1; i < agentCount; i++) {
String tokenForAgent = experimentId + ":" + i + ":" + (reset - 1);
if (initTokens.containsKey(tokenForAgent)) {
TCPUtils.Log(Level.FINE,"Mission init - unconsumed " + tokenForAgent);
allTokensConsumed = false;
}
}
return allTokensConsumed;
}
private boolean startUp(String command, String ipOriginator, String experimentId, int reset, int agentCount, String myToken, Long seed, Boolean isSynchronous) throws IOException {
// Clear out mission state
envState.reward = 0.0;
envState.commands.clear();
envState.obs = null;
envState.info = "";
envState.missionInit = command;
envState.done = false;
envState.quit = false;
envState.token = myToken;
envState.experimentId = experimentId;
envState.agentCount = agentCount;
envState.reset = reset;
envState.synchronous = isSynchronous;
envState.seed = seed;
return startUpMission(command, ipOriginator);
}
private boolean startUpMission(String command, String ipOriginator) throws IOException {
if (missionPoller == null)
return false;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
missionPoller.commandReceived(command, ipOriginator, dos);
dos.flush();
byte[] reply = baos.toByteArray();
ByteArrayInputStream bais = new ByteArrayInputStream(reply);
DataInputStream dis = new DataInputStream(bais);
int hdr = dis.readInt();
byte[] replyBytes = new byte[hdr];
dis.readFully(replyBytes);
String replyStr = new String(replyBytes, utf8);
if (replyStr.equals("MALMOOK")) {
TCPUtils.Log(Level.INFO, "MalmoEnvServer Mission starting ...");
return true;
} else if (replyStr.equals("MALMOBUSY")) {
TCPUtils.Log(Level.INFO, "MalmoEnvServer Busy - I want to quit");
this.envState.quit = true;
}
return false;
}
private static final int stepTagLength = "<Step_>".length(); // Step with option code.
private synchronized void stepSync(String command, Socket socket, DataInputStream din) throws IOException
{
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Entering synchronous step.");
nsteps += 1;
profiler.startSection("commandProcessing");
String actions = command.substring(stepTagLength, command.length() - (stepTagLength + 2));
int options = Character.getNumericValue(command.charAt(stepTagLength - 2));
boolean withInfo = options == 0 || options == 2;
// Prepare to write data to the client.
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
double reward = 0.0;
boolean done;
byte[] obs;
String info = "";
boolean sent = false;
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Acquiring lock for synchronous step.");
lock.lock();
try {
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Lock is acquired.");
done = envState.done;
// TODO Handle when the environment is done.
// Process the actions.
if (actions.contains("\n")) {
String[] cmds = actions.split("\\n");
for(String cmd : cmds) {
envState.commands.add(cmd);
}
} else {
if (!actions.isEmpty())
envState.commands.add(actions);
}
sent = true;
profiler.endSection(); //cmd
profiler.startSection("requestTick");
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Received: " + actions);
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Requesting tick.");
// Now wait to run a tick
// If synchronous mode is off, check whether the want-to-quit flag has been set.
while(!TimeHelper.SyncManager.requestTick() && !done ){Thread.yield();}
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Tick request granted.");
profiler.endSection();
profiler.startSection("waitForTick");
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Waiting for tick.");
// Then wait until the tick is finished
while(!TimeHelper.SyncManager.isTickCompleted() && !done ){ Thread.yield();}
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> TICK DONE. Getting observation.");
profiler.endSection();
profiler.startSection("getObservation");
// After which, get the observations.
obs = getObservation(done);
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Observation received. Getting info.");
profiler.endSection();
profiler.startSection("getInfo");
// Pick up rewards.
reward = envState.reward;
if (withInfo) {
info = envState.info;
// if(info == null)
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> FILLING INFO: NULL");
// else
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> FILLING " + info.toString());
}
done = envState.done;
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> STATUS " + Boolean.toString(done));
envState.info = null;
envState.obs = null;
envState.reward = 0.0;
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Info received..");
profiler.endSection();
} finally {
lock.unlock();
}
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Lock released. Writing observation, info, done.");
profiler.startSection("writeObs");
dout.writeInt(obs.length);
dout.write(obs);
dout.writeInt(BYTES_DOUBLE + 2);
dout.writeDouble(reward);
dout.writeByte(done ? 1 : 0);
dout.writeByte(sent ? 1 : 0);
if (withInfo) {
byte[] infoBytes = info.getBytes(utf8);
dout.writeInt(infoBytes.length);
dout.write(infoBytes);
}
profiler.endSection(); //write obs
profiler.startSection("flush");
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Packets written. Flushing.");
dout.flush();
profiler.endSection(); // flush
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <STEP> Done with step.");
}
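For reference, here is a hedged sketch (not part of the original file) of how a client might decode the reply that stepSync() writes above, assuming BYTES_DOUBLE is 8. recv_exact is a small helper; all names are illustrative.

import struct

def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("socket closed")
        buf += chunk
    return buf

def read_step_reply(sock, with_info=True):
    # Mirrors the writes in stepSync(): obs length + obs bytes, a block
    # header (BYTES_DOUBLE + 2 == 10), an 8-byte double reward, one byte
    # each for done and sent, then (optionally) length-prefixed info.
    obs_len = struct.unpack(">i", recv_exact(sock, 4))[0]
    obs = recv_exact(sock, obs_len)
    recv_exact(sock, 4)  # reward-block length header; value is always 10
    reward = struct.unpack(">d", recv_exact(sock, 8))[0]
    done, sent = recv_exact(sock, 2)
    info = ""
    if with_info:
        info_len = struct.unpack(">i", recv_exact(sock, 4))[0]
        info = recv_exact(sock, info_len).decode("utf-8")
    return obs, reward, bool(done), bool(sent), info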
// Handler for <Step_> messages. A single-digit option code in place of the underscore specifies whether the turn key and info are included in the message.
private void step(String command, Socket socket, DataInputStream din) throws IOException {
if(envState.synchronous){
stepSync(command, socket, din);
}
else{
System.out.println("[ERROR] Asynchronous stepping is not supported in MineRL.");
}
}
// Handler for <Peek> messages.
private void peek(String command, Socket socket, DataInputStream din) throws IOException {
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
byte[] obs;
boolean done;
String info = "";
// AOG - As we've only seen issues with the peek request, I've focused my changes on just
// this function. Initially we want to be optimistic and assume we're not going to abort
// the request; my observations of event timings indicate that there is plenty of time
// between the peek request being received and the reset failing, so a race condition is
// unlikely.
abortRequest = false;
lock.lock();
try {
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <PEEK> Waiting for pistol to fire.");
while(!TimeHelper.SyncManager.hasServerFiredPistol() && !abortRequest){
// Now wait to run a tick
while(!TimeHelper.SyncManager.requestTick() && !abortRequest){Thread.yield();}
// Then wait until the tick is finished
while(!TimeHelper.SyncManager.isTickCompleted() && !abortRequest){ Thread.yield();}
Thread.yield();
}
if (abortRequest) {
System.out.println("AOG: Aborting peek request");
// AOG - We detect the lack of observation within our Python wrapper and throw a slightly
// different exception that bypasses MineRL's automatic clean-up code. If we were to report
// 'done', MineRL would detect this as a runtime error and kill the Minecraft process,
// triggering a lengthy restart. So far from testing, Minecraft itself is fine and we can
// retry the reset; it's only the tight loops above that were causing things to stall and
// time out.
// No observation
dout.writeInt(0);
// No info
dout.writeInt(0);
// Done
dout.writeInt(1);
dout.writeByte(0);
dout.flush();
return;
}
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <PEEK> Pistol fired!.");
// Wait two ticks for the first observation from server to be propagated.
while(!TimeHelper.SyncManager.requestTick() ){Thread.yield();}
// Then wait until the tick is finished
while(!TimeHelper.SyncManager.isTickCompleted()){ Thread.yield();}
while(!TimeHelper.SyncManager.requestTick() ){Thread.yield();}
// Then wait until the tick is finished
while(!TimeHelper.SyncManager.isTickCompleted()){ Thread.yield();}
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <PEEK> Getting observation.");
obs = getObservation(false);
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <PEEK> Observation acquired.");
done = envState.done;
info = envState.info;
} finally {
lock.unlock();
}
dout.writeInt(obs.length);
dout.write(obs);
byte[] infoBytes = info.getBytes(utf8);
dout.writeInt(infoBytes.length);
dout.write(infoBytes);
dout.writeInt(1);
dout.writeByte(done ? 1 : 0);
dout.flush();
}
// Get the current observation. Logs an error if no frame is available.
public byte[] getObservation(boolean done) {
byte[] obs = envState.obs;
if (obs == null){
System.out.println("[ERROR] Video observation is null; please notify the developer.");
}
return obs;
}
// Handler for <Find> messages - used by non-zero roles to discover integrated server port from primary (role 0) service.
private final static int findTagLength = "<Find>".length();
private void find(String command, Socket socket) throws IOException {
Integer port;
lock.lock();
try {
String token = command.substring(findTagLength, command.length() - (findTagLength + 1));
TCPUtils.Log(Level.INFO, "Find token? " + token);
// Purge previous token.
String[] tokenSplits = token.split(":");
String experimentId = tokenSplits[0];
int role = Integer.parseInt(tokenSplits[1]);
int reset = Integer.parseInt(tokenSplits[2]);
String previousToken = experimentId + ":" + role + ":" + (reset - 1);
initTokens.remove(previousToken);
cond.signalAll();
// Check for next token. Wait for a short time if not already produced.
port = initTokens.get(token);
if (port == null) {
try {
cond.await(COND_WAIT_SECONDS, TimeUnit.SECONDS);
} catch (InterruptedException ie) {
}
port = initTokens.get(token);
if (port == null) {
port = 0;
TCPUtils.Log(Level.INFO,"Role " + role + " reset " + reset + " waiting for token.");
}
}
} finally {
lock.unlock();
}
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
dout.writeInt(BYTES_INT);
dout.writeInt(port);
dout.flush();
}
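A hedged sketch of the client-side counterpart of find() above, assuming commands are sent as the same length-prefixed frames used elsewhere in this protocol; function names are illustrative.

import struct

def _recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("socket closed")
        buf += chunk
    return buf

def find_server_port(sock, experiment_id, role, reset):
    # Send "<Find>experimentId:role:reset</Find>" with a length header and
    # read back a single int: the integrated server port, or 0 if the token
    # has not been published yet.
    body = "<Find>{}:{}:{}</Find>".format(
        experiment_id, role, reset).encode("utf-8")
    sock.sendall(struct.pack(">i", len(body)) + body)
    _recv_exact(sock, 4)  # reply length header (always 4 here)
    return struct.unpack(">i", _recv_exact(sock, 4))[0]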
public boolean isSynchronous(){
return envState.synchronous;
}
// Handler for <Init> messages. These reset the service so use with care!
private void init(String command, Socket socket) throws IOException {
lock.lock();
try {
initTokens = new Hashtable<String, Integer>();
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
dout.writeInt(BYTES_INT);
dout.writeInt(1);
dout.flush();
} finally {
lock.unlock();
}
}
// Handler for <Quit> (quit mission) messages.
private void quit(String command, Socket socket) throws IOException {
lock.lock();
try {
if (!envState.done){
envState.quit = true;
}
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <PEEK> Pistol fired!.");
// Wait two ticks for the first observation from server to be propagated.
while(!TimeHelper.SyncManager.requestTick() ){Thread.yield();}
// Then wait until the tick is finished
while(!TimeHelper.SyncManager.isTickCompleted()){ Thread.yield();}
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
dout.writeInt(BYTES_INT);
dout.writeInt(envState.done ? 1 : 0);
dout.flush();
} finally {
lock.unlock();
}
}
private final static int closeTagLength = "<Close>".length();
// Handler for <Close> messages.
private void close(String command, Socket socket) throws IOException {
lock.lock();
try {
String token = command.substring(closeTagLength, command.length() - (closeTagLength + 1));
initTokens.remove(token);
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
dout.writeInt(BYTES_INT);
dout.writeInt(1);
dout.flush();
} finally {
lock.unlock();
}
}
// Handler for <Status> messages.
private void status(String command, Socket socket) throws IOException {
lock.lock();
try {
String status = "{}"; // TODO Possibly have something more interesting to report.
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
byte[] statusBytes = status.getBytes(utf8);
dout.writeInt(statusBytes.length);
dout.write(statusBytes);
dout.flush();
} finally {
lock.unlock();
}
}
// Handler for <Exit> messages. These "kill the service" temporarily so use with care!
private void exit(String command, Socket socket) throws IOException {
// lock.lock();
try {
// We may exit before we get a chance to reply.
TimeHelper.SyncManager.setSynchronous(false);
DataOutputStream dout = new DataOutputStream(socket.getOutputStream());
dout.writeInt(BYTES_INT);
dout.writeInt(1);
dout.flush();
ClientStateMachine.exitJava();
} finally {
// lock.unlock();
}
}
// Malmo client state machine interface methods:
public String getCommand() {
try {
String command = envState.commands.poll();
if (command == null)
return "";
else
return command;
} finally {
}
}
public void endMission() {
// lock.lock();
try {
// AOG - If the mission is ending, we always want to abort requests and they won't
// be able to progress to completion and will stall.
System.out.println("AOG: MalmoEnvServer.endMission");
abort();
envState.done = true;
envState.quit = false;
envState.missionInit = null;
if (envState.token != null) {
initTokens.remove(envState.token);
envState.token = null;
envState.experimentId = null;
envState.agentCount = 0;
envState.reset = 0;
// cond.signalAll();
}
// lock.unlock();
} finally {
}
}
// Record a Malmo "observation" json - as the env info since an environment "obs" is a video frame.
public void observation(String info) {
// Parsing obs as JSON would be slower but less fragile than extracting the turn_key using string search.
// lock.lock();
try {
// TimeHelper.SyncManager.debugLog("[MALMO_ENV_SERVER] <OBSERVATION> Inserting: " + info);
envState.info = info;
// cond.signalAll();
} finally {
// lock.unlock();
}
}
public void addRewards(double rewards) {
// lock.lock();
try {
envState.reward += rewards;
} finally {
// lock.unlock();
}
}
public void addFrame(byte[] frame) {
// lock.lock();
try {
envState.obs = frame; // Replaces current.
// cond.signalAll();
} finally {
// lock.unlock();
}
}
public void notifyIntegrationServerStarted(int integrationServerPort) {
lock.lock();
try {
if (envState.token != null) {
TCPUtils.Log(Level.INFO,"Integration server start up - token: " + envState.token);
addTokens(integrationServerPort, envState.token, envState.experimentId, envState.agentCount, envState.reset);
cond.signalAll();
} else {
TCPUtils.Log(Level.WARNING,"No mission token on integration server start up!");
}
} finally {
lock.unlock();
}
}
private void addTokens(int integratedServerPort, String myToken, String experimentId, int agentCount, int reset) {
initTokens.put(myToken, integratedServerPort);
// Place tokens for other agents to find.
for (int i = 1; i < agentCount; i++) {
String tokenForAgent = experimentId + ":" + i + ":" + reset;
initTokens.put(tokenForAgent, integratedServerPort);
}
}
// IWantToQuit implementation.
@Override
public boolean doIWantToQuit(MissionInit missionInit) {
// lock.lock();
try {
return envState.quit;
} finally {
// lock.unlock();
}
}
public Long getSeed(){
return envState.seed;
}
private void setWantToQuit() {
// lock.lock();
try {
envState.quit = true;
} finally {
if(TimeHelper.SyncManager.isSynchronous()){
// We want to desynchronize everything.
TimeHelper.SyncManager.setSynchronous(false);
}
// lock.unlock();
}
}
@Override
public void prepare(MissionInit missionInit) {
}
@Override
public void cleanup() {
}
@Override
public String getOutcome() {
return "Env quit";
}
}

View File

@@ -1,173 +0,0 @@
import time
import glob
import pathlib
from malmo import MalmoPython, malmoutils
from malmo.launch_minecraft_in_background import launch_minecraft_in_background
class MalmoVideoRecorder:
DEFAULT_RECORDINGS_DIR = './logs/videos'
def __init__(self):
self.agent_host_bot = None
self.agent_host_camera = None
self.client_pool = None
self.is_malmo_initialized = False
def init_malmo(self, recordings_directory=DEFAULT_RECORDINGS_DIR):
if self.is_malmo_initialized:
return
launch_minecraft_in_background(
'/app/MalmoPlatform/Minecraft',
ports=[10000, 10001])
# Set up two agent hosts
self.agent_host_bot = MalmoPython.AgentHost()
self.agent_host_camera = MalmoPython.AgentHost()
# Create the list of Minecraft clients to attach to. The Minecraft
# instances must already have been launched via init_malmo() before
# record_malmo_video() is called.
self.client_pool = MalmoPython.ClientPool()
self.client_pool.add(MalmoPython.ClientInfo('127.0.0.1', 10000))
self.client_pool.add(MalmoPython.ClientInfo('127.0.0.1', 10001))
# Use bot's agenthost to hold the command-line options
malmoutils.parse_command_line(
self.agent_host_bot,
['--record_video', '--recording_dir', recordings_directory])
self.is_malmo_initialized = True
def _start_mission(self, agent_host, mission, recording_spec, role):
used_attempts = 0
max_attempts = 5
while True:
try:
agent_host.startMission(
mission,
self.client_pool,
recording_spec,
role,
'')
break
except MalmoPython.MissionException as e:
errorCode = e.details.errorCode
if errorCode == (MalmoPython.MissionErrorCode
.MISSION_SERVER_WARMING_UP):
time.sleep(2)
elif errorCode == (MalmoPython.MissionErrorCode
.MISSION_INSUFFICIENT_CLIENTS_AVAILABLE):
print('Not enough Minecraft instances running.')
used_attempts += 1
if used_attempts < max_attempts:
print('Will wait in case they are starting up.')
time.sleep(300)
elif errorCode == (MalmoPython.MissionErrorCode
.MISSION_SERVER_NOT_FOUND):
print('Server not found.')
used_attempts += 1
if used_attempts < max_attempts:
print('Will wait and retry.')
time.sleep(2)
else:
used_attempts = max_attempts
if used_attempts >= max_attempts:
raise e
def _wait_for_start(self, agent_hosts):
start_flags = [False for a in agent_hosts]
start_time = time.time()
time_out = 120
while not all(start_flags) and time.time() - start_time < time_out:
states = [a.peekWorldState() for a in agent_hosts]
start_flags = [w.has_mission_begun for w in states]
errors = [e for w in states for e in w.errors]
if len(errors) > 0:
print("Errors waiting for mission start:")
for e in errors:
print(e.text)
raise Exception("Encountered errors while starting mission.")
if time.time() - start_time >= time_out:
raise Exception("Timed out while waiting for mission to start.")
def _get_xml(self, xml_file, seed):
with open(xml_file, 'r') as mission_file:
return mission_file.read().format(SEED_PLACEHOLDER=seed)
def _is_mission_running(self):
return self.agent_host_bot.peekWorldState().is_mission_running or \
self.agent_host_camera.peekWorldState().is_mission_running
def record_malmo_video(self, instructions, xml_file, seed):
'''
Replays a set of instructions through Malmo using two players. The
first player will navigate the specified mission based on the given
instructions. The second player observes the first player's moves,
which is captured in a video.
'''
if not self.is_malmo_initialized:
raise Exception('Malmo not initialized. Call init_malmo() first.')
# Set up the mission
my_mission = MalmoPython.MissionSpec(
self._get_xml(xml_file, seed),
True)
bot_recording_spec = MalmoPython.MissionRecordSpec()
camera_recording_spec = MalmoPython.MissionRecordSpec()
recordingsDirectory = \
malmoutils.get_recordings_directory(self.agent_host_bot)
if recordingsDirectory:
camera_recording_spec.setDestination(
recordingsDirectory + "//rollout_" + str(seed) + ".tgz")
camera_recording_spec.recordMP4(
MalmoPython.FrameType.VIDEO,
36,
2000000,
False)
# Start the agents
self._start_mission(
self.agent_host_bot,
my_mission,
bot_recording_spec,
0)
self._start_mission(
self.agent_host_camera,
my_mission,
camera_recording_spec,
1)
self._wait_for_start([self.agent_host_camera, self.agent_host_bot])
# Teleport the camera agent to the required position
self.agent_host_camera.sendCommand('tp -29 72 -6.7')
instruction_index = 0
while self._is_mission_running():
command = instructions[instruction_index]
instruction_index += 1
self.agent_host_bot.sendCommand(command)
# Pause for half a second - change this for faster/slower videos
time.sleep(0.5)
if instruction_index == len(instructions):
self.agent_host_bot.sendCommand("jump 1")
time.sleep(2)
self.agent_host_bot.sendCommand("quit")
# Wait a little for Malmo to reset before the
# next mission is started
time.sleep(2)
print("Video recorded.")

View File

@@ -1,180 +0,0 @@
import json
import logging
import gym
import minerl.env.core
import minerl.env.comms
import numpy as np
from ray.rllib.env.atari_wrappers import FrameStack
from minerl.env.malmo import InstanceManager
# Shorten the MineRL timeouts so common errors are detected
# more quickly and recovery is faster
minerl.env.core.SOCKTIME = 60.0
minerl.env.comms.retry_timeout = 1
class EnvWrapper(minerl.env.core.MineRLEnv):
def __init__(self, xml, port):
InstanceManager.configure_malmo_base_port(port)
self.action_to_command_array = [
'move 1',
'camera 0 270',
'camera 0 90']
super().__init__(
xml,
gym.spaces.Box(low=0, high=255, shape=(84, 84, 3), dtype=np.uint8),
gym.spaces.Discrete(3)
)
self.metadata['video.frames_per_second'] = 2
def _setup_spaces(self, observation_space, action_space):
self.observation_space = observation_space
self.action_space = action_space
def _process_action(self, action_in) -> str:
assert self.action_space.contains(action_in)
assert action_in <= len(
self.action_to_command_array) - 1, 'action index out of bounds.'
return self.action_to_command_array[action_in]
def _process_observation(self, pov, info):
'''
Overridden to simplify: returns only `pov`, not an
obs_dict (observation dictionary) as MineRLEnv does.
'''
pov = np.frombuffer(pov, dtype=np.uint8)
if pov is None or len(pov) == 0:
raise Exception('Invalid observation, probably an aborted peek')
else:
pov = pov.reshape(
(self.height, self.width, self.depth)
)[::-1, :, :]
assert self.observation_space.contains(pov)
self._last_pov = pov
return pov
class TrackingEnv(gym.Wrapper):
def __init__(self, env):
super().__init__(env)
self._actions = [
self._forward,
self._turn_left,
self._turn_right
]
def _reset_state(self):
self._facing = (1, 0)
self._position = (0, 0)
self._visited = {}
self._update_visited()
def _forward(self):
self._position = (
self._position[0] + self._facing[0],
self._position[1] + self._facing[1]
)
def _turn_left(self):
self._facing = (self._facing[1], -self._facing[0])
def _turn_right(self):
self._facing = (-self._facing[1], self._facing[0])
def _encode_state(self):
return self._position
def _update_visited(self):
state = self._encode_state()
value = self._visited.get(state, 0)
self._visited[state] = value + 1
return value
def reset(self):
self._reset_state()
return super().reset()
def step(self, action):
o, r, d, i = super().step(action)
self._actions[action]()
revisit_count = self._update_visited()
if revisit_count == 0:
r += 0.1
return o, r, d, i
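A quick sanity check (added here, not in the original module) for the rotation arithmetic in _turn_left above: four left turns return the agent to its starting facing.

f = (1, 0)
for _ in range(4):
    f = (f[1], -f[0])  # same rotation as _turn_left
assert f == (1, 0)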
class TrajectoryWrapper(gym.Wrapper):
def __init__(self, env):
super().__init__(env)
self._trajectory = []
self._action_to_malmo_command_array = ['move 1', 'turn -1', 'turn 1']
def get_trajectory(self):
return self._trajectory
def _to_malmo_action(self, action_index):
return self._action_to_malmo_command_array[action_index]
def step(self, action):
self._trajectory.append(self._to_malmo_action(action))
o, r, d, i = super().step(action)
return o, r, d, i
class DummyEnv(gym.Env):
def __init__(self):
self.observation_space = gym.spaces.Box(
low=0,
high=255,
shape=(84, 84, 6),
dtype=np.uint8)
self.action_space = gym.spaces.Discrete(3)
# Define a function to create a MineRL environment
def create_env(config):
mission = config["mission"]
port = 1000 * config.worker_index + config.vector_index
print('*********************************************')
print(f'* Worker {config.worker_index} creating from '
f'mission: {mission}, port {port}')
print('*********************************************')
if config.worker_index == 0:
# The first environment is only used for checking the action
# and observation space. By using a dummy environment, there's
# no need to spin up a Minecraft instance behind it saving some
# CPU resources on the head node.
return DummyEnv()
env = EnvWrapper(mission, port)
env = TrackingEnv(env)
env = FrameStack(env, 2)
return env
def create_env_for_rollout(config):
mission = config['mission']
port = 1000 * config.worker_index + config.vector_index
print('*********************************************')
print(f'* Worker {config.worker_index} creating from '
f'mission: {mission}, port {port}')
print('*********************************************')
env = EnvWrapper(mission, port)
env = TrackingEnv(env)
env = FrameStack(env, 2)
env = TrajectoryWrapper(env)
return env

View File

@@ -1,95 +0,0 @@
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Mission xmlns="http://ProjectMalmo.microsoft.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<About>
<Summary>$(ENV_NAME)</Summary>
</About>
<ModSettings>
<MsPerTick>50</MsPerTick>
</ModSettings>
<ServerSection>
<ServerInitialConditions>
<Time>
<StartTime>6000</StartTime>
<AllowPassageOfTime>false</AllowPassageOfTime>
</Time>
<Weather>clear</Weather>
<AllowSpawning>false</AllowSpawning>
</ServerInitialConditions>
<ServerHandlers>
<FlatWorldGenerator generatorString="3;7,220*1,5*3,2;3;,biome_1"/>
<DrawingDecorator>
<DrawSphere x="-29" y="70" z="-2" radius="100" type="air"/>
<DrawCuboid x1="-34" y1="70" z1="-7" x2="-24" y2="70" z2="3" type="lava" />
</DrawingDecorator>
<MazeDecorator>
<Seed>random</Seed>
<SizeAndPosition width="5" length="5" height="10" xOrigin="-32" yOrigin="69" zOrigin="-5"/>
<StartBlock type="emerald_block" fixedToEdge="false"/>
<EndBlock type="lapis_block" fixedToEdge="false"/>
<PathBlock type="grass"/>
<FloorBlock type="air"/>
<GapBlock type="lava"/>
<GapProbability>0.6</GapProbability>
<AllowDiagonalMovement>false</AllowDiagonalMovement>
</MazeDecorator>
<ServerQuitFromTimeUp timeLimitMs="300000" description="out_of_time"/>
<ServerQuitWhenAnyAgentFinishes/>
</ServerHandlers>
</ServerSection>
<AgentSection mode="Survival">
<Name>AML_Bot</Name>
<AgentStart>
<Placement x="-28.5" y="71.0" z="-1.5" pitch="70" yaw="0"/>
</AgentStart>
<AgentHandlers>
<VideoProducer want_depth="false">
<Width>84</Width>
<Height>84</Height>
</VideoProducer>
<FileBasedPerformanceProducer/>
<ObservationFromFullInventory flat="false"/>
<ObservationFromFullStats/>
<HumanLevelCommands>
<ModifierList type="deny-list">
<command>moveMouse</command>
<command>inventory</command>
</ModifierList>
</HumanLevelCommands>
<CameraCommands/>
<ObservationFromCompass/>
<DiscreteMovementCommands/>
<RewardForMissionEnd>
<Reward description="out_of_time" reward="-1" />
</RewardForMissionEnd>
<RewardForTouchingBlockType>
<Block reward="-1.0" type="lava" behaviour="onceOnly"/>
<Block reward="1.0" type="lapis_block" behaviour="onceOnly"/>
</RewardForTouchingBlockType>
<RewardForSendingCommand reward="-0.02"/>
<AgentQuitFromTouchingBlockType>
<Block type="lava" />
<Block type="lapis_block" />
</AgentQuitFromTouchingBlockType>
<PauseCommand/>
<AgentQuitFromReachingCommandQuota total="50"/>
</AgentHandlers>
</AgentSection>
</Mission>

View File

@@ -1,95 +0,0 @@
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Mission xmlns="http://ProjectMalmo.microsoft.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<About>
<Summary>$(ENV_NAME)</Summary>
</About>
<ModSettings>
<MsPerTick>50</MsPerTick>
</ModSettings>
<ServerSection>
<ServerInitialConditions>
<Time>
<StartTime>6000</StartTime>
<AllowPassageOfTime>false</AllowPassageOfTime>
</Time>
<Weather>clear</Weather>
<AllowSpawning>false</AllowSpawning>
</ServerInitialConditions>
<ServerHandlers>
<FlatWorldGenerator generatorString="3;7,220*1,5*3,2;3;,biome_1"/>
<DrawingDecorator>
<DrawSphere x="-29" y="70" z="-2" radius="100" type="air"/>
<DrawCuboid x1="-34" y1="70" z1="-7" x2="-24" y2="70" z2="3" type="lava" />
</DrawingDecorator>
<MazeDecorator>
<Seed>{SEED_PLACEHOLDER}</Seed>
<SizeAndPosition width="6" length="6" height="10" xOrigin="-32" yOrigin="69" zOrigin="-5"/>
<StartBlock type="emerald_block" fixedToEdge="false"/>
<EndBlock type="lapis_block" fixedToEdge="false"/>
<PathBlock type="grass"/>
<FloorBlock type="air"/>
<GapBlock type="lava"/>
<GapProbability>0.6</GapProbability>
<AllowDiagonalMovement>false</AllowDiagonalMovement>
</MazeDecorator>
<ServerQuitFromTimeUp timeLimitMs="300000" description="out_of_time"/>
<ServerQuitWhenAnyAgentFinishes/>
</ServerHandlers>
</ServerSection>
<AgentSection mode="Survival">
<Name>AML_Bot</Name>
<AgentStart>
<Placement x="-28.5" y="71.0" z="-1.5" pitch="70" yaw="0"/>
</AgentStart>
<AgentHandlers>
<VideoProducer want_depth="false">
<Width>84</Width>
<Height>84</Height>
</VideoProducer>
<FileBasedPerformanceProducer/>
<ObservationFromFullInventory flat="false"/>
<ObservationFromFullStats/>
<HumanLevelCommands>
<ModifierList type="deny-list">
<command>moveMouse</command>
<command>inventory</command>
</ModifierList>
</HumanLevelCommands>
<CameraCommands/>
<ObservationFromCompass/>
<DiscreteMovementCommands/>
<RewardForMissionEnd>
<Reward description="out_of_time" reward="-1" />
</RewardForMissionEnd>
<RewardForTouchingBlockType>
<Block reward="-1.0" type="lava" behaviour="onceOnly"/>
<Block reward="1.0" type="lapis_block" behaviour="onceOnly"/>
</RewardForTouchingBlockType>
<RewardForSendingCommand reward="-0.02"/>
<AgentQuitFromTouchingBlockType>
<Block type="lava" />
<Block type="lapis_block" />
</AgentQuitFromTouchingBlockType>
<PauseCommand/>
<AgentQuitFromReachingCommandQuota total="50"/>
</AgentHandlers>
</AgentSection>
</Mission>

View File

@@ -1,74 +0,0 @@
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Mission xmlns="http://ProjectMalmo.microsoft.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<About>
<Summary>AML-Video-Gatherer</Summary>
</About>
<ModSettings>
<MsPerTick>50</MsPerTick>
</ModSettings>
<ServerSection>
<ServerInitialConditions>
<Time>
<StartTime>6000</StartTime>
<AllowPassageOfTime>false</AllowPassageOfTime>
</Time>
<Weather>clear</Weather>
<AllowSpawning>false</AllowSpawning>
</ServerInitialConditions>
<ServerHandlers>
<FlatWorldGenerator generatorString="3;7,220*1,5*3,2;3;,biome_1"/>
<MazeDecorator>
<Seed>{SEED_PLACEHOLDER}</Seed>
<SizeAndPosition width="6" length="6" height="10" xOrigin="-32" yOrigin="69" zOrigin="-5"/>
<StartBlock type="emerald_block" fixedToEdge="false"/>
<EndBlock type="lapis_block" fixedToEdge="false"/>
<PathBlock type="grass"/>
<FloorBlock type="air"/>
<GapBlock type="lava"/>
<GapProbability>0.6</GapProbability>
<AllowDiagonalMovement>false</AllowDiagonalMovement>
</MazeDecorator>
<ServerQuitFromTimeUp timeLimitMs="300000" description="out_of_time"/>
<ServerQuitWhenAnyAgentFinishes/>
</ServerHandlers>
</ServerSection>
<AgentSection mode="Survival">
<Name>Agent</Name>
<AgentStart>
<Placement x="-28.5" y="71.0" z="-1.5" yaw="0"/>
</AgentStart>
<AgentHandlers>
<HumanLevelCommands>
<ModifierList type="deny-list">
<command>moveMouse</command>
<command>inventory</command>
</ModifierList>
</HumanLevelCommands>
<DiscreteMovementCommands/>
<MissionQuitCommands/>
<AgentQuitFromReachingCommandQuota total="50"/>
</AgentHandlers>
</AgentSection>
<AgentSection mode="Spectator">
<Name>Camera_Bot</Name>
<AgentStart>
<Placement x="-29" y="72" z="-6.7" pitch="16" yaw="0"/>
</AgentStart>
<AgentHandlers>
<VideoProducer want_depth="false">
<Width>860</Width>
<Height>480</Height>
</VideoProducer>
<AbsoluteMovementCommands/>
</AgentHandlers>
</AgentSection>
</Mission>

View File

@@ -1,130 +0,0 @@
import argparse
import os
import re
from azureml.core import Run
from azureml.core.model import Model
from minecraft_environment import create_env_for_rollout
from malmo_video_recorder import MalmoVideoRecorder
from gym import wrappers
import ray
import ray.tune as tune
from ray.rllib import rollout
from ray.tune.registry import get_trainable_cls
def write_mission_file_for_seed(mission_file, seed):
with open(mission_file, 'r') as base_file:
mission_file_path = mission_file.replace('v0', seed)
content = base_file.read().format(SEED_PLACEHOLDER=seed)
mission_file = open(mission_file_path, 'w')
mission_file.writelines(content)
mission_file.close()
return mission_file_path
def run_rollout(trainable_type, mission_file, seed):
# Writes the mission file for minerl
mission_file_path = write_mission_file_for_seed(mission_file, seed)
# Instantiate the agent. Note: the IMPALA trainer implementation in
# Ray uses an AsyncSamplesOptimizer. Under the hood, this starts a
# LearnerThread which will wait for training samples. This will fail
# after a timeout, but has no influence on the rollout. See
# https://github.com/ray-project/ray/blob/708dff6d8f7dd6f7919e06c1845f1fea0cca5b89/rllib/optimizers/aso_learner.py#L66
config = {
"env_config": {
"mission": mission_file_path,
"is_rollout": True,
"seed": seed
},
"num_workers": 0
}
cls = get_trainable_cls(args.run)
agent = cls(env="Minecraft", config=config)
# The optimizer is not needed during a rollout
agent.optimizer.stop()
# Load state from checkpoint
agent.restore(f'{checkpoint_path}/{checkpoint_file}')
# Get a reference to the environment
env = agent.workers.local_worker().env
# Let the agent choose actions until the game is over
obs = env.reset()
done = False
total_reward = 0
while not done:
action = agent.compute_action(obs)
obs, reward, done, info = env.step(action)
total_reward += reward
print(f'Total reward using seed {seed}: {total_reward}')
# This avoids a sigterm trace in the logs, see minerl.env.malmo.Instance
env.instance.watcher_process.kill()
env.close()
agent.stop()
return env.get_trajectory()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--model_name', required=True)
parser.add_argument('--run', required=False, default="IMPALA")
args = parser.parse_args()
# Register custom Minecraft environment
tune.register_env("Minecraft", create_env_for_rollout)
ray.init(address='auto')
# Download the model files (contains a checkpoint)
ws = Run.get_context().experiment.workspace
model = Model(ws, args.model_name)
checkpoint_path = model.download(exist_ok=True)
files_ = os.listdir(checkpoint_path)
cp_pattern = re.compile('^checkpoint-\\d+$')
checkpoint_file = None
for f_ in files_:
if cp_pattern.match(f_):
checkpoint_file = f_
if checkpoint_file is None:
raise Exception("No checkpoint file found.")
# These are the Minecraft mission seeds for the rollouts
rollout_seeds = ['1234', '43289', '65224', '983341']
# Initialize the Malmo video recorder
video_recorder = MalmoVideoRecorder()
video_recorder.init_malmo()
# Path references to the mission files
base_training_mission_file = \
'minecraft_missions/lava_maze_rollout-v0.xml'
base_video_recording_mission_file = \
'minecraft_missions/lava_maze_rollout_video.xml'
for seed in rollout_seeds:
trajectory = run_rollout(
args.run,
base_training_mission_file,
seed)
video_recorder.record_malmo_video(
trajectory,
base_video_recording_mission_file,
seed)

View File

@@ -1,49 +0,0 @@
import os
import ray
import ray.tune as tune
from utils import callbacks
from minecraft_environment import create_env
def stop(trial_id, result):
max_train_time = int(os.environ.get("AML_MAX_TRAIN_TIME_SECONDS", 5 * 60 * 60))
return result["episode_reward_mean"] >= 1 \
or result["time_total_s"] >= max_train_time
if __name__ == '__main__':
tune.register_env("Minecraft", create_env)
ray.init(address='auto')
tune.run(
run_or_experiment="IMPALA",
config={
"env": "Minecraft",
"env_config": {
"mission": "minecraft_missions/lava_maze-v0.xml"
},
"num_workers": 10,
"num_cpus_per_worker": 2,
"rollout_fragment_length": 50,
"train_batch_size": 1024,
"replay_buffer_num_slots": 4000,
"replay_proportion": 10,
"learner_queue_timeout": 900,
"num_sgd_iter": 2,
"num_data_loader_buffers": 2,
"exploration_config": {
"type": "EpsilonGreedy",
"initial_epsilon": 1.0,
"final_epsilon": 0.02,
"epsilon_timesteps": 500000
},
"callbacks": {"on_train_result": callbacks.on_train_result},
},
stop=stop,
checkpoint_at_end=True,
local_dir='./logs'
)

View File

@@ -1,237 +0,0 @@
import sys
import csv
from azure.mgmt.network import NetworkManagementClient
def check_port_in_port_range(expected_port: str,
dest_port_range: str):
"""
Check if a port is within a port range
Port range maybe like *, 8080 or 8888-8889
"""
if dest_port_range == '*':
return True
dest_ports = dest_port_range.split('-')
if len(dest_ports) == 1 and \
int(dest_ports[0]) == int(expected_port):
return True
if len(dest_ports) == 2 and \
int(dest_ports[0]) <= int(expected_port) and \
int(dest_ports[1]) >= int(expected_port):
return True
return False
def check_port_in_destination_port_ranges(expected_port: str,
dest_port_ranges: list):
"""
Check if a port is within a given list of port ranges
e.g. check if port 8080 is in the port ranges 22,80,8080-8090,443
"""
for dest_port_range in dest_port_ranges:
if check_port_in_port_range(expected_port, dest_port_range) is True:
return True
return False
def check_ports_in_destination_port_ranges(expected_ports: list,
dest_port_ranges: list):
"""
Check if all ports in a given port list are within a given list
of port ranges
e.g. check if ports 8080,8081 are in the port ranges 22,80,8080-8090,443
"""
for expected_port in expected_ports:
if check_port_in_destination_port_ranges(
expected_port, dest_port_ranges) is False:
return False
return True
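A few illustrative checks (added here, not part of the original module) showing how the helpers above compose, including the two ports this script later requires for Azure Machine Learning:

assert check_port_in_port_range('8080', '*')
assert check_port_in_port_range('8080', '8080-8090')
assert not check_port_in_port_range('8080', '22')
assert check_ports_in_destination_port_ranges(
    ['29876', '29877'], ['22', '80', '29876-29877', '443'])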
def check_source_address_prefix(source_address_prefix: str):
"""Check if source address prefix is BatchNodeManagement or default"""
required_prefix = 'BatchNodeManagement'
default_prefix = 'default'
if source_address_prefix.lower() == required_prefix.lower() or \
source_address_prefix.lower() == default_prefix.lower():
return True
return False
def check_protocol(protocol: str):
"""Check if protocol is supported - Tcp/Any"""
required_protocol = 'Tcp'
any_protocol = 'Any'
if required_protocol.lower() == protocol.lower() or \
any_protocol.lower() == protocol.lower():
return True
return False
def check_direction(direction: str):
"""Check if port direction is inbound"""
required_direction = 'Inbound'
if required_direction.lower() == direction.lower():
return True
return False
def check_provisioning_state(provisioning_state: str):
"""Check if the provisioning state is succeeded"""
required_provisioning_state = 'Succeeded'
if required_provisioning_state.lower() == provisioning_state.lower():
return True
return False
def check_rule_for_Azure_ML(rule):
"""Check if the ports required for Azure Machine Learning are open"""
required_ports = ['29876', '29877']
if check_source_address_prefix(rule.source_address_prefix) is False:
return False
if check_protocol(rule.protocol) is False:
return False
if check_direction(rule.direction) is False:
return False
if check_provisioning_state(rule.provisioning_state) is False:
return False
if rule.destination_port_range is not None:
if check_ports_in_destination_port_ranges(
required_ports,
[rule.destination_port_range]) is False:
return False
else:
if check_ports_in_destination_port_ranges(
required_ports,
rule.destination_port_ranges) is False:
return False
return True
def check_vnet_security_rules(auth_object,
vnet_subscription_id,
vnet_resource_group,
vnet_name,
save_to_file=False):
"""
Check all the rules of virtual network if required ports for Azure Machine
Learning are open
"""
network_client = NetworkManagementClient(
auth_object,
vnet_subscription_id)
# get the vnet
vnet = network_client.virtual_networks.get(
resource_group_name=vnet_resource_group,
virtual_network_name=vnet_name)
vnet_location = vnet.location
vnet_info = []
if vnet.subnets is None or len(vnet.subnets) == 0:
print('WARNING: No subnet found for VNet:', vnet_name)
# for each subnet of the vnet
for subnet in vnet.subnets:
if subnet.network_security_group is None:
print('WARNING: No network security group found for subnet.',
'Subnet',
subnet.id.split("/")[-1])
else:
# get all the rules
network_security_group_name = \
subnet.network_security_group.id.split("/")[-1]
network_security_group_resource_group_name = \
subnet.network_security_group.id.split("/")[4]
network_security_group_subscription_id = \
subnet.network_security_group.id.split("/")[2]
security_rules = list(network_client.security_rules.list(
network_security_group_resource_group_name,
network_security_group_name))
rule_matched = None
for rule in security_rules:
rule_info = []
# add vnet details
rule_info.append(vnet_name)
rule_info.append(vnet_subscription_id)
rule_info.append(vnet_resource_group)
rule_info.append(vnet_location)
# add subnet details
rule_info.append(subnet.id.split("/")[-1])
rule_info.append(network_security_group_name)
rule_info.append(network_security_group_subscription_id)
rule_info.append(network_security_group_resource_group_name)
# add rule details
rule_info.append(rule.priority)
rule_info.append(rule.name)
rule_info.append(rule.source_address_prefix)
if rule.destination_port_range is not None:
rule_info.append(rule.destination_port_range)
else:
rule_info.append(rule.destination_port_ranges)
rule_info.append(rule.direction)
rule_info.append(rule.provisioning_state)
vnet_info.append(rule_info)
if check_rule_for_Azure_ML(rule) is True:
rule_matched = rule
if rule_matched is not None:
print("INFORMATION: Rule matched with required ports. Subnet:",
subnet.id.split("/")[-1], "Rule:", rule.name)
else:
print("WARNING: No rule matched with required ports. Subnet:",
subnet.id.split("/")[-1])
if save_to_file is True:
file_name = vnet_name + ".csv"
with open(file_name, mode='w') as vnet_rule_file:
vnet_rule_file_writer = csv.writer(
vnet_rule_file,
delimiter=',',
quotechar='"',
quoting=csv.QUOTE_MINIMAL)
header = ['VNet_Name', 'VNet_Subscription_ID',
'VNet_Resource_Group', 'VNet_Location',
'Subnet_Name', 'NSG_Name',
'NSG_Subscription_ID', 'NSG_Resource_Group',
'Rule_Priority', 'Rule_Name', 'Rule_Source',
'Rule_Destination_Ports', 'Rule_Direction',
'Rule_Provisioning_State']
vnet_rule_file_writer.writerow(header)
vnet_rule_file_writer.writerows(vnet_info)
print("INFORMATION: Network security group rules for your virtual \
network are saved in file", file_name)
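A hedged sketch of calling the checker above. The credential class accepted by NetworkManagementClient depends on your azure SDK version, and the identifiers are placeholders.

from azure.identity import DefaultAzureCredential

check_vnet_security_rules(
    DefaultAzureCredential(),
    vnet_subscription_id='<subscription-id>',
    vnet_resource_group='<resource-group>',
    vnet_name='<vnet-name>',
    save_to_file=True)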

View File

@@ -1,18 +0,0 @@
'''RLlib callbacks module:
Common callback methods to be passed to RLlib trainer.
'''
from azureml.core import Run
def on_train_result(info):
'''Callback on train result to record metrics returned by trainer.
'''
run = Run.get_context()
run.log(
name='episode_reward_mean',
value=info["result"]["episode_reward_mean"])
run.log(
name='episodes_total',
value=info["result"]["episodes_total"])

View File

@@ -1,8 +0,0 @@
name: minecraft
dependencies:
- pip:
- azureml-sdk
- azureml-contrib-reinforcementlearning
- azureml-widgets
- tensorboard
- azureml-tensorboard

View File

@@ -0,0 +1,17 @@
# AzureML Responsible AI
AzureML Responsible AI empowers data scientists and developers to innovate responsibly with a growing set of tools including model interpretability and fairness.
Follow these sample notebooks to learn about the model interpretability and fairness integration in Azure:
<a name="samples"></a>
# Responsible AI Sample Notebooks
- **Visualize fairness metrics and model explanations**
- Dataset: [UCI Adult](https://archive.ics.uci.edu/ml/datasets/Adult)
- **[Jupyter Notebook](visualize-upload-loan-decision/rai-loan-decision.ipynb)**
- Train a model to predict annual income
- Generate fairness and interpretability explanations for the trained model
- Visualize the explanations in the notebook widget dashboard
- Upload the explanations to Azure to be viewed in AzureML studio

View File

@@ -0,0 +1,722 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/responsible-ai/visualize-upload-loan-decision/rai-loan-decision.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assess Fairness, Explore Interpretability, and Mitigate Fairness Issues \n",
"\n",
"This notebook demonstrates how to use [InterpretML](interpret.ml), [Fairlearn](fairlearn.org), and the [Responsible AI Widget's](https://github.com/microsoft/responsible-ai-widgets/) Fairness and Interpretability dashboards to understand a model trained on the Census dataset. This dataset is a classification problem - given a range of data about 32,000 individuals, predict whether their annual income is above or below fifty thousand dollars per year.\n",
"\n",
"For the purposes of this notebook, we shall treat this as a loan decision problem. We will pretend that the label indicates whether or not each individual repaid a loan in the past. We will use the data to train a predictor to predict whether previously unseen individuals will repay a loan or not. The assumption is that the model predictions are used to decide whether an individual should be offered a loan.\n",
"\n",
"We will first train a fairness-unaware predictor, load its global and local explanations, and use the interpretability and fairness dashboards to demonstrate how this model leads to unfair decisions (under a specific notion of fairness called *demographic parity*). We then mitigate unfairness by applying the `GridSearch` algorithm from `Fairlearn` package.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install required packages\n",
"\n",
"This notebook works with Fairlearn v0.4.6, and not later versions. If needed, please uncomment and run the following cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# %pip install --upgrade fairlearn==0.4.6"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After installing packages, you must close and reopen the notebook as well as restarting the kernel."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load and preprocess the dataset\n",
"\n",
"For simplicity, we import the dataset from the `shap` package, which contains the data in a cleaned format. We start by importing the various modules we're going to use:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fairlearn.reductions import GridSearch\n",
"from fairlearn.reductions import DemographicParity, ErrorRate\n",
"\n",
"from sklearn import svm, neighbors, tree\n",
"from sklearn.compose import ColumnTransformer, make_column_selector\n",
"from sklearn.preprocessing import LabelEncoder,StandardScaler\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
"from sklearn.svm import SVC\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.datasets import fetch_openml\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# SHAP Tabular Explainer\n",
"from interpret.ext.blackbox import KernelExplainer\n",
"from interpret.ext.blackbox import MimicExplainer\n",
"from interpret.ext.glassbox import LGBMExplainableModel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now load and inspect the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset = fetch_openml(data_id=1590, as_frame=True)\n",
"X_raw, y = dataset['data'], dataset['target']\n",
"X_raw[\"race\"].value_counts().to_dict()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to treat the sex of each individual as a protected attribute (where 0 indicates female and 1 indicates male), and in this particular case we are going separate this attribute out and drop it from the main data. We then perform some standard data preprocessing steps to convert the data into a format suitable for the ML algorithms"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sensitive_features = X_raw[['sex','race']]\n",
"\n",
"le = LabelEncoder()\n",
"y = le.fit_transform(y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we split the data into training and test sets:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, y_train, y_test, sensitive_features_train, sensitive_features_test = \\\n",
" train_test_split(X_raw, y, sensitive_features,\n",
" test_size = 0.2, random_state=0, stratify=y)\n",
"\n",
"# Work around indexing bug\n",
"X_train = X_train.reset_index(drop=True)\n",
"sensitive_features_train = sensitive_features_train.reset_index(drop=True)\n",
"X_test = X_test.reset_index(drop=True)\n",
"sensitive_features_test = sensitive_features_test.reset_index(drop=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training a fairness-unaware predictor\n",
"\n",
"To show the effect of `Fairlearn` we will first train a standard ML predictor that does not incorporate fairness. For speed of demonstration, we use a simple logistic regression estimator from `sklearn`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"numeric_transformer = Pipeline(\n",
" steps=[\n",
" (\"impute\", SimpleImputer()),\n",
" (\"scaler\", StandardScaler()),\n",
" ]\n",
")\n",
"categorical_transformer = Pipeline(\n",
" [\n",
" (\"impute\", SimpleImputer(strategy=\"most_frequent\")),\n",
" (\"ohe\", OneHotEncoder(handle_unknown=\"ignore\")),\n",
" ]\n",
")\n",
"preprocessor = ColumnTransformer(\n",
" transformers=[\n",
" (\"num\", numeric_transformer, make_column_selector(dtype_exclude=\"category\")),\n",
" (\"cat\", categorical_transformer, make_column_selector(dtype_include=\"category\")),\n",
" ]\n",
")\n",
"\n",
"model = Pipeline(\n",
" steps=[\n",
" (\"preprocessor\", preprocessor),\n",
" (\n",
" \"classifier\",\n",
" LogisticRegression(solver=\"liblinear\", fit_intercept=True),\n",
" ),\n",
" ]\n",
")\n",
"\n",
"model.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate model explanations"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Using SHAP KernelExplainer\n",
"# clf.steps[-1][1] returns the trained classification model\n",
"explainer = MimicExplainer(model.steps[-1][1], \n",
" X_train,\n",
" LGBMExplainableModel,\n",
" features=X_raw.columns, \n",
" classes=['Rejected', 'Approved'],\n",
" transformations=preprocessor)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate global explanations\n",
"Explain overall model predictions (global explanation)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Explain the model based on a subset of 1000 rows\n",
"global_explanation = explainer.explain_global(X_test[:1000])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"global_explanation.get_feature_importance_dict()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate local explanations\n",
"Explain local data points (individual instances)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You can pass a specific data point or a group of data points to the explain_local function\n",
"# E.g., Explain the first data point in the test set\n",
"instance_num = 1\n",
"local_explanation = explainer.explain_local(X_test[:instance_num])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get the prediction for the first member of the test set and explain why model made that prediction\n",
"prediction_value = model.predict(X_test)[instance_num]\n",
"\n",
"sorted_local_importance_values = local_explanation.get_ranked_local_values()[prediction_value]\n",
"sorted_local_importance_names = local_explanation.get_ranked_local_names()[prediction_value]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('local importance values: {}'.format(sorted_local_importance_values))\n",
"print('local importance names: {}'.format(sorted_local_importance_names))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize model explanations\n",
"Load the interpretability visualization dashboard"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from raiwidgets import ExplanationDashboard"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ExplanationDashboard(global_explanation, model, dataset=X_test[:1000], true_y=y_test[:1000])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can load this predictor into the Fairness dashboard, and examine how it is unfair:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Assess model fairness \n",
"Load the fairness visualization dashboard"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fairlearn.widget import FairlearnDashboard\n",
"\n",
"y_pred = model.predict(X_test)\n",
"\n",
"FairlearnDashboard(sensitive_features=sensitive_features_test,\n",
" y_true=y_test,\n",
" y_pred=y_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the disparity in accuracy, we see that males have an error rate about three times greater than the females. More interesting is the disparity in opportunitiy - males are offered loans at three times the rate of females.\n",
"\n",
"Despite the fact that we removed the feature from the training data, our predictor still discriminates based on sex. This demonstrates that simply ignoring a protected attribute when fitting a predictor rarely eliminates unfairness. There will generally be enough other features correlated with the removed attribute to lead to disparate impact."
]
},
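{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is a small added check (not part of the original walkthrough): it computes the selection rate per group directly from the predictions, which is the quantity behind the disparity-in-opportunity observation above. The group labels are assumed to be those used in the OpenML Adult dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Selection rate per sex group: the fraction of individuals predicted\n",
"# positive (i.e., offered a loan). y_pred was computed for the dashboard above.\n",
"for group in ['Female', 'Male']:\n",
"    mask = (sensitive_features_test.sex == group).values\n",
"    print(group, 'selection rate:', y_pred[mask].mean())"
]
},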
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mitigation with Fairlearn (GridSearch)\n",
"\n",
"The `GridSearch` class in `Fairlearn` implements a simplified version of the exponentiated gradient reduction of [Agarwal et al. 2018](https://arxiv.org/abs/1803.02453). The user supplies a standard ML estimator, which is treated as a blackbox. `GridSearch` works by generating a sequence of relabellings and reweightings, and trains a predictor for each.\n",
"\n",
"For this example, we specify demographic parity (on the protected attribute of sex) as the fairness metric. Demographic parity requires that individuals are offered the opportunity (are approved for a loan in this example) independent of membership in the protected class (i.e., females and males should be offered loans at the same rate). We are using this metric for the sake of simplicity; in general, the appropriate fairness metric will not be obvious."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fairlearn is not yet fully compatible with Pipelines, so we have to pass the estimator only\n",
"X_train_prep = preprocessor.transform(X_train).toarray()\n",
"X_test_prep = preprocessor.transform(X_test).toarray()\n",
"\n",
"sweep = GridSearch(LogisticRegression(solver=\"liblinear\", fit_intercept=True),\n",
" constraints=DemographicParity(),\n",
" grid_size=70)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our algorithms provide `fit()` and `predict()` methods, so they behave in a similar manner to other ML packages in Python. We do however have to specify two extra arguments to `fit()` - the column of protected attribute labels, and also the number of predictors to generate in our sweep.\n",
"\n",
"After `fit()` completes, we extract the full set of predictors from the `GridSearch` object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sweep.fit(X_train_prep, y_train,\n",
" sensitive_features=sensitive_features_train.sex)\n",
"\n",
"predictors = sweep._predictors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could load these predictors into the Fairness dashboard now. However, the plot would be somewhat confusing due to their number. In this case, we are going to remove the predictors which are dominated in the error-disparity space by others from the sweep (note that the disparity will only be calculated for the sensitive feature). In general, one might not want to do this, since there may be other considerations beyond the strict optimization of error and disparity (of the given protected attribute)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fairlearn.metrics import demographic_parity_difference\n",
"\n",
"accuracies, disparities = [], []\n",
"\n",
"for predictor in predictors:\n",
" y_pred = predictor.predict(X_train_prep)\n",
" # accuracy_metric_frame = MetricFrame(accuracy_score, y_train, predictor.predict(X_train_prep), sensitive_features=sensitive_features_train.sex)\n",
" # selection_rate_metric_frame = MetricFrame(selection_rate, y_train, predictor.predict(X_train_prep), sensitive_features=sensitive_features_train.sex)\n",
" accuracies.append(accuracy_score(y_train, y_pred))\n",
" disparities.append(demographic_parity_difference(y_train,\n",
" y_pred,\n",
" sensitive_features=sensitive_features_train.sex))\n",
" \n",
"all_results = pd.DataFrame({\"predictor\": predictors, \"accuracy\": accuracies, \"disparity\": disparities})\n",
"\n",
"all_models_dict = {\"unmitigated\": model.steps[-1][1]}\n",
"dominant_models_dict = {\"unmitigated\": model.steps[-1][1]}\n",
"base_name_format = \"grid_{0}\"\n",
"row_id = 0\n",
"for row in all_results.itertuples():\n",
" model_name = base_name_format.format(row_id)\n",
" all_models_dict[model_name] = row.predictor\n",
" accuracy_for_lower_or_eq_disparity = all_results[\"accuracy\"][all_results[\"disparity\"] <= row.disparity]\n",
" if row.accuracy >= accuracy_for_lower_or_eq_disparity.max():\n",
" dominant_models_dict[model_name] = row.predictor\n",
" row_id = row_id + 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can construct predictions for all the models, and also for the dominant models:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dashboard_all = {}\n",
"for name, predictor in all_models_dict.items():\n",
" value = predictor.predict(X_test_prep)\n",
" dashboard_all[name] = value\n",
" \n",
"dominant_all = {}\n",
"for name, predictor in dominant_models_dict.items():\n",
" dominant_all[name] = predictor.predict(X_test_prep)\n",
"\n",
"FairlearnDashboard(sensitive_features=sensitive_features_test, \n",
" y_true=y_test,\n",
" y_pred=dominant_all)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can look at just the dominant models in the dashboard:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see a Pareto front forming - the set of predictors which represent optimal tradeoffs between accuracy and disparity in predictions. In the ideal case, we would have a predictor at (1,0) - perfectly accurate and without any unfairness under demographic parity (with respect to the protected attribute \"sex\"). The Pareto front represents the closest we can come to this ideal based on our data and choice of estimator. Note the range of the axes - the disparity axis covers more values than the accuracy, so we can reduce disparity substantially for a small loss in accuracy.\n",
"\n",
"By clicking on individual models on the plot, we can inspect their metrics for disparity and accuracy in greater detail. In a real example, we would then pick the model which represented the best trade-off between accuracy and disparity given the relevant business constraints."
]
},
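{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a quick offline view of the same trade-off - a sketch using `matplotlib` (which is in this notebook's environment) rather than a dashboard feature - we can plot the training accuracies and disparities computed above and highlight the dominant models:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A hedged sketch of the error-disparity trade-off; not part of the dashboard\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Mark the rows of all_results whose predictor survived the dominance filter\n",
"is_dominant = all_results[\"predictor\"].isin(list(dominant_models_dict.values()))\n",
"\n",
"plt.scatter(all_results[\"disparity\"], all_results[\"accuracy\"], alpha=0.5, label=\"all grid models\")\n",
"plt.scatter(all_results.loc[is_dominant, \"disparity\"], all_results.loc[is_dominant, \"accuracy\"], label=\"dominant models\")\n",
"plt.xlabel(\"demographic parity difference (train)\")\n",
"plt.ylabel(\"accuracy (train)\")\n",
"plt.legend()\n",
"plt.show()"
]
},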
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# AzureML integration\n",
"\n",
"We will now go through a brief example of the AzureML integration.\n",
"\n",
"The required package can be installed via:\n",
"\n",
"```\n",
"pip install azureml-contrib-fairness\n",
"pip install azureml-interpret\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connect to workspace\n",
"\n",
"Just like in the previous tutorials, we will need to connect to a [workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace(class)?view=azure-ml-py).\n",
"\n",
"The following code will allow you to create a workspace if you don't already have one created. You must have an Azure subscription to create a workspace:\n",
"\n",
"```python\n",
"from azureml.core import Workspace\n",
"ws = Workspace.create(name='myworkspace',\n",
" subscription_id='<azure-subscription-id>',\n",
" resource_group='myresourcegroup',\n",
" create_resource_group=True,\n",
" location='eastus2')\n",
"```\n",
"\n",
"**If you are running this on a Notebook VM, you can import the existing workspace.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Registering models\n",
"\n",
"The fairness dashboard is designed to integrate with registered models, so we need to do this for the models we want in the Studio portal. The assumption is that the names of the models specified in the dashboard dictionary correspond to the `id`s (i.e. `<name>:<version>` pairs) of registered models in the workspace."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we register each of the models in the `dominant_all` dictionary into the workspace. For this, we have to save each model to a file, and then register that file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import joblib\n",
"import os\n",
"from azureml.core import Model, Experiment, Run\n",
"\n",
"os.makedirs('models', exist_ok=True)\n",
"def register_model(name, model):\n",
" print(\"Registering \", name)\n",
" model_path = \"models/{0}.pkl\".format(name)\n",
" joblib.dump(value=model, filename=model_path)\n",
" registered_model = Model.register(model_path=model_path,\n",
" model_name=name,\n",
" workspace=ws)\n",
" print(\"Registered \", registered_model.id)\n",
" return registered_model.id\n",
"\n",
"model_name_id_mapping = dict()\n",
"for name, model in dominant_all.items():\n",
" m_id = register_model(name, model)\n",
" model_name_id_mapping[name] = m_id"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, produce new predictions dictionaries, with the updated names:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dominant_all_ids = dict()\n",
"for name, y_pred in dominant_all.items():\n",
" dominant_all_ids[model_name_id_mapping[name]] = y_pred"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Uploading a dashboard\n",
"\n",
"We create a _dashboard dictionary_ using Fairlearn's `metrics` package. The `_create_group_metric_set` method has arguments similar to the Dashboard constructor, except that the sensitive features are passed as a dictionary (to ensure that names are available), and we must specify the type of prediction. Note that we use the `dashboard_registered` dictionary we just created:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sf = { 'sex': sensitive_features_test.sex, 'race': sensitive_features_test.race }\n",
"\n",
"from fairlearn.metrics._group_metric_set import _create_group_metric_set\n",
"\n",
"dash_dict_all = _create_group_metric_set(y_true=y_test,\n",
" predictions=dominant_all_ids,\n",
" sensitive_features=sf,\n",
" prediction_type='binary_classification')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we import our `contrib` package which contains the routine to perform the upload:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.contrib.fairness import upload_dashboard_dictionary, download_dashboard_by_upload_id"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can create an Experiment, then a Run, and upload our dashboard to it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"exp = Experiment(ws, 'responsible-ai-loan-decision')\n",
"print(exp)\n",
"\n",
"run = exp.start_logging()\n",
"try:\n",
" dashboard_title = \"Upload MultiAsset from Grid Search with Census Data Notebook\"\n",
" upload_id = upload_dashboard_dictionary(run,\n",
" dash_dict_all,\n",
" dashboard_name=dashboard_title)\n",
" print(\"\\nUploaded to id: {0}\\n\".format(upload_id))\n",
"\n",
" downloaded_dict = download_dashboard_by_upload_id(run, upload_id)\n",
"finally:\n",
" run.complete()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Uploading explanations\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.interpret import ExplanationClient\n",
"\n",
"client = ExplanationClient.from_run(run)\n",
"client.upload_model_explanation(global_explanation, comment = \"census data global explanation\")"
]
}
],
"metadata": {
"authors": [
{
"name": "chgrego"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@@ -0,0 +1,12 @@
name: rai-loan-decision
dependencies:
- pip:
- azureml-sdk
- azureml-interpret
- azureml-contrib-fairness
- interpret-community[visualization]
- fairlearn==0.4.6
- matplotlib
- azureml-dataset-runtime
- ipywidgets
- raiwidgets


@@ -100,7 +100,7 @@
"\n",
"# Check core SDK version number\n",
"\n",
"print(\"This notebook was created using SDK version 1.20.0, you are currently running version\", azureml.core.VERSION)"
"print(\"This notebook was created using SDK version 1.24.0, you are currently running version\", azureml.core.VERSION)"
]
},
{


@@ -98,7 +98,7 @@
"metadata": {},
"outputs": [],
"source": [
"experiment_name = \"experiment-with-mlflow\"\n",
"experiment_name = \"LocalTrain-with-mlflow-sample\"\n",
"mlflow.set_experiment(experiment_name)"
]
},


@@ -123,7 +123,7 @@
"source": [
"from azureml.core import Experiment\n",
"\n",
"experiment_name = \"experiment-with-mlflow\"\n",
"experiment_name = \"RemoteTrain-with-mlflow-sample\"\n",
"exp = Experiment(workspace=ws, name=experiment_name)"
]
},


@@ -7,6 +7,6 @@ Follow these sample notebooks to learn:
3. [Train on remote VM](train-on-remote-vm): train a model using a remote Azure VM as compute target.
4. [Train on ML Compute](train-on-amlcompute): train a model using an ML Compute cluster as compute target.
5. [Train in an HDI Spark cluster](train-in-spark): train a Spark ML model using an HDInsight Spark cluster as compute target.
6. [Train and hyperparameter tune on Iris Dataset with Scikit-learn](train-hyperparameter-tune-deploy-with-sklearn): train a model using the Scikit-learn estimator and tune hyperparameters with Hyperdrive.
![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/training/README.png)


@@ -30,7 +30,6 @@ Machine Learning notebook samples and encourage efficient retrieval of topics an
| :star:[Azure Machine Learning Pipeline with DataTransferStep](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-data-transfer.ipynb) | Demonstrates the use of DataTransferStep | Custom | ADF | None | Azure ML | None |
| [Getting Started with Azure Machine Learning Pipelines](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-getting-started.ipynb) | Getting Started notebook for AML Pipelines | Custom | AML Compute | None | Azure ML | None |
| [Azure Machine Learning Pipeline with AzureBatchStep](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-how-to-use-azurebatch-to-run-a-windows-executable.ipynb) | Demonstrates the use of AzureBatchStep | Custom | Azure Batch | None | Azure ML | None |
-| [Azure Machine Learning Pipeline with EstimatorStep](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-how-to-use-estimatorstep.ipynb) | Demonstrates the use of EstimatorStep | Custom | AML Compute | None | Azure ML | None |
| :star:[How to use ModuleStep with AML Pipelines](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-how-to-use-modulestep.ipynb) | Demonstrates the use of ModuleStep | Custom | AML Compute | None | Azure ML | None |
| :star:[How to use Pipeline Drafts to create a Published Pipeline](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-how-to-use-pipeline-drafts.ipynb) | Demonstrates the use of Pipeline Drafts | Custom | AML Compute | None | Azure ML | None |
| :star:[Azure Machine Learning Pipeline with HyperDriveStep](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-parameter-tuning-with-hyperdrive.ipynb) | Demonstrates the use of HyperDriveStep | Custom | AML Compute | None | Azure ML | None |
@@ -43,6 +42,8 @@ Machine Learning notebook samples and encourage efficient retrieval of topics an
| :star:[How to use DatabricksStep with AML Pipelines](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-use-databricks-as-compute-target.ipynb) | Demonstrates the use of DatabricksStep | Custom | Azure Databricks | None | Azure ML, Azure Databricks | None |
| :star:[How to use KustoStep with AML Pipelines](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-use-kusto-as-compute-target.ipynb) | Demonstrates the use of KustoStep | Custom | Kusto | None | Azure ML, Kusto | None |
| :star:[How to use AutoMLStep with AML Pipelines](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-automated-machine-learning-step.ipynb) | Demonstrates the use of AutoMLStep | Custom | AML Compute | None | Automated Machine Learning | None |
+| [Azure Machine Learning Pipeline with CommandStep for R](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-commandstep-r.ipynb) | Demonstrates the use of CommandStep for running R scripts | Custom | AML Compute | None | Azure ML | None |
+| [Azure Machine Learning Pipeline with CommandStep](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-commandstep.ipynb) | Demonstrates the use of CommandStep | Custom | AML Compute | None | Azure ML | None |
| :star:[Azure Machine Learning Pipelines with Data Dependency](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-data-dependency-steps.ipynb) | Demonstrates how to construct a Pipeline with data dependency between steps | Custom | AML Compute | None | Azure ML | None |
| [How to use run a notebook as a step in AML Pipelines](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-notebook-runner-step.ipynb) | Demonstrates the use of NotebookRunnerStep | Custom | AML Compute | None | Azure ML | None |
| [Use MLflow with Azure Machine Learning to Train and Deploy Keras Image Classifier](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/ml-frameworks/using-mlflow/train-and-deploy-keras-auto-logging/train-and-deploy-keras-auto-logging.ipynb) | Use MLflow with Azure Machine Learning to Train and Deploy Keras Image Classifier, leveraging MLflow auto logging | MNIST | Local, AML Compute | Azure Container Instance | Keras | mlflow, keras |
@@ -59,8 +60,8 @@ Machine Learning notebook samples and encourage efficient retrieval of topics an
| [Train a model with hyperparameter tuning](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/ml-frameworks/chainer/train-hyperparameter-tune-deploy-with-chainer/train-hyperparameter-tune-deploy-with-chainer.ipynb) | Train a Convolutional Neural Network (CNN) | MNIST | AML Compute | Azure Container Instance | Chainer | None |
| [Train a model with a custom Docker image](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/ml-frameworks/fastai/fastai-with-custom-docker/fastai-with-custom-docker.ipynb) | Train with custom Docker image | Oxford IIIT Pet | AML Compute | None | Pytorch | None |
| [Train a DNN using hyperparameter tuning and deploying with Keras](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/ml-frameworks/keras/train-hyperparameter-tune-deploy-with-keras/train-hyperparameter-tune-deploy-with-keras.ipynb) | Create a multi-class classifier | MNIST | AML Compute | Azure Container Instance | TensorFlow | None |
+| [Distributed training with PyTorch](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/ml-frameworks/pytorch/distributed-pytorch-with-distributeddataparallel/distributed-pytorch-with-distributeddataparallel.ipynb) | Train a model using distributed training via PyTorch DistributedDataParallel | CIFAR-10 | AML Compute | None | PyTorch | None |
| [Distributed PyTorch](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/ml-frameworks/pytorch/distributed-pytorch-with-horovod/distributed-pytorch-with-horovod.ipynb) | Train a model using the distributed training via Horovod | MNIST | AML Compute | None | PyTorch | None |
-| [Distributed training with PyTorch](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/ml-frameworks/pytorch/distributed-pytorch-with-nccl-gloo/distributed-pytorch-with-nccl-gloo.ipynb) | Train a model using distributed training via Nccl/Gloo | MNIST | AML Compute | None | PyTorch | None |
| [Training with hyperparameter tuning using PyTorch](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/ml-frameworks/pytorch/train-hyperparameter-tune-deploy-with-pytorch/train-hyperparameter-tune-deploy-with-pytorch.ipynb) | Train an image classification model using transfer learning with the PyTorch estimator | ImageNet | AML Compute | Azure Container Instance | PyTorch | None |
| [Training and hyperparameter tuning with Scikit-learn](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/ml-frameworks/scikit-learn/train-hyperparameter-tune-deploy-with-sklearn/train-hyperparameter-tune-deploy-with-sklearn.ipynb) | Train a support vector machine (SVM) to perform classification | Iris | AML Compute | None | Scikit-learn | None |
| [Distributed training using TensorFlow with Horovod](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/ml-frameworks/tensorflow/distributed-tensorflow-with-horovod/distributed-tensorflow-with-horovod.ipynb) | Use the TensorFlow estimator to train a word2vec model | None | AML Compute | None | TensorFlow | None |
@@ -126,8 +127,8 @@ Machine Learning notebook samples and encourage efficient retrieval of topics an
| [pong_rllib](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/reinforcement-learning/atari-on-distributed-compute/pong_rllib.ipynb) | | | | | | |
| [cartpole_ci](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/reinforcement-learning/cartpole-on-compute-instance/cartpole_ci.ipynb) | | | | | | |
| [cartpole_sc](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/reinforcement-learning/cartpole-on-single-compute/cartpole_sc.ipynb) | | | | | | |
| [minecraft](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/reinforcement-learning/minecraft-on-distributed-compute/minecraft.ipynb) | | | | | | |
| [particle](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/reinforcement-learning/multiagent-particle-envs/particle.ipynb) | | | | | | |
| [rai-loan-decision](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/responsible-ai/visualize-upload-loan-decision/rai-loan-decision.ipynb) | | | | | | |
| [Logging APIs](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/track-and-monitor-experiments/logging-api/logging-api.ipynb) | Logging APIs and analyzing results | None | None | None | None | None |
| [configuration](https://github.com/Azure/MachineLearningNotebooks/blob/master//setup-environment/configuration.ipynb) | | | | | | |
| [tutorial-1st-experiment-sdk-train](https://github.com/Azure/MachineLearningNotebooks/blob/master//tutorials/create-first-ml-experiment/tutorial-1st-experiment-sdk-train.ipynb) | | | | | | |

Some files were not shown because too many files have changed in this diff Show More