Update README.md

Merge pull request #1228 from Azure/release_update/Release-76
update samples from Release-76 as a part of SDK release
2020-11-03 17:16:28 -08:00 · 2020-11-03 16:12:15 -08:00 · 2020-11-03 22:31:02 +00:00 · 2020-11-03 11:25:10 -08:00 · 2020-11-03 11:17:41 -08:00 · 2020-11-03 11:11:11 -08:00
10 changed files with 834 additions and 66 deletions
--- a/README.md
+++ b/README.md
@@ -1,5 +1,7 @@
 # Azure Machine Learning service example notebooks

+> A community-driven repository of examples using MLflow for tracking can be found at https://github.com/Azure/azureml-examples
+
 This repository contains example notebooks demonstrating the [Azure Machine Learning](https://azure.microsoft.com/en-us/services/machine-learning-service/) Python SDK which allows you to build, train, deploy and manage machine learning solutions using Azure.  The AML SDK allows you the choice of using local or cloud compute resources, while managing and maintaining the complete data science workflow from the cloud.

 ![Azure ML Workflow](https://raw.githubusercontent.com/MicrosoftDocs/azure-docs/master/articles/machine-learning/media/concept-azure-machine-learning-architecture/workflow.png)
--- a/how-to-use-azureml/automated-machine-learning/README.md
+++ b/how-to-use-azureml/automated-machine-learning/README.md
@@ -106,52 +106,87 @@ jupyter notebook
 <a name="samples"></a>
 # Automated ML SDK Sample Notebooks

- [auto-ml-classification-credit-card-fraud.ipynb](classification-credit-card-fraud/auto-ml-classification-credit-card-fraud.ipynb)
-    - Dataset: Kaggle's [credit card fraud detection dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud)
-    - Simple example of using automated ML for classification to fraudulent credit card transactions
-    - Uses azure compute for training
+## Classification
+- **Classify Credit Card Fraud**
+    - Dataset: [Kaggle's credit card fraud detection dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud)
+      - **[Jupyter Notebook (remote run)](classification-credit-card-fraud/auto-ml-classification-credit-card-fraud.ipynb)**
+          - run the experiment remotely on AML Compute cluster
+          - test the performance of the best model in the local environment
+      - **[Jupyter Notebook (local run)](local-run-classification-credit-card-fraud/auto-ml-classification-credit-card-fraud-local.ipynb)**
+          - run experiment in the local environment
+          - use Mimic Explainer for computing feature importance
+          - deploy the best model along with the explainer to an Azure Kubernetes Service (AKS) cluster, which will compute the raw and engineered feature importances at inference time
+- **Predict Term Deposit Subscriptions in a Bank**
+    - Dataset: [UCI's bank marketing dataset](https://www.kaggle.com/janiobachmann/bank-marketing-dataset)
+        - **[Jupyter Notebook](classification-bank-marketing-all-features/auto-ml-classification-bank-marketing-all-features.ipynb)**
+          - run experiment remotely on AML Compute cluster to generate ONNX compatible models
+          - view the featurization steps that were applied during training
+          - view feature importance for the best model
+          - download the best model in ONNX format and use it for inferencing using ONNXRuntime
+          - deploy the best model in PKL format to Azure Container Instance (ACI)
+- **Predict Newsgroup based on Text from News Article**
+    - Dataset: [20 newsgroups text dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)
+        - **[Jupyter Notebook](classification-text-dnn/auto-ml-classification-text-dnn.ipynb)**
+          - AutoML highlights here include using deep neural networks (DNNs) to create embedded features from text data
+          - AutoML will use Bidirectional Encoder Representations from Transformers (BERT) when a GPU compute is used
+          - a Bidirectional Long Short-Term Memory network (BiLSTM) will be used when a CPU compute is used, thereby optimizing the choice of DNN

- [auto-ml-regression.ipynb](regression/auto-ml-regression.ipynb)
+## Regression
+- **Predict Performance of Hardware Parts**
    - Dataset: Hardware Performance Dataset
-    - Simple example of using automated ML for regression
-    - Uses azure compute for training
+        - **[Jupyter Notebook](regression/auto-ml-regression.ipynb)**
+            - run the experiment remotely on AML Compute cluster
+            - get best trained model for a different metric than the one the experiment was optimized for
+            - test the performance of the best model in the local environment
+        - **[Jupyter Notebook (advanced)](regression-explanation-featurization/auto-ml-regression-explanation-featurization.ipynb)**
+            - run the experiment remotely on AML Compute cluster
+            - customize featurization: override column purpose within the dataset, configure transformer parameters
+            - get best trained model for a different metric than the one the experiment was optimized for
+            - run a model explanation experiment on the remote cluster
+            - deploy the model along with the explainer and run online inferencing

- [auto-ml-regression-explanation-featurization.ipynb](regression-explanation-featurization/auto-ml-regression-explanation-featurization.ipynb)
-    - Dataset: Hardware Performance Dataset
-    - Shows featurization and excplanation
-    - Uses azure compute for training
-
- [auto-ml-forecasting-energy-demand.ipynb](forecasting-energy-demand/auto-ml-forecasting-energy-demand.ipynb)
-    - Dataset: [NYC energy demand data](forecasting-a/nyc_energy.csv)
-    - Example of using automated ML for training a forecasting model
-
- [auto-ml-classification-credit-card-fraud-local.ipynb](local-run-classification-credit-card-fraud/auto-ml-classification-credit-card-fraud-local.ipynb)
-    - Dataset: Kaggle's [credit card fraud detection dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud)
-    - Simple example of using automated ML for classification to fraudulent credit card transactions
-    - Uses local compute for training
-
- [auto-ml-classification-bank-marketing-all-features.ipynb](classification-bank-marketing-all-features/auto-ml-classification-bank-marketing-all-features.ipynb)
-    - Dataset: UCI's [bank marketing dataset](https://www.kaggle.com/janiobachmann/bank-marketing-dataset)
-    - Simple example of using automated ML for classification to predict term deposit subscriptions for a bank
-    - Uses azure compute for training
-
- [auto-ml-forecasting-orange-juice-sales.ipynb](forecasting-orange-juice-sales/auto-ml-forecasting-orange-juice-sales.ipynb)
-    - Dataset: [Dominick's grocery sales of orange juice](forecasting-b/dominicks_OJ.csv)
-    - Example of training an automated ML forecasting model on multiple time-series
-
- [auto-ml-forecasting-bike-share.ipynb](forecasting-bike-share/auto-ml-forecasting-bike-share.ipynb)
-    - Dataset: forecasting for a bike-sharing
-    - Example of training an automated ML forecasting model on multiple time-series
-
- [auto-ml-forecasting-function.ipynb](forecasting-forecast-function/auto-ml-forecasting-function.ipynb)
-    - Example of training an automated ML forecasting model on multiple time-series
-
- [auto-ml-forecasting-beer-remote.ipynb](forecasting-beer-remote/auto-ml-forecasting-beer-remote.ipynb)
-    - Example of training an automated ML forecasting model on multiple time-series
-    - Beer Production Forecasting
-
- [auto-ml-continuous-retraining.ipynb](continuous-retraining/auto-ml-continuous-retraining.ipynb)
-    - Continuous retraining using Pipelines and Time-Series TabularDataset
+## Time Series Forecasting
+- **Forecast Energy Demand**
+    - Dataset: [NYC energy demand data](http://mis.nyiso.com/public/P-58Blist.htm)
+        - **[Jupyter Notebook](forecasting-energy-demand/auto-ml-forecasting-energy-demand.ipynb)**
+          - run experiment remotely on AML Compute cluster
+          - use lags and rolling window features
+          - view the featurization steps that were applied during training
+          - get the best model, use it to forecast on test data and compare the accuracy of predictions against real data
+- **Forecast Orange Juice Sales (Multi-Series)**
+    - Dataset: [Dominick's grocery sales of orange juice](forecasting-orange-juice-sales/dominicks_OJ.csv)
+        - **[Jupyter Notebook](forecasting-orange-juice-sales/auto-ml-forecasting-orange-juice-sales.ipynb)**
+          - run experiment remotely on AML Compute cluster
+          - customize time-series featurization, change column purpose and override transformer hyperparameters
+          - evaluate locally the performance of the generated best model
+          - deploy the best model as a webservice on Azure Container Instance (ACI)
+          - get online predictions from the deployed model
+- **Forecast Demand of a Bike-Sharing Service**
+    - Dataset: [Bike demand data](forecasting-bike-share/bike-no.csv)
+        - **[Jupyter Notebook](forecasting-bike-share/auto-ml-forecasting-bike-share.ipynb)**
+          - run experiment remotely on AML Compute cluster
+          - integrate holiday features
+          - run rolling forecast for test set that is longer than the forecast horizon
+          - compute metrics on the predictions from the remote forecast
+- **The Forecast Function Interface**
+    - Dataset: Generated for sample purposes
+        - **[Jupyter Notebook](forecasting-forecast-function/auto-ml-forecasting-function.ipynb)**
+          - train a forecaster using a remote AML Compute cluster
+          - capabilities of forecast function (e.g. forecast farther into the horizon)
+          - generate confidence intervals
+- **Forecast Beverage Production**
+    - Dataset: [Monthly beer production data](forecasting-beer-remote/Beer_no_valid_split_train.csv)
+        - **[Jupyter Notebook](forecasting-beer-remote/auto-ml-forecasting-beer-remote.ipynb)**
+          - train using a remote AML Compute cluster
+          - enable the DNN learning model
+          - forecast on a remote compute cluster and compare different model performance
+- **Continuous Retraining with NOAA Weather Data**
+    - Dataset: [NOAA weather data from Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/)
+        - **[Jupyter Notebook](continuous-retraining/auto-ml-continuous-retraining.ipynb)**
+          - continuously retrain a model using Pipelines and AutoML
+          - create a Pipeline to upload a time series dataset to an Azure blob
+          - create a Pipeline to run an AutoML experiment and register the best resulting model in the Workspace
+          - publish the training pipeline created and schedule it to run daily

 <a name="documentation"></a>
 See [Configure automated machine learning experiments](https://docs.microsoft.com/azure/machine-learning/service/how-to-configure-auto-train) to learn how more about the the settings and features available for automated machine learning experiments.
--- a/how-to-use-azureml/automated-machine-learning/classification-text-dnn/auto-ml-classification-text-dnn.ipynb
+++ b/how-to-use-azureml/automated-machine-learning/classification-text-dnn/auto-ml-classification-text-dnn.ipynb
@@ -0,0 +1,592 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
+        "\n",
+        "Licensed under the MIT License."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/classification-text-dnn/auto-ml-classification-text-dnn.png)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# Automated Machine Learning\n",
+        "_**Text Classification Using Deep Learning**_\n",
+        "\n",
+        "## Contents\n",
+        "1. [Introduction](#Introduction)\n",
+        "1. [Setup](#Setup)\n",
+        "1. [Data](#Data)\n",
+        "1. [Train](#Train)\n",
+        "1. [Evaluate](#Evaluate)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Introduction\n",
+        "This notebook demonstrates classification with text data using deep learning in AutoML.\n",
+        "\n",
+        "AutoML highlights here include using deep neural networks (DNNs) to create embedded features from text data. Depending on the compute cluster the user provides, AutoML tries Bidirectional Encoder Representations from Transformers (BERT) when a GPU compute is used, and a Bidirectional Long Short-Term Memory network (BiLSTM) when a CPU compute is used, thereby optimizing the choice of DNN for the user's setup.\n",
+        "\n",
+        "Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n",
+        "\n",
+        "An Enterprise workspace is required for this notebook. To learn more about creating an Enterprise workspace or upgrading to an Enterprise workspace from the Azure portal, please visit our [Workspace page](https://docs.microsoft.com/azure/machine-learning/service/concept-workspace#upgrade).\n",
+        "\n",
+        "Notebook synopsis:\n",
+        "1. Creating an Experiment in an existing Workspace\n",
+        "2. Configuration and remote run of AutoML for a text dataset (20 Newsgroups dataset from scikit-learn) for classification\n",
+        "3. Registering the best model for future use\n",
+        "4. Evaluating the final model on a test set"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Setup"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import logging\n",
+        "import os\n",
+        "import shutil\n",
+        "\n",
+        "import pandas as pd\n",
+        "\n",
+        "import azureml.core\n",
+        "from azureml.core.experiment import Experiment\n",
+        "from azureml.core.workspace import Workspace\n",
+        "from azureml.core.dataset import Dataset\n",
+        "from azureml.core.compute import AmlCompute\n",
+        "from azureml.core.compute import ComputeTarget\n",
+        "from azureml.core.run import Run\n",
+        "from azureml.widgets import RunDetails\n",
+        "from azureml.core.model import Model \n",
+        "from helper import run_inference, get_result_df\n",
+        "from azureml.train.automl import AutoMLConfig\n",
+        "from sklearn.datasets import fetch_20newsgroups"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "This sample notebook may use features that are not available in previous versions of the Azure ML SDK."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "print(\"This notebook was created using version 1.17.0 of the Azure ML SDK\")\n",
+        "print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "As part of the setup you have already created a <b>Workspace</b>. To run AutoML, you also need to create an <b>Experiment</b>. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "ws = Workspace.from_config()\n",
+        "\n",
+        "# Choose an experiment name.\n",
+        "experiment_name = 'automl-classification-text-dnn'\n",
+        "\n",
+        "experiment = Experiment(ws, experiment_name)\n",
+        "\n",
+        "output = {}\n",
+        "output['Subscription ID'] = ws.subscription_id\n",
+        "output['Workspace Name'] = ws.name\n",
+        "output['Resource Group'] = ws.resource_group\n",
+        "output['Location'] = ws.location\n",
+        "output['Experiment Name'] = experiment.name\n",
+        "pd.set_option('display.max_colwidth', -1)\n",
+        "outputDf = pd.DataFrame(data = output, index = [''])\n",
+        "outputDf.T"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Set up a compute cluster\n",
+        "This section uses a user-provided compute cluster (named \"dnntext-cluster\" in this example). If a cluster with this name does not exist in the user's workspace, the code below creates a new cluster. You can choose the parameters of the cluster as mentioned in the comments.\n",
+        "\n",
+        "Whether you provide/select a CPU or GPU cluster, AutoML will choose the appropriate DNN for that setup - BiLSTM or BERT text featurizer will be included in the candidate featurizers on CPU and GPU respectively.  If your goal is to obtain the most accurate model, we recommend you use GPU clusters since BERT featurizers usually outperform BiLSTM featurizers."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "from azureml.core.compute import ComputeTarget, AmlCompute\n",
+        "from azureml.core.compute_target import ComputeTargetException\n",
+        "\n",
+        "num_nodes = 2\n",
+        "\n",
+        "# Choose a name for your cluster.\n",
+        "amlcompute_cluster_name = \"dnntext-cluster\"\n",
+        "\n",
+        "# Verify that cluster does not exist already\n",
+        "try:\n",
+        "    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)\n",
+        "    print('Found existing cluster, use it.')\n",
+        "except ComputeTargetException:\n",
+        "    compute_config = AmlCompute.provisioning_configuration(vm_size = \"STANDARD_NC6\", # CPU for BiLSTM, such as \"STANDARD_D2_V2\" \n",
+        "                                                           # To use BERT (this is recommended for best performance), select a GPU such as \"STANDARD_NC6\" \n",
+        "                                                           # or similar GPU option\n",
+        "                                                           # available in your workspace\n",
+        "                                                           max_nodes = num_nodes)\n",
+        "    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)\n",
+        "\n",
+        "compute_target.wait_for_completion(show_output=True)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### Get data\n",
+        "For this notebook we will use 20 Newsgroups data from scikit-learn. We filter the data to contain four classes and take a sample as training data. Please note that more data is needed to improve accuracy. This notebook provides a small-data example that you can use as a template for your own, larger datasets."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "data_dir = \"text-dnn-data\" # Local directory to store data\n",
+        "blobstore_datadir = data_dir # Blob store directory to store data in\n",
+        "target_column_name = 'y'\n",
+        "feature_column_name = 'X'\n",
+        "\n",
+        "def get_20newsgroups_data():\n",
+        "    '''Fetches 20 Newsgroups data from scikit-learn\n",
+        "       Returns them in form of pandas dataframes\n",
+        "    '''\n",
+        "    remove = ('headers', 'footers', 'quotes')\n",
+        "    categories = [\n",
+        "        'rec.sport.baseball',\n",
+        "        'rec.sport.hockey',\n",
+        "        'comp.graphics',\n",
+        "        'sci.space',\n",
+        "        ]\n",
+        "\n",
+        "    data = fetch_20newsgroups(subset = 'train', categories = categories,\n",
+        "                                    shuffle = True, random_state = 42,\n",
+        "                                    remove = remove)\n",
+        "    data = pd.DataFrame({feature_column_name: data.data, target_column_name: data.target})\n",
+        "\n",
+        "    data_train = data[:200]\n",
+        "    data_test = data[200:300]    \n",
+        "\n",
+        "    data_train = remove_blanks_20news(data_train, feature_column_name, target_column_name)\n",
+        "    data_test = remove_blanks_20news(data_test, feature_column_name, target_column_name)\n",
+        "    \n",
+        "    return data_train, data_test\n",
+        "    \n",
+        "def remove_blanks_20news(data, feature_column_name, target_column_name):\n",
+        "    \n",
+        "    data[feature_column_name] = data[feature_column_name].replace(r'\\n', ' ', regex=True).apply(lambda x: x.strip())\n",
+        "    data = data[data[feature_column_name] != '']\n",
+        "    \n",
+        "    return data"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "#### Fetch data and upload to datastore for use in training"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "data_train, data_test = get_20newsgroups_data()\n",
+        "\n",
+        "if not os.path.isdir(data_dir):\n",
+        "    os.mkdir(data_dir)\n",
+        "    \n",
+        "train_data_fname = data_dir + '/train_data.csv'\n",
+        "test_data_fname = data_dir + '/test_data.csv'\n",
+        "\n",
+        "data_train.to_csv(train_data_fname, index=False)\n",
+        "data_test.to_csv(test_data_fname, index=False)\n",
+        "\n",
+        "datastore = ws.get_default_datastore()\n",
+        "datastore.upload(src_dir=data_dir, target_path=blobstore_datadir,\n",
+        "                    overwrite=True)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "train_dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, blobstore_datadir + '/train_data.csv')])"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### Prepare AutoML run"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "This step requires an Enterprise workspace to gain access to this feature. To learn more about creating an Enterprise workspace or upgrading to an Enterprise workspace from the Azure portal, please visit our [Workspace page](https://docs.microsoft.com/azure/machine-learning/service/concept-workspace#upgrade).\n",
+        "\n",
+        "This notebook uses the blocked_models parameter to exclude some models that can take a longer time to train on some text datasets. You can choose to remove models from the blocked_models list, but you may need to increase the experiment_timeout_minutes parameter value to get results."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "automl_settings = {\n",
+        "    \"experiment_timeout_minutes\": 20,\n",
+        "    \"primary_metric\": 'accuracy',\n",
+        "    \"max_concurrent_iterations\": num_nodes, \n",
+        "    \"max_cores_per_iteration\": -1,\n",
+        "    \"enable_dnn\": True,\n",
+        "    \"enable_early_stopping\": True,\n",
+        "    \"validation_size\": 0.3,\n",
+        "    \"verbosity\": logging.INFO,\n",
+        "    \"enable_voting_ensemble\": False,\n",
+        "    \"enable_stack_ensemble\": False,\n",
+        "}\n",
+        "\n",
+        "automl_config = AutoMLConfig(task = 'classification',\n",
+        "                             debug_log = 'automl_errors.log',\n",
+        "                             compute_target=compute_target,\n",
+        "                             training_data=train_dataset,\n",
+        "                             label_column_name=target_column_name,\n",
+        "                             blocked_models = ['LightGBM'],\n",
+        "                             **automl_settings\n",
+        "                            )"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "#### Submit AutoML Run"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "automl_run = experiment.submit(automl_config, show_output=True)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "automl_run"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Displaying the run objects gives you links to the visual tools in the Azure Portal. Go try them!"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### Retrieve the Best Model\n",
+        "Below we select the best model pipeline from our iterations and evaluate it on the test data, using the same compute cluster."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "You can test the model locally to get a feel of the input/output. When the model contains BERT, this step will require pytorch and pytorch-transformers installed in your local environment. The exact versions of these packages can be found in the **automl_env.yml** file located in the local copy of your MachineLearningNotebooks folder here:\n",
+        "MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/automl_env.yml"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "best_run, fitted_model = automl_run.get_output()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "You can now see what text transformations are used to convert text data to features for this dataset, including deep learning transformations based on BiLSTM or Transformer (BERT is one implementation of a Transformer) models."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "text_transformations_used = []\n",
+        "for column_group in fitted_model.named_steps['datatransformer'].get_featurization_summary():\n",
+        "    text_transformations_used.extend(column_group['Transformations'])\n",
+        "text_transformations_used"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### Registering the best model\n",
+        "We now register the best fitted model from the AutoML Run for use in future deployments.  "
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Get results stats, extract the best model from AutoML run, download and register the resultant best model"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "summary_df = get_result_df(automl_run)\n",
+        "best_dnn_run_id = summary_df['run_id'].iloc[0]\n",
+        "best_dnn_run = Run(experiment, best_dnn_run_id)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "model_dir = 'Model' # Local folder where the model will be stored temporarily\n",
+        "if not os.path.isdir(model_dir):\n",
+        "    os.mkdir(model_dir)\n",
+        "    \n",
+        "best_dnn_run.download_file('outputs/model.pkl', model_dir + '/model.pkl')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Register the model in your Azure Machine Learning Workspace. If you previously registered a model, please make sure to delete it so as to replace it with this new model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Register the model\n",
+        "model_name = 'textDNN-20News'\n",
+        "model = Model.register(model_path = model_dir + '/model.pkl',\n",
+        "                       model_name = model_name,\n",
+        "                       tags=None,\n",
+        "                       workspace=ws)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Evaluate on Test Data"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "We now use the best fitted model from the AutoML Run to make predictions on the test set.  \n",
+        "\n",
+        "Test set schema should match that of the training set."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "test_dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, blobstore_datadir + '/test_data.csv')])\n",
+        "\n",
+        "# preview the first 3 rows of the dataset\n",
+        "test_dataset.take(3).to_pandas_dataframe()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "test_experiment = Experiment(ws, experiment_name + \"_test\")"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "script_folder = os.path.join(os.getcwd(), 'inference')\n",
+        "os.makedirs(script_folder, exist_ok=True)\n",
+        "shutil.copy('infer.py', script_folder)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "test_run = run_inference(test_experiment, compute_target, script_folder, best_dnn_run,\n",
+        "                         train_dataset, test_dataset, target_column_name, model_name)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Display computed metrics"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "test_run"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "RunDetails(test_run).show()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "test_run.wait_for_completion()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "pd.Series(test_run.get_metrics())"
+      ]
+    }
+  ],
+  "metadata": {
+    "authors": [
+      {
+        "name": "anshirga"
+      }
+    ],
+    "compute": [
+      "AML Compute"
+    ],
+    "datasets": [
+      "None"
+    ],
+    "deployment": [
+      "None"
+    ],
+    "exclude_from_index": false,
+    "framework": [
+      "None"
+    ],
+    "friendly_name": "DNN Text Featurization",
+    "index_order": 2,
+    "kernelspec": {
+      "display_name": "Python 3.6",
+      "language": "python",
+      "name": "python36"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.6.7"
+    },
+    "tags": [
+      "None"
+    ],
+    "task": "Text featurization using DNNs for classification"
+  },
+  "nbformat": 4,
+  "nbformat_minor": 2
+}
--- a/how-to-use-azureml/automated-machine-learning/classification-text-dnn/auto-ml-classification-text-dnn.yml
+++ b/how-to-use-azureml/automated-machine-learning/classification-text-dnn/auto-ml-classification-text-dnn.yml
@@ -0,0 +1,4 @@
+name: auto-ml-classification-text-dnn
+dependencies:
+- pip:
+  - azureml-sdk
--- a/how-to-use-azureml/automated-machine-learning/classification-text-dnn/helper.py
+++ b/how-to-use-azureml/automated-machine-learning/classification-text-dnn/helper.py
@@ -0,0 +1,56 @@
+import pandas as pd
+from azureml.core import Environment
+from azureml.train.estimator import Estimator
+from azureml.core.run import Run
+
+
+def run_inference(test_experiment, compute_target, script_folder, train_run,
+                  train_dataset, test_dataset, target_column_name, model_name):
+
+    inference_env = train_run.get_environment()
+
+    est = Estimator(source_directory=script_folder,
+                    entry_script='infer.py',
+                    script_params={
+                        '--target_column_name': target_column_name,
+                        '--model_name': model_name
+                    },
+                    inputs=[
+                        train_dataset.as_named_input('train_data'),
+                        test_dataset.as_named_input('test_data')
+                    ],
+                    compute_target=compute_target,
+                    environment_definition=inference_env)
+
+    run = test_experiment.submit(
+        est, tags={
+            'training_run_id': train_run.id,
+            'run_algorithm': train_run.properties['run_algorithm'],
+            'valid_score': train_run.properties['score'],
+            'primary_metric': train_run.properties['primary_metric']
+        })
+
+    run.log("run_algorithm", run.tags['run_algorithm'])
+    return run
+
+
+def get_result_df(remote_run):
+
+    children = list(remote_run.get_children(recursive=True))
+    summary_df = pd.DataFrame(index=['run_id', 'run_algorithm',
+                                     'primary_metric', 'Score'])
+    goal_minimize = False
+    for run in children:
+        if('run_algorithm' in run.properties and 'score' in run.properties):
+            summary_df[run.id] = [run.id, run.properties['run_algorithm'],
+                                  run.properties['primary_metric'],
+                                  float(run.properties['score'])]
+            if('goal' in run.properties):
+                goal_minimize = run.properties['goal'].split('_')[-1] == 'min'
+
+    summary_df = summary_df.T.sort_values(
+        'Score',
+        ascending=goal_minimize).drop_duplicates(['run_algorithm'])
+    summary_df = summary_df.set_index('run_algorithm')
+
+    return summary_df
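Review note on `get_result_df`: the function keeps one row per algorithm by sorting on the score in the direction given by the `goal` property and then dropping duplicate algorithm names, so the first (best) row for each algorithm survives. The sketch below reproduces that selection in plain Python so it can be checked without a workspace. The run records, their IDs, and the helper name `best_run_per_algorithm` are hypothetical stand-ins, not part of the PR.

```python
# Hypothetical records standing in for the `properties` dicts of child runs.
runs = [
    {"run_id": "r1", "run_algorithm": "LightGBM",
     "primary_metric": "accuracy", "score": "0.91", "goal": "accuracy_max"},
    {"run_id": "r2", "run_algorithm": "LightGBM",
     "primary_metric": "accuracy", "score": "0.94", "goal": "accuracy_max"},
    {"run_id": "r3", "run_algorithm": "LogisticRegression",
     "primary_metric": "accuracy", "score": "0.88", "goal": "accuracy_max"},
    {"run_id": "r4"},  # a run without a score is skipped, as in the helper
]


def best_run_per_algorithm(records):
    """Return the best-scoring record for each algorithm."""
    goal_minimize = False
    scored = []
    for rec in records:
        if "run_algorithm" in rec and "score" in rec:
            scored.append({**rec, "score": float(rec["score"])})
            if "goal" in rec:
                # A goal such as 'accuracy_max' ends in 'max' or 'min'.
                goal_minimize = rec["goal"].split("_")[-1] == "min"

    # Ascending when minimizing, descending when maximizing.
    scored.sort(key=lambda r: r["score"], reverse=not goal_minimize)

    best = {}
    for rec in scored:
        # Keep only the first (best) record seen for each algorithm.
        best.setdefault(rec["run_algorithm"], rec)
    return best


summary = best_run_per_algorithm(runs)
for algorithm, rec in summary.items():
    print(algorithm, rec["run_id"], rec["score"])
```

As in the helper, the sort direction comes from whichever scored run is read last, which assumes all child runs share the same goal.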
--- a/how-to-use-azureml/automated-machine-learning/classification-text-dnn/infer.py
+++ b/how-to-use-azureml/automated-machine-learning/classification-text-dnn/infer.py
@@ -0,0 +1,60 @@
+import argparse
+
+import numpy as np
+
+from sklearn.externals import joblib
+
+from azureml.automl.runtime.shared.score import scoring, constants
+from azureml.core import Run
+from azureml.core.model import Model
+
+
+parser = argparse.ArgumentParser()
+parser.add_argument(
+    '--target_column_name', type=str, dest='target_column_name',
+    help='Target Column Name')
+parser.add_argument(
+    '--model_name', type=str, dest='model_name',
+    help='Name of registered model')
+
+args = parser.parse_args()
+target_column_name = args.target_column_name
+model_name = args.model_name
+
+print('args passed are: ')
+print('Target column name: ', target_column_name)
+print('Name of registered model: ', model_name)
+
+model_path = Model.get_model_path(model_name)
+# deserialize the model file back into a sklearn model
+model = joblib.load(model_path)
+
+run = Run.get_context()
+# get input dataset by name
+test_dataset = run.input_datasets['test_data']
+train_dataset = run.input_datasets['train_data']
+
+X_test_df = test_dataset.drop_columns(columns=[target_column_name]) \
+                        .to_pandas_dataframe()
+y_test_df = test_dataset.with_timestamp_columns(None) \
+                        .keep_columns(columns=[target_column_name]) \
+                        .to_pandas_dataframe()
+y_train_df = train_dataset.with_timestamp_columns(None) \
+                          .keep_columns(columns=[target_column_name]) \
+                          .to_pandas_dataframe()
+
+predicted = model.predict_proba(X_test_df)
+
+# Use the AutoML scoring module
+class_labels = np.unique(np.concatenate((y_train_df.values, y_test_df.values)))
+train_labels = model.classes_
+classification_metrics = list(constants.CLASSIFICATION_SCALAR_SET)
+scores = scoring.score_classification(y_test_df.values, predicted,
+                                      classification_metrics,
+                                      class_labels, train_labels)
+
+print("scores:")
+print(scores)
+
+for key, value in scores.items():
+    run.log(key, value)
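Review note on the argument handling in `infer.py`: the script reads `--target_column_name` and `--model_name`, which `run_inference` supplies through `script_params`. A minimal sketch for checking the parser locally follows. It assumes the dictionary reaches the script as alternating flag and value tokens, and the values `'y'` and `'textDNN'` are hypothetical placeholders.

```python
import argparse


def build_parser():
    """Build the same two-argument parser that infer.py defines."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--target_column_name', type=str, dest='target_column_name',
        help='Target Column Name')
    parser.add_argument(
        '--model_name', type=str, dest='model_name',
        help='Name of registered model')
    return parser


# Hypothetical values; the real ones come from the notebook.
script_params = {'--target_column_name': 'y', '--model_name': 'textDNN'}

# Flatten {'--flag': 'value'} into ['--flag', 'value', ...] (an assumption
# about how the parameters arrive on the command line).
argv = [token for pair in script_params.items() for token in pair]

args = build_parser().parse_args(argv)
print(args.target_column_name, args.model_name)
```

Passing an explicit list to `parse_args` keeps the check independent of `sys.argv`, so it can run inside a notebook cell.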
--- a/how-to-use-azureml/automated-machine-learning/forecasting-beer-remote/infer.py
+++ b/how-to-use-azureml/automated-machine-learning/forecasting-beer-remote/infer.py
@@ -1,4 +1,5 @@
 import argparse
+import os

 import numpy as np
 import pandas as pd
@@ -10,6 +11,13 @@ from sklearn.metrics import mean_absolute_error, mean_squared_error
 from azureml.automl.runtime.shared.score import scoring, constants
 from azureml.core import Run

+try:
+    import torch
+
+    _torch_present = True
+except ImportError:
+    _torch_present = False
+

 def align_outputs(y_predicted, X_trans, X_test, y_test,
                  predicted_column_name='predicted',
@@ -48,7 +56,7 @@ def align_outputs(y_predicted, X_trans, X_test, y_test,
    # or at edges of time due to lags/rolling windows
    clean = together[together[[target_column_name,
                               predicted_column_name]].notnull().all(axis=1)]
-    return(clean)
+    return clean


 def do_rolling_forecast_with_lookback(fitted_model, X_test, y_test,
@@ -83,8 +91,7 @@ def do_rolling_forecast_with_lookback(fitted_model, X_test, y_test,
        if origin_time != X[time_column_name].min():
            # Set the context by including actuals up-to the origin time
            test_context_expand_wind = (X[time_column_name] < origin_time)
-            context_expand_wind = (
-                X_test_expand[time_column_name] < origin_time)
+            context_expand_wind = (X_test_expand[time_column_name] < origin_time)
            y_query_expand[context_expand_wind] = y[test_context_expand_wind]

        # Print some debug info
@@ -115,8 +122,7 @@ def do_rolling_forecast_with_lookback(fitted_model, X_test, y_test,
        # Align forecast with test set for dates within
        # the current rolling window
        trans_tindex = X_trans.index.get_level_values(time_column_name)
-        trans_roll_wind = (trans_tindex >= origin_time) & (
-            trans_tindex < horizon_time)
+        trans_roll_wind = (trans_tindex >= origin_time) & (trans_tindex < horizon_time)
        test_roll_wind = expand_wind & (X[time_column_name] >= origin_time)
        df_list.append(align_outputs(
            y_fcst[trans_roll_wind], X_trans[trans_roll_wind],
@@ -155,8 +161,7 @@ def do_rolling_forecast(fitted_model, X_test, y_test, max_horizon, freq='D'):
        if origin_time != X_test[time_column_name].min():
            # Set the context by including actuals up-to the origin time
            test_context_expand_wind = (X_test[time_column_name] < origin_time)
-            context_expand_wind = (
-                X_test_expand[time_column_name] < origin_time)
+            context_expand_wind = (X_test_expand[time_column_name] < origin_time)
            y_query_expand[context_expand_wind] = y_test[
                test_context_expand_wind]

@@ -186,10 +191,8 @@ def do_rolling_forecast(fitted_model, X_test, y_test, max_horizon, freq='D'):
        # Align forecast with test set for dates within the
        # current rolling window
        trans_tindex = X_trans.index.get_level_values(time_column_name)
-        trans_roll_wind = (trans_tindex >= origin_time) & (
-            trans_tindex < horizon_time)
-        test_roll_wind = expand_wind & (
-            X_test[time_column_name] >= origin_time)
+        trans_roll_wind = (trans_tindex >= origin_time) & (trans_tindex < horizon_time)
+        test_roll_wind = expand_wind & (X_test[time_column_name] >= origin_time)
        df_list.append(align_outputs(y_fcst[trans_roll_wind],
                                     X_trans[trans_roll_wind],
                                     X_test[test_roll_wind],
@@ -221,6 +224,10 @@ def MAPE(actual, pred):
    return np.mean(APE(actual_safe, pred_safe))


+def map_location_cuda(storage, loc):
+    return storage.cuda()
+
+
 parser = argparse.ArgumentParser()
 parser.add_argument(
    '--max_horizon', type=int, dest='max_horizon',
@@ -238,7 +245,6 @@ parser.add_argument(
    '--model_path', type=str, dest='model_path',
    default='model.pkl', help='Filename of model to be loaded')

-
 args = parser.parse_args()
 max_horizon = args.max_horizon
 target_column_name = args.target_column_name
@@ -246,7 +252,6 @@ time_column_name = args.time_column_name
 freq = args.freq
 model_path = args.model_path

-
 print('args passed are: ')
 print(max_horizon)
 print(target_column_name)
@@ -274,8 +279,19 @@ X_lookback_df = lookback_dataset.drop_columns(columns=[target_column_name])
 y_lookback_df = lookback_dataset.with_timestamp_columns(
    None).keep_columns(columns=[target_column_name])

-fitted_model = joblib.load(model_path)
-
+_, ext = os.path.splitext(model_path)
+if ext == '.pt':
+    # Load the fc-tcn torch model.
+    assert _torch_present, 'torch is required to load a .pt model'
+    if torch.cuda.is_available():
+        map_location = map_location_cuda
+    else:
+        map_location = 'cpu'
+    with open(model_path, 'rb') as fh:
+        fitted_model = torch.load(fh, map_location=map_location)
+else:
+    # Load the sklearn pipeline.
+    fitted_model = joblib.load(model_path)

 if hasattr(fitted_model, 'get_lookback'):
    lookback = fitted_model.get_lookback()
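Review note on the model-loading branch: the change picks a loader from the file extension and guards the torch path with an `assert`. The sketch below isolates that decision so it can be tested without torch or joblib installed. The function name `loader_kind` and its `torch_present` parameter are hypothetical, and the explicit exception is a suggested alternative to the `assert`, not what the PR does.

```python
import os


def loader_kind(model_path, torch_present):
    """Return which loader a model file needs, based on its extension."""
    _, ext = os.path.splitext(model_path)
    if ext == '.pt':
        # An explicit exception still fires when Python runs with -O,
        # which strips assert statements.
        if not torch_present:
            raise RuntimeError(
                'torch is required to load {}'.format(model_path))
        return 'torch'
    # Anything else is treated as a pickled sklearn pipeline.
    return 'joblib'


print(loader_kind('model.pkl', torch_present=False))
print(loader_kind('outputs/model.pt', torch_present=True))
```

Keeping the dispatch separate from the actual `torch.load` and `joblib.load` calls makes the extension rule easy to unit test.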
--- a/how-to-use-azureml/machine-learning-pipelines/nyc-taxi-data-regression-model-building/nyc-taxi-data-regression-model-building.ipynb
+++ b/how-to-use-azureml/machine-learning-pipelines/nyc-taxi-data-regression-model-building/nyc-taxi-data-regression-model-building.ipynb
@@ -460,8 +460,8 @@
        "    name=\"Merge Taxi Data\",\n",
        "    script_name=\"merge.py\", \n",
        "    arguments=[\"--output_merge\", merged_data],\n",
-        "    inputs=[cleansed_green_data.parse_parquet_files(file_extension=None),\n",
-        "            cleansed_yellow_data.parse_parquet_files(file_extension=None)],\n",
+        "    inputs=[cleansed_green_data.parse_parquet_files(),\n",
+        "            cleansed_yellow_data.parse_parquet_files()],\n",
        "    outputs=[merged_data],\n",
        "    compute_target=aml_compute,\n",
        "    runconfig=aml_run_config,\n",
@@ -497,7 +497,7 @@
        "    name=\"Filter Taxi Data\",\n",
        "    script_name=\"filter.py\", \n",
        "    arguments=[\"--output_filter\", filtered_data],\n",
-        "    inputs=[merged_data.parse_parquet_files(file_extension=None)],\n",
+        "    inputs=[merged_data.parse_parquet_files()],\n",
        "    outputs=[filtered_data],\n",
        "    compute_target=aml_compute,\n",
        "    runconfig = aml_run_config,\n",
@@ -533,7 +533,7 @@
        "    name=\"Normalize Taxi Data\",\n",
        "    script_name=\"normalize.py\", \n",
        "    arguments=[\"--output_normalize\", normalized_data],\n",
-        "    inputs=[filtered_data.parse_parquet_files(file_extension=None)],\n",
+        "    inputs=[filtered_data.parse_parquet_files()],\n",
        "    outputs=[normalized_data],\n",
        "    compute_target=aml_compute,\n",
        "    runconfig = aml_run_config,\n",
@@ -574,7 +574,7 @@
        "    name=\"Transform Taxi Data\",\n",
        "    script_name=\"transform.py\", \n",
        "    arguments=[\"--output_transform\", transformed_data],\n",
-        "    inputs=[normalized_data.parse_parquet_files(file_extension=None)],\n",
+        "    inputs=[normalized_data.parse_parquet_files()],\n",
        "    outputs=[transformed_data],\n",
        "    compute_target=aml_compute,\n",
        "    runconfig = aml_run_config,\n",
@@ -614,7 +614,7 @@
        "    script_name=\"train_test_split.py\", \n",
        "    arguments=[\"--output_split_train\", output_split_train,\n",
        "               \"--output_split_test\", output_split_test],\n",
-        "    inputs=[transformed_data.parse_parquet_files(file_extension=None)],\n",
+        "    inputs=[transformed_data.parse_parquet_files()],\n",
        "    outputs=[output_split_train, output_split_test],\n",
        "    compute_target=aml_compute,\n",
        "    runconfig = aml_run_config,\n",
@@ -690,7 +690,7 @@
        "    \"n_cross_validations\": 5\n",
        "}\n",
        "\n",
-        "training_dataset = output_split_train.parse_parquet_files(file_extension=None).keep_columns(['pickup_weekday','pickup_hour', 'distance','passengers', 'vendor', 'cost'])\n",
+        "training_dataset = output_split_train.parse_parquet_files().keep_columns(['pickup_weekday','pickup_hour', 'distance','passengers', 'vendor', 'cost'])\n",
        "\n",
        "automl_config = AutoMLConfig(task = 'regression',\n",
        "                             debug_log = 'automated_ml_errors.log',\n",
--- a/how-to-use-azureml/ml-frameworks/keras/train-hyperparameter-tune-deploy-with-keras/train-hyperparameter-tune-deploy-with-keras.ipynb
+++ b/how-to-use-azureml/ml-frameworks/keras/train-hyperparameter-tune-deploy-with-keras/train-hyperparameter-tune-deploy-with-keras.ipynb
@@ -429,7 +429,8 @@
        "dependencies:\n",
        "- python=3.6.2\n",
        "- pip:\n",
-        "  - azureml-defaults==1.13.0\n",
+        "  - h5py<=2.10.0\n",
+        "  - azureml-defaults\n",
        "  - tensorflow-gpu==2.0.0\n",
        "  - keras<=2.3.1\n",
        "  - matplotlib"
@@ -981,6 +982,7 @@
        "\n",
        "cd = CondaDependencies.create()\n",
        "cd.add_tensorflow_conda_package()\n",
+        "cd.add_conda_package('h5py<=2.10.0')\n",
        "cd.add_conda_package('keras<=2.3.1')\n",
        "cd.add_pip_package(\"azureml-defaults\")\n",
        "cd.save_to_file(base_directory='./', conda_file_path='myenv.yml')\n",
--- a/index.md
+++ b/index.md
@@ -97,6 +97,7 @@ Machine Learning notebook samples and encourage efficient retrieval of topics an
 ## Other Notebooks
 |Title| Task | Dataset | Training Compute | Deployment Target | ML Framework | Tags |
 |:----|:-----|:-------:|:----------------:|:-----------------:|:------------:|:------------:|
+| [DNN Text Featurization](https://github.com/Azure/MachineLearningNotebooks/blob/master//how-to-use-azureml/automated-machine-learning/classification-text-dnn/auto-ml-classification-text-dnn.ipynb) | Text featurization using DNNs for classification | None | AML Compute | None | None | None |
 | [configuration](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) |  |  |  |  |  |  |
 | [fairlearn-azureml-mitigation](https://github.com/Azure/MachineLearningNotebooks/blob/master//contrib/fairness/fairlearn-azureml-mitigation.ipynb) |  |  |  |  |  |  |
 | [upload-fairness-dashboard](https://github.com/Azure/MachineLearningNotebooks/blob/master//contrib/fairness/upload-fairness-dashboard.ipynb) |  |  |  |  |  |  |
Author	SHA1	Message	Date
Cody	ba741fb18d	Update README.md	2020-11-03 17:16:28 -08:00
Harneet Virk	ac0ad8d487	Merge pull request #1228 from Azure/release_update/Release-76 update samples from Release-76 as a part of SDK release	2020-11-03 16:12:15 -08:00
amlrelsa-ms	5019ad6c5a	update samples from Release-76 as a part of SDK release	2020-11-03 22:31:02 +00:00
Cody	41a2ebd2b3	Merge pull request #1226 from Azure/lostmygithubaccount-patch-3 Update README.md	2020-11-03 11:25:10 -08:00
Cody	53e3283d1d	Update README.md	2020-11-03 11:17:41 -08:00
Harneet Virk	ba9c4c5465	Merge pull request #1225 from Azure/release_update/Release-75 update samples from Release-75 as a part of SDK release	2020-11-03 11:11:11 -08:00
amlrelsa-ms	a6c65f00ec	update samples from Release-75 as a part of SDK release	2020-11-03 19:07:12 +00:00
Cody	95072eabc2	Merge pull request #1221 from Azure/lostmygithubaccount-patch-2 Update README.md	2020-11-02 11:52:05 -08:00
Cody	12905ef254	Update README.md	2020-11-02 06:59:44 -08:00
Harneet Virk	4cf56eee91	Merge pull request #1217 from Azure/release_update/Release-74 update samples from Release-74 as a part of SDK release	2020-10-30 17:27:02 -07:00
amlrelsa-ms	d345ff6c37	update samples from Release-74 as a part of SDK release	2020-10-30 22:20:10 +00:00
Harneet Virk	560dcac0a0	Merge pull request #1214 from Azure/release_update/Release-73 update samples from Release-73 as a part of SDK release	2020-10-29 23:38:02 -07:00
amlrelsa-ms	322087a58c	update samples from Release-73 as a part of SDK release	2020-10-30 06:37:05 +00:00
Harneet Virk	e255c000ab	Merge pull request #1211 from Azure/release_update/Release-72 update samples from Release-72 as a part of SDK release	2020-10-28 14:30:50 -07:00
amlrelsa-ms	7871e37ec0	update samples from Release-72 as a part of SDK release	2020-10-28 21:24:40 +00:00
Cody	58e584e7eb	Update README.md (#1209 )	2020-10-27 21:00:38 -04:00