MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/experimental/autofeaturization-codegen/codegen-for-autofeaturization.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved.\n",
        "\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/classification-credit-card-fraud/custom-model-training-from-autofeaturization-run.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Automated Machine Learning - Codegen for AutoFeaturization \n",
        "_**Autofeaturization of credit card fraudulent transactions dataset on remote compute and codegen functionality**_\n",
        "\n",
        "## Contents\n",
        "1. [Introduction](#Introduction)\n",
        "1. [Setup](#Setup)\n",
        "1. [Data](#Data)\n",
        "1. [Autofeaturization](#Autofeaturization)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id='Introduction'></a>\n",
        "## Introduction"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Autofeaturization** lets you run an AutoML experiment to only featurize the datasets. These datasets along with the transformer are stored in AML Storage and linked to the run which can later be retrieved and used to train models. \n",
        "\n",
        "**To run Autofeaturization, set the number of iterations to zero and featurization as auto.**\n",
        "\n",
        "Please refer to [Autofeaturization and custom model training](../autofeaturization-custom-model-training/custom-model-training-from-autofeaturization-run.ipynb) for more details on the same.\n",
        "\n",
        "[Codegen](https://github.com/Azure/automl-codegen-preview) is a feature, which when enabled, provides a user with the script of the underlying functionality and a notebook to tweak inputs or code and rerun the same.\n",
        "\n",
        "In this example we use the credit card fraudulent transactions dataset to showcase how you can use AutoML for autofeaturization and further how you can enable the `Codegen` feature.\n",
        "\n",
        "This notebook is using remote compute to complete the featurization.\n",
        "\n",
        "If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the [configuration](../../configuration.ipynb) notebook first if you haven't already, to establish your connection to the AzureML Workspace. \n",
        "\n",
        "Here you will learn how to create an autofeaturization experiment using an existing workspace with codegen feature enabled."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id='Setup'></a>\n",
        "## Setup\n",
        "\n",
        "As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import logging\n",
        "import pandas as pd\n",
        "import azureml.core\n",
        "from azureml.core.experiment import Experiment\n",
        "from azureml.core.workspace import Workspace\n",
        "from azureml.core.dataset import Dataset\n",
        "from azureml.train.automl import AutoMLConfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This sample notebook may use features that are not available in previous versions of the Azure ML SDK."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print(\"This notebook was created using version 1.55.0 of the Azure ML SDK\")\n",
        "print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "\n",
        "# choose a name for experiment\n",
        "experiment_name = 'automl-autofeaturization-ccard-codegen-remote'\n",
        "\n",
        "experiment=Experiment(ws, experiment_name)\n",
        "\n",
        "output = {}\n",
        "output['Subscription ID'] = ws.subscription_id\n",
        "output['Workspace'] = ws.name\n",
        "output['Resource Group'] = ws.resource_group\n",
        "output['Location'] = ws.location\n",
        "output['Experiment Name'] = experiment.name\n",
        "outputDf = pd.DataFrame(data = output, index = [''])\n",
        "outputDf.T"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Create or Attach existing AmlCompute\n",
        "A compute target is required to execute the Automated ML run. In this tutorial, you create AmlCompute as your training compute resource.\n",
        "\n",
        "> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.\n",
        "\n",
        "#### Creation of AmlCompute takes approximately 5 minutes. \n",
        "If the AmlCompute with that name is already in your workspace this code will skip the creation process.\n",
        "As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.compute import ComputeTarget, AmlCompute\n",
        "from azureml.core.compute_target import ComputeTargetException\n",
        "\n",
        "# Choose a name for your CPU cluster\n",
        "cpu_cluster_name = \"cpu-cluster\"\n",
        "\n",
        "# Verify that cluster does not exist already\n",
        "try:\n",
        "    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)\n",
        "    print('Found existing cluster, use it.')\n",
        "except ComputeTargetException:\n",
        "    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',\n",
        "                                                           max_nodes=6)\n",
        "    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)\n",
        "\n",
        "compute_target.wait_for_completion(show_output=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id='Data'></a>\n",
        "## Data"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Load Data\n",
        "\n",
        "Load the credit card fraudulent transactions dataset from a CSV file, containing both training features and labels. The features are inputs to the model, while the training labels represent the expected output of the model. \n",
        "\n",
        "Here the autofeaturization run will featurize the training data passed in."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "##### Training Dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "training_data = \"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard_train.csv\"\n",
        "training_dataset = Dataset.Tabular.from_delimited_files(training_data) # Tabular dataset\n",
        "\n",
        "label_column_name = 'Class' # output label"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id='Autofeaturization'></a>\n",
        "## AutoFeaturization\n",
        "\n",
        "Instantiate an AutoMLConfig object. This defines the settings and data used to run the autofeaturization experiment.\n",
        "\n",
        "|Property|Description|\n",
        "|-|-|\n",
        "|**task**|classification or regression or forecasting|\n",
        "|**training_data**|Input training dataset, containing both features and label column.|\n",
        "|**iterations**|For an autofeaturization run, iterations will be 0.|\n",
        "|**featurization**|For an autofeaturization run, featurization can be 'auto' or 'custom'.|\n",
        "|**label_column_name**|The name of the label column.|\n",
        "|**enable_code_generation**|For enabling codegen for the run, value would be True|\n",
        "\n",
        "**_You can find more information about primary metrics_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#primary-metric)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "automl_config = AutoMLConfig(task = 'classification',\n",
        "                             debug_log = 'automl_errors.log',\n",
        "                             iterations = 0, # autofeaturization run can be triggered by setting iterations to 0\n",
        "                             compute_target = compute_target,\n",
        "                             training_data = training_dataset,\n",
        "                             label_column_name = label_column_name,\n",
        "                             featurization = 'auto',\n",
        "                             verbosity = logging.INFO,\n",
        "                             enable_code_generation = True # enable codegen\n",
        "                            )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Call the `submit` method on the experiment object and pass the run configuration. Depending on the data this can run for a while. Validation errors and current status will be shown when setting `show_output=True` and the execution will be synchronous."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "remote_run = experiment.submit(automl_config, show_output = False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Results"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Widget for Monitoring Runs\n",
        "\n",
        "The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n",
        "\n",
        "**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.widgets import RunDetails\n",
        "RunDetails(remote_run).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "remote_run.wait_for_completion(show_output=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Codegen Script and Notebook"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Codegen script and notebook can be found under the `Outputs + logs` section from the details page of the remote run. Please check for the `autofeaturization_notebook.ipynb` under `/outputs/generated_code`. To modify the featurization code, open `script.py` and make changes. The codegen notebook can be run with the same environment configuration as the above AutoML run."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Experiment Complete!"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "bhavanatumma"
      }
    ],
    "interpreter": {
      "hash": "adb464b67752e4577e3dc163235ced27038d19b7d88def00d75d1975bde5d9ab"
    },
    "kernelspec": {
      "display_name": "Python 3.8 - AzureML",
      "language": "python",
      "name": "python38-azureml"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}