{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-automated-machine-learning-step.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Azure Machine Learning Pipeline with AutoMLStep\n",
"This notebook demonstrates the use of AutoMLStep in Azure Machine Learning Pipeline."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"In this example we showcase how you can use AzureML Dataset to load data for AutoML via AML Pipeline. \n",
"\n",
"If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook, please also take a look at the [Automated ML setup-using-a-local-conda-environment](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning#setup-using-a-local-conda-environment) section to setup the environment.\n",
"\n",
"In this notebook you will learn how to:\n",
"1. Create an `Experiment` in an existing `Workspace`.\n",
"2. Create or Attach existing AmlCompute to a workspace.\n",
"3. Define data loading in a `TabularDataset`.\n",
"4. Configure AutoML using `AutoMLConfig`.\n",
"5. Use AutoMLStep\n",
"6. Train the model using AmlCompute\n",
"7. Explore the results.\n",
"8. Test the best fitted model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Azure Machine Learning and Pipeline SDK-specific imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"import os\n",
"import csv\n",
"\n",
"from matplotlib import pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn import datasets\n",
"import pkg_resources\n",
"\n",
"import azureml.core\n",
"from azureml.core.experiment import Experiment\n",
"from azureml.core.workspace import Workspace\n",
"from azureml.train.automl import AutoMLConfig\n",
"from azureml.core.dataset import Dataset\n",
"\n",
"from azureml.pipeline.steps import AutoMLStep\n",
"\n",
"# Check core SDK version number\n",
"print(\"SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Workspace\n",
"Initialize a workspace object from persisted configuration. Make sure the config file is present at .\\config.json"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create an Azure ML experiment\n",
"Let's create an experiment named \"automlstep-sample\" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.\n",
"\n",
"The best practice is to use separate folders for scripts and its dependent files for each step and specify that folder as the `source_directory` for the step. This helps reduce the size of the snapshot created for the step (only the specific folder is snapshotted). Since changes in any files in the `source_directory` would trigger a re-upload of the snapshot, this helps keep the reuse of the step when there are no changes in the `source_directory` of the step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Choose a name for the run history container in the workspace.\n",
"experiment_name = 'automlstep-sample'\n",
"project_folder = './project'\n",
"\n",
"experiment = Experiment(ws, experiment_name)\n",
"experiment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create or Attach an AmlCompute cluster\n",
"You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you get the default `AmlCompute` as your training compute resource.\n",
"\n",
"> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import AmlCompute\n",
"from azureml.core.compute import ComputeTarget\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# Choose a name for your CPU cluster\n",
"amlcompute_cluster_name = \"cpu-cluster\"\n",
"\n",
"# Verify that cluster does not exist already\n",
"try:\n",
" compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)\n",
" print('Found existing cluster, use it.')\n",
"except ComputeTargetException:\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',# for GPU, use \"STANDARD_NC6\"\n",
" #vm_priority = 'lowpriority', # optional\n",
" max_nodes=4)\n",
" compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)\n",
"\n",
"compute_target.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 10)\n",
"# For a more detailed view of current AmlCompute status, use get_status()."
]
},
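{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the comment above suggests, `get_status()` gives a more detailed view of the cluster. The cell below is an optional illustration of that call."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect detailed provisioning state and node counts for the cluster\n",
"print(compute_target.get_status().serialize())"
]
},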
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Try to load the dataset from the Workspace. Otherwise, create it from the file\n",
"found = False\n",
"key = \"Crime-Dataset\"\n",
"description_text = \"Crime Dataset (used in the the aml-pipelines-with-automated-machine-learning-step.ipynb notebook)\"\n",
"\n",
"if key in ws.datasets.keys(): \n",
" found = True\n",
" dataset = ws.datasets[key] \n",
"\n",
"if not found:\n",
" # Create AML Dataset and register it into Workspace\n",
" # The data referenced here was a 1MB simple random sample of the Chicago Crime data into a local temporary directory.\n",
" example_data = 'https://dprepdata.blob.core.windows.net/demo/crime0-random.csv'\n",
" dataset = Dataset.Tabular.from_delimited_files(example_data)\n",
" dataset = dataset.drop_columns(['FBI Code'])\n",
" \n",
" #Register Dataset in Workspace\n",
" dataset = dataset.register(workspace=ws,\n",
" name=key,\n",
" description=description_text)\n",
"\n",
"\n",
"df = dataset.to_pandas_dataframe()\n",
"df.describe()"
]
},
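{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since this notebook trains a classifier on the `Primary Type` column, it can be useful to glance at the label distribution first. The cell below is an optional sketch of that check."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect how the classification labels are distributed\n",
"df['Primary Type'].value_counts()"
]
},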
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Review the Dataset Result\n",
"\n",
"You can peek the result of a TabularDataset at any range using `skip(i)` and `take(j).to_pandas_dataframe()`. Doing so evaluates only `j` records for all the steps in the TabularDataset, which makes it fast even against large datasets.\n",
"\n",
"`TabularDataset` objects are composed of a list of transformation steps (optional)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset.take(5).to_pandas_dataframe()"
]
},
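{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional illustration of the `skip(i)`/`take(j)` pattern described above, the cell below previews a window in the middle of the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: skip the first 5 records and preview the next 5\n",
"dataset.skip(5).take(5).to_pandas_dataframe()"
]
},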
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train\n",
"This creates a general AutoML settings object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_settings = {\n",
" \"experiment_timeout_minutes\": 20,\n",
" \"max_concurrent_iterations\": 4,\n",
" \"primary_metric\" : 'AUC_weighted'\n",
"}\n",
"automl_config = AutoMLConfig(compute_target=compute_target,\n",
" task = \"classification\",\n",
" training_data=dataset,\n",
" label_column_name=\"Primary Type\", \n",
" path = project_folder,\n",
" enable_early_stopping= True,\n",
" featurization= 'auto',\n",
" debug_log = \"automl_errors.log\",\n",
" **automl_settings\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create Pipeline and AutoMLStep\n",
"\n",
"You can define outputs for the AutoMLStep using TrainingOutput."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.pipeline.core import PipelineData, TrainingOutput\n",
"\n",
"ds = ws.get_default_datastore()\n",
"metrics_output_name = 'metrics_output'\n",
"best_model_output_name = 'best_model_output'\n",
"\n",
"metrics_data = PipelineData(name='metrics_data',\n",
" datastore=ds,\n",
" pipeline_output_name=metrics_output_name,\n",
" training_output=TrainingOutput(type='Metrics'))\n",
"model_data = PipelineData(name='model_data',\n",
" datastore=ds,\n",
" pipeline_output_name=best_model_output_name,\n",
" training_output=TrainingOutput(type='Model'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create an AutoMLStep."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"automlstep-remarks-sample1"
]
},
"outputs": [],
"source": [
"automl_step = AutoMLStep(\n",
" name='automl_module',\n",
" automl_config=automl_config,\n",
" outputs=[metrics_data, model_data],\n",
" allow_reuse=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"automlstep-remarks-sample2"
]
},
"outputs": [],
"source": [
"from azureml.pipeline.core import Pipeline\n",
"pipeline = Pipeline(\n",
" description=\"pipeline_with_automlstep\",\n",
" workspace=ws, \n",
" steps=[automl_step])"
]
},
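{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can validate the pipeline graph before submitting it; `validate()` returns a list of any errors found, which helps catch configuration problems early."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: validate the pipeline graph before submission; returns a list of any errors found\n",
"errors = pipeline.validate()\n",
"print('Validation errors:', errors)"
]
},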
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline_run = experiment.submit(pipeline)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##PUBLISHONLY\n",
"#from azureml.widgets import RunDetails\n",
"#RunDetails(pipeline_run).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##PUBLISHONLY\n",
"#pipeline_run.wait_for_completion()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Examine Results\n",
"\n",
"### Retrieve the metrics of all child runs\n",
"Outputs of above run can be used as inputs of other steps in pipeline. In this tutorial, we will examine the outputs by retrieve output data and running some tests."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##PUBLISHONLY\n",
"#metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)\n",
"#num_file_downloaded = metrics_output.download('.', show_progress=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##PUBLISHONLY\n",
"#import json\n",
"#with open(metrics_output._path_on_datastore) as f:\n",
"# metrics_output_result = f.read()\n",
"# \n",
"#deserialized_metrics_output = json.loads(metrics_output_result)\n",
"#df = pd.DataFrame(deserialized_metrics_output)\n",
"#df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve the Best Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##PUBLISHONLY\n",
"## Retrieve best model from Pipeline Run\n",
"#best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)\n",
"#num_file_downloaded = best_model_output.download('.', show_progress=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##PUBLISHONLY\n",
"#import pickle\n",
"\n",
"#with open(best_model_output._path_on_datastore, \"rb\" ) as f:\n",
"# best_model = pickle.load(f)\n",
"#best_model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##PUBLISHONLY\n",
"#best_model.steps"
]
},
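{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to keep the retrieved model for later use, you can register it in the workspace. The cell below is a sketch of that step; the model name `automlstep-best-model` is an arbitrary choice, and the path assumes the download above completed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##PUBLISHONLY\n",
"## Optional sketch: register the downloaded best model in the workspace\n",
"#from azureml.core.model import Model\n",
"#model = Model.register(workspace=ws,\n",
"#                       model_path=best_model_output._path_on_datastore,\n",
"#                       model_name='automlstep-best-model')  # hypothetical name\n",
"#print(model.name, model.version)"
]
},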
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test the Model\n",
"#### Load Test Data\n",
"For the test data, it should have the same preparation step as the train data. Otherwise it might get failed at the preprocessing step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##PUBLISHONLY\n",
"#dataset_test = Dataset.Tabular.from_delimited_files(path='https://dprepdata.blob.core.windows.net/demo/crime0-test.csv')\n",
"#df_test = dataset_test.to_pandas_dataframe()\n",
"#df_test = df_test[pd.notnull(df_test['Primary Type'])]\n",
"#\n",
"#y_test = df_test['Primary Type']\n",
"#X_test = df_test.drop(['Primary Type', 'FBI Code'], axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Testing Our Best Fitted Model\n",
"\n",
"We will use confusion matrix to see how our model works."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##PUBLISHONLY\n",
"#from sklearn.metrics import confusion_matrix\n",
"#ypred = best_model.predict(X_test)\n",
"#cm = confusion_matrix(y_test, ypred)"
]
},
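{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alongside the confusion matrix visualization below, a single summary number can be handy. The cell below is an optional sketch that computes overall accuracy with scikit-learn."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##PUBLISHONLY\n",
"## Optional sketch: summarize performance with overall accuracy\n",
"#from sklearn.metrics import accuracy_score\n",
"#print('Accuracy:', accuracy_score(y_test, ypred))"
]
},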
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##PUBLISHONLY\n",
"## Visualize the confusion matrix\n",
"#pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)"
]
}
],
"metadata": {
"authors": [
{
"name": "anshirga"
}
],
"category": "tutorial",
"compute": [
"AML Compute"
],
"datasets": [
"Custom"
],
"deployment": [
"None"
],
"exclude_from_index": false,
"framework": [
"Automated Machine Learning"
],
"friendly_name": "How to use AutoMLStep with AML Pipelines",
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"order_index": 11,
"star_tag": [
"featured"
],
"tags": [
"None"
],
"task": "Demonstrates the use of AutoMLStep"
},
"nbformat": 4,
"nbformat_minor": 2
}