{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Azure Machine Learning Pipeline with AutoMLStep\n",
    "This notebook demonstrates the use of AutoMLStep in an Azure Machine Learning Pipeline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "In this example we use scikit-learn's [digits dataset](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) to showcase how you can use AutoML for a simple classification problem.\n",
    "\n",
    "If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n",
    "\n",
    "In this notebook you will:\n",
    "1. Create an `Experiment` in an existing `Workspace`.\n",
    "2. Create or attach existing AmlCompute to a workspace.\n",
    "3. Configure AutoML using `AutoMLConfig`.\n",
    "4. Use `AutoMLStep` in a pipeline.\n",
    "5. Train the model using AmlCompute.\n",
    "6. Explore the results.\n",
    "7. Test the best fitted model.\n",
    "\n",
    "In addition, this notebook showcases the following features:\n",
    "- **Parallel** executions for iterations\n",
    "- **Asynchronous** tracking of progress\n",
    "- Retrieving models for any iteration or logged metric\n",
    "- Specifying AutoML settings as `**kwargs`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Azure Machine Learning and Pipeline SDK-specific imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "import os\n",
    "import csv\n",
    "\n",
    "from matplotlib import pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn import datasets\n",
    "\n",
    "import azureml.core\n",
    "from azureml.core.experiment import Experiment\n",
    "from azureml.core.workspace import Workspace\n",
    "from azureml.train.automl import AutoMLConfig\n",
    "from azureml.core.compute import AmlCompute\n",
    "from azureml.core.compute import ComputeTarget\n",
    "from azureml.core.runconfig import RunConfiguration\n",
    "from azureml.core.conda_dependencies import CondaDependencies\n",
    "\n",
    "from azureml.train.automl import AutoMLStep\n",
    "\n",
    "# Check core SDK version number\n",
    "print(\"SDK version:\", azureml.core.VERSION)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialize Workspace\n",
    "Initialize a workspace object from persisted configuration. Make sure the config file is present at `.\\config.json`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ws = Workspace.from_config()\n",
    "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create an Azure ML experiment\n",
    "Let's create an experiment named \"automlstep-classification\" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose a name for the run history container in the workspace.\n",
    "experiment_name = 'automlstep-classification'\n",
    "project_folder = './project'\n",
    "\n",
    "experiment = Experiment(ws, experiment_name)\n",
    "experiment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create or Attach existing AmlCompute\n",
    "You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you get the default `AmlCompute` as your training compute resource."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "compute_target = ws.get_default_compute_target(\"CPU\")"
   ]
  },
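  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If your workspace has no default compute target, you can create or attach a cluster explicitly instead. The sketch below uses the standard `ComputeTarget`/`AmlCompute` pattern; the cluster name and VM size are illustrative assumptions, not values from this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional alternative: create or attach a named AmlCompute cluster.\n",
    "# 'cpu-cluster' and the VM size below are illustrative assumptions.\n",
    "from azureml.core.compute_target import ComputeTargetException\n",
    "\n",
    "cluster_name = 'cpu-cluster'\n",
    "try:\n",
    "    # Reuse the cluster if it already exists in the workspace.\n",
    "    compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
    "    print('Found existing cluster, using it.')\n",
    "except ComputeTargetException:\n",
    "    # Otherwise provision a small CPU cluster.\n",
    "    config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',\n",
    "                                                   max_nodes=4)\n",
    "    compute_target = ComputeTarget.create(ws, cluster_name, config)\n",
    "    compute_target.wait_for_completion(show_output=True)"
   ]
  },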
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare and Point to Data\n",
    "For remote executions, you need to make the data accessible from the remote compute.\n",
    "This can be done by uploading the data to a datastore.\n",
    "In this example, we upload scikit-learn's [load_digits](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train = datasets.load_digits()\n",
    "\n",
    "if not os.path.isdir('data'):\n",
    "    os.mkdir('data')\n",
    "\n",
    "if not os.path.exists(project_folder):\n",
    "    os.makedirs(project_folder)\n",
    "\n",
    "pd.DataFrame(data_train.data).to_csv(\"data/X_train.tsv\", index=False, header=False, quoting=csv.QUOTE_ALL, sep=\"\\t\")\n",
    "pd.DataFrame(data_train.target).to_csv(\"data/y_train.tsv\", index=False, header=False, sep=\"\\t\")\n",
    "\n",
    "ds = ws.get_default_datastore()\n",
    "ds.upload(src_dir='./data', target_path='bai_data', overwrite=True, show_progress=True)\n",
    "\n",
    "from azureml.data.data_reference import DataReference\n",
    "input_data = DataReference(datastore=ds,\n",
    "                           data_reference_name=\"input_data_reference\",\n",
    "                           path_on_datastore='bai_data',\n",
    "                           mode='download',\n",
    "                           path_on_compute='/tmp/azureml_runs',\n",
    "                           overwrite=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a new RunConfiguration object.\n",
    "conda_run_config = RunConfiguration(framework=\"python\")\n",
    "\n",
    "# The compute target is passed to AutoMLConfig below, so it is not set here.\n",
    "# conda_run_config.target = compute_target\n",
    "\n",
    "conda_run_config.environment.docker.enabled = True\n",
    "conda_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE\n",
    "\n",
    "cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'],\n",
    "                              conda_packages=['numpy', 'py-xgboost'],\n",
    "                              pin_sdk_version=False)\n",
    "conda_run_config.environment.python.conda_dependencies = cd\n",
    "\n",
    "print('run config is ready')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile $project_folder/get_data.py\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "def get_data():\n",
    "    # Paths match path_on_compute and path_on_datastore of the DataReference above.\n",
    "    X_train = pd.read_csv(\"/tmp/azureml_runs/bai_data/X_train.tsv\", delimiter=\"\\t\", header=None, quotechar='\"')\n",
    "    y_train = pd.read_csv(\"/tmp/azureml_runs/bai_data/y_train.tsv\", delimiter=\"\\t\", header=None, quotechar='\"')\n",
    "\n",
    "    return {\"X\": X_train.values, \"y\": y_train.values.flatten()}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up AutoMLConfig for Training\n",
    "\n",
    "You can specify `automl_settings` as `**kwargs` as well. Also note that you can use a `get_data()` function for local executions too.\n",
    "\n",
    "**Note:** When using AmlCompute, you can't pass Numpy arrays directly to the fit method.\n",
    "\n",
    "|Property|Description|\n",
    "|-|-|\n",
    "|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|\n",
    "|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|\n",
    "|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n",
    "|**n_cross_validations**|Number of cross-validation splits.|\n",
    "|**max_concurrent_iterations**|Maximum number of iterations executed in parallel. This should be at most the maximum number of nodes in your AmlCompute cluster.|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "automl_settings = {\n",
    "    \"iteration_timeout_minutes\": 5,\n",
    "    \"iterations\": 20,\n",
    "    \"n_cross_validations\": 5,\n",
    "    \"primary_metric\": 'AUC_weighted',\n",
    "    \"preprocess\": False,\n",
    "    \"max_concurrent_iterations\": 3,\n",
    "    \"verbosity\": logging.INFO\n",
    "}\n",
    "automl_config = AutoMLConfig(task='classification',\n",
    "                             debug_log='automl_errors.log',\n",
    "                             path=project_folder,\n",
    "                             compute_target=compute_target,\n",
    "                             run_configuration=conda_run_config,\n",
    "                             data_script=project_folder + \"/get_data.py\",\n",
    "                             **automl_settings)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Later in this notebook you will call the `submit` method on the experiment object, passing in the pipeline. For remote runs the execution is asynchronous, so you will see the iterations get populated as they complete. You can interact with the widgets and models even while the experiment is running, to retrieve the best model up to that point. Once you are satisfied with the model, you can cancel a particular iteration or the whole run."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define AutoMLStep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.pipeline.core import PipelineData, TrainingOutput\n",
    "\n",
    "metrics_output_name = 'metrics_output'\n",
    "best_model_output_name = 'best_model_output'\n",
    "\n",
    "metrics_data = PipelineData(name='metrics_data',\n",
    "                            datastore=ds,\n",
    "                            pipeline_output_name=metrics_output_name,\n",
    "                            training_output=TrainingOutput(type='Metrics'))\n",
    "model_data = PipelineData(name='model_data',\n",
    "                          datastore=ds,\n",
    "                          pipeline_output_name=best_model_output_name,\n",
    "                          training_output=TrainingOutput(type='Model'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "automl_step = AutoMLStep(\n",
    "    name='automl_module',\n",
    "    experiment=experiment,\n",
    "    automl_config=automl_config,\n",
    "    inputs=[input_data],\n",
    "    outputs=[metrics_data, model_data],\n",
    "    allow_reuse=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.pipeline.core import Pipeline\n",
    "pipeline = Pipeline(\n",
    "    description=\"pipeline_with_automlstep\",\n",
    "    workspace=ws,\n",
    "    steps=[automl_step])"
   ]
  },
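  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before submitting, you can optionally validate the pipeline; below is a minimal sketch using `Pipeline.validate()`, which surfaces graph errors such as unconnected inputs before anything is launched."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: validate the pipeline graph and report any errors found.\n",
    "errors = pipeline.validate()\n",
    "print('Validation errors:', errors)"
   ]
  },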
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline_run = experiment.submit(pipeline)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.widgets import RunDetails\n",
    "RunDetails(pipeline_run).show()"
   ]
  },
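  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As noted earlier, you can stop the run before it completes. The sketch below is left commented out so that \"Run All\" does not cancel the experiment; `PipelineRun.cancel()` stops the whole run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: cancel the whole pipeline run (uncomment to use).\n",
    "# pipeline_run.cancel()"
   ]
  },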
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline_run.wait_for_completion()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Examine Results\n",
    "\n",
    "### Retrieve the metrics of all child runs\n",
    "The outputs of the run above can be used as inputs of other steps in a pipeline. In this tutorial, we examine the outputs by downloading the output data and running some tests."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)\n",
    "num_file_downloaded = metrics_output.download('.', show_progress=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "# The file was downloaded to '.' under its path on the datastore.\n",
    "with open(metrics_output._path_on_datastore) as f:\n",
    "    metrics_output_result = f.read()\n",
    "\n",
    "deserialized_metrics_output = json.loads(metrics_output_result)\n",
    "df = pd.DataFrame(deserialized_metrics_output)\n",
    "df"
   ]
  },
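  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The introduction promised retrieving models for any iteration. A minimal sketch: locate the AutoML step run inside the pipeline run, wrap it in an `AutoMLRun`, and call `get_output()`; the iteration number below is an illustrative assumption."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.train.automl.run import AutoMLRun\n",
    "\n",
    "# Locate the AutoML step run inside the pipeline run by its step name.\n",
    "step_run = pipeline_run.find_step_run('automl_module')[0]\n",
    "automl_run = AutoMLRun(experiment=experiment, run_id=step_run.id)\n",
    "\n",
    "# Fetch the run and fitted model of a specific iteration (0 is illustrative).\n",
    "iteration_run, iteration_model = automl_run.get_output(iteration=0)\n",
    "print(iteration_run)"
   ]
  },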
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Retrieve the Best Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)\n",
    "num_file_downloaded = best_model_output.download('.', show_progress=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle\n",
    "\n",
    "with open(best_model_output._path_on_datastore, \"rb\") as f:\n",
    "    best_model = pickle.load(f)\n",
    "best_model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test the Model\n",
    "#### Load Test Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "digits = datasets.load_digits()\n",
    "X_test = digits.data[:10, :]\n",
    "y_test = digits.target[:10]\n",
    "images = digits.images[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Testing Best Model"
   ]
  },
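  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before visualizing individual predictions, a quick overall check; this is a minimal sketch using `sklearn.metrics.accuracy_score` on the ten samples loaded above (these samples were also part of the training data, so treat this as a smoke test, not a proper evaluation)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "# Score the fitted model on the ten sample digits loaded above.\n",
    "y_pred = best_model.predict(X_test)\n",
    "print('Accuracy on the 10 samples:', accuracy_score(y_test, y_pred))"
   ]
  },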
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Randomly select digits and test.\n",
    "for index in np.random.choice(len(y_test), 3, replace=False):\n",
    "    print(index)\n",
    "    predicted = best_model.predict(X_test[index:index + 1])[0]\n",
    "    label = y_test[index]\n",
    "    title = \"Label value = %d  Predicted value = %d\" % (label, predicted)\n",
    "    fig = plt.figure(1, figsize=(3, 3))\n",
    "    ax1 = fig.add_axes((0, 0, .8, .8))\n",
    "    ax1.set_title(title)\n",
    "    plt.imshow(images[index], cmap=plt.cm.gray_r, interpolation='nearest')\n",
    "    plt.show()"
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "sanpil"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "python36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}