{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright (c) Microsoft Corporation. All rights reserved.\n", "\n", "Licensed under the MIT License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/dataprep-remote-execution/auto-ml-dataprep-remote-execution.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Automated Machine Learning\n", "_**Load Data using `TabularDataset` for Remote Execution (AmlCompute)**_\n", "\n", "## Contents\n", "1. [Introduction](#Introduction)\n", "1. [Setup](#Setup)\n", "1. [Data](#Data)\n", "1. [Train](#Train)\n", "1. [Results](#Results)\n", "1. [Test](#Test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "In this example we showcase how you can use AzureML Dataset to load data for AutoML.\n", "\n", "Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.\n", "\n", "In this notebook you will learn how to:\n", "1. Create a `TabularDataset` pointing to the training data.\n", "2. Pass the `TabularDataset` to AutoML for a remote run." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "import pandas as pd\n", "\n", "import azureml.core\n", "from azureml.core.experiment import Experiment\n", "from azureml.core.workspace import Workspace\n", "from azureml.core.dataset import Dataset\n", "from azureml.train.automl import AutoMLConfig" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ws = Workspace.from_config()\n", "\n", "# choose a name for experiment\n", "experiment_name = 'automl-dataset-remote-bai'\n", " \n", "experiment = Experiment(ws, experiment_name)\n", " \n", "output = {}\n", "output['SDK version'] = azureml.core.VERSION\n", "output['Subscription ID'] = ws.subscription_id\n", "output['Workspace Name'] = ws.name\n", "output['Resource Group'] = ws.resource_group\n", "output['Location'] = ws.location\n", "output['Experiment Name'] = experiment.name\n", "pd.set_option('display.max_colwidth', -1)\n", "outputDf = pd.DataFrame(data = output, index = [''])\n", "outputDf.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The data referenced here was a 1MB simple random sample of the Chicago Crime data into a local temporary directory.\n", "example_data = 'https://dprepdata.blob.core.windows.net/demo/crime0-random.csv'\n", "dataset = Dataset.Tabular.from_delimited_files(example_data)\n", "dataset.take(5).to_pandas_dataframe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Review the data\n", "\n", "You can peek the result of a `TabularDataset` at any range using `skip(i)` and `take(j).to_pandas_dataframe()`. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Drop the 'FBI Code' column and identify the label column for training.\n", "training_data = dataset.drop_columns(columns=['FBI Code'])\n", "label_column_name = 'Primary Type'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train\n", "\n", "This creates a general AutoML settings dictionary that applies to both local and remote runs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl_settings = {\n", "    \"iteration_timeout_minutes\" : 10,\n", "    \"iterations\" : 2,\n", "    \"primary_metric\" : 'AUC_weighted',\n", "    \"preprocess\" : True,\n", "    \"verbosity\" : logging.INFO\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create or Attach an AmlCompute cluster" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.core.compute import AmlCompute\n", "from azureml.core.compute import ComputeTarget\n", "\n", "# Choose a name for your cluster.\n", "amlcompute_cluster_name = \"automlc2\"\n", "\n", "found = False\n", "\n", "# Check if this compute target already exists in the workspace.\n", "cts = ws.compute_targets\n", "if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':\n", "    found = True\n", "    print('Found existing compute target.')\n", "    compute_target = cts[amlcompute_cluster_name]\n", "\n", "if not found:\n", "    print('Creating a new compute target...')\n", "    provisioning_config = AmlCompute.provisioning_configuration(vm_size = \"STANDARD_D2_V2\", # for GPU, use \"STANDARD_NC6\"\n", "                                                                #vm_priority = 'lowpriority', # optional\n", "                                                                max_nodes = 6)\n", "\n", "    # Create the cluster.\n", "    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)\n", "\n", "print('Checking cluster status...')\n", "# Poll for a minimum number of nodes, with a specific timeout.\n", "# If no min_node_count is provided, the scale settings for the cluster are used.\n", "compute_target.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)\n", "\n", "# For a more detailed view of the current AmlCompute status, use get_status()." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.core.runconfig import RunConfiguration\n", "from azureml.core.conda_dependencies import CondaDependencies\n", "\n", "# Create a new RunConfig object.\n", "conda_run_config = RunConfiguration(framework=\"python\")\n", "\n", "# Set the compute target to the AmlCompute cluster and run inside Docker.\n", "conda_run_config.target = compute_target\n", "conda_run_config.environment.docker.enabled = True\n", "\n", "cd = CondaDependencies.create(conda_packages=['numpy','py-xgboost<=0.80'])\n", "conda_run_config.environment.python.conda_dependencies = cd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pass Data with `TabularDataset` Objects\n", "\n", "The `TabularDataset` objects created above can be passed to the `submit` method for a remote run. AutoML will serialize the `TabularDataset` object and send it to the remote compute target; the `TabularDataset` is not evaluated locally." ] },
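{ "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, you can hold out part of the training data for validation before submitting. The cell below is a minimal sketch rather than part of the original flow: `random_split` is a standard `TabularDataset` method, but the 90/10 split and the seed are arbitrary example values, and whether the held-out set can be handed to AutoML directly (for example via a `validation_data` parameter on `AutoMLConfig`) depends on your SDK version." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: split the prepared TabularDataset into train/validation subsets.\n", "# random_split returns two new, immutable TabularDataset objects;\n", "# the 0.9 fraction and seed=0 are arbitrary example values.\n", "train_subset, validation_subset = training_data.random_split(percentage=0.9, seed=0)\n", "\n", "# Peek at a few rows of each subset to sanity-check the split.\n", "print(train_subset.take(3).to_pandas_dataframe())\n", "print(validation_subset.take(3).to_pandas_dataframe())" ] },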
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl_config = AutoMLConfig(task = 'classification',\n", " debug_log = 'automl_errors.log',\n", " run_configuration=conda_run_config,\n", " training_data = training_data,\n", " label_column_name = label_column_name,\n", " **automl_settings)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "remote_run = experiment.submit(automl_config, show_output = True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "remote_run" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pre-process cache cleanup\n", "The preprocess data gets cache at user default file store. When the run is completed the cache can be cleaned by running below cell" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "remote_run.clean_preprocessor_cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cancelling Runs\n", "You can cancel ongoing remote runs using the `cancel` and `cancel_iteration` functions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cancel the ongoing experiment and stop scheduling new iterations.\n", "# remote_run.cancel()\n", "\n", "# Cancel iteration 1 and move onto iteration 2.\n", "# remote_run.cancel_iteration(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Widget for Monitoring Runs\n", "\n", "The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n", "\n", "**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.widgets import RunDetails\n", "RunDetails(remote_run).show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Retrieve All Child Runs\n", "You can also use SDK methods to fetch all the child runs and see individual metrics that we log." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "children = list(remote_run.get_children())\n", "metricslist = {}\n", "for run in children:\n", " properties = run.get_properties()\n", " metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n", " metricslist[int(properties['iteration'])] = metrics\n", " \n", "rundata = pd.DataFrame(metricslist).sort_index(1)\n", "rundata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retrieve the Best Model\n", "\n", "Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_run, fitted_model = remote_run.get_output()\n", "print(best_run)\n", "print(fitted_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Best Model Based on Any Other Metric\n", "Show the run and the model that has the smallest `log_loss` value:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lookup_metric = \"log_loss\"\n", "best_run, fitted_model = remote_run.get_output(metric = lookup_metric)\n", "print(best_run)\n", "print(fitted_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Model from a Specific Iteration\n", "Show the run and the model from the first iteration:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iteration = 0\n", "best_run, fitted_model = remote_run.get_output(iteration = iteration)\n", "print(best_run)\n", "print(fitted_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test\n", "\n", "#### Load Test Data\n", "For the test data, it should have the same preparation step as the train data. Otherwise it might get failed at the preprocessing step." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset_test = Dataset.Tabular.from_delimited_files(path='https://dprepdata.blob.core.windows.net/demo/crime0-test.csv')\n", "\n", "df_test = dataset_test.to_pandas_dataframe()\n", "df_test = df_test[pd.notnull(df_test['Primary Type'])]\n", "\n", "y_test = df_test[['Primary Type']]\n", "X_test = df_test.drop(['Primary Type', 'FBI Code'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Testing Our Best Fitted Model\n", "We will use confusion matrix to see how our model works." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pandas_ml import ConfusionMatrix\n", "\n", "ypred = fitted_model.predict(X_test)\n", "\n", "cm = ConfusionMatrix(y_test['Primary Type'], ypred)\n", "\n", "print(cm)\n", "\n", "cm.plot()" ] } ], "metadata": { "authors": [ { "name": "savitam" } ], "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }