{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automated Machine Learning\n",
"_**Regression with Deployment using Hardware Performance Dataset**_\n",
"\n",
"## Contents\n",
"1. [Introduction](#Introduction)\n",
"1. [Setup](#Setup)\n",
"1. [Data](#Data)\n",
"1. [Train](#Train)\n",
"1. [Results](#Results)\n",
"1. [Test](#Test)\n",
"1. [Acknowledgements](#Acknowledgements)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"In this example we use the Hardware Performance Dataset to showcase how you can use AutoML for a simple regression problem. The regression goal is to predict the performance of certain combinations of hardware parts.\n",
"\n",
"If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration](../../../configuration.ipynb) notebook first, if you haven't already, to establish your connection to the AzureML Workspace.\n",
"\n",
"In this notebook you will learn how to:\n",
"1. Create an `Experiment` in an existing `Workspace`.\n",
"2. Configure AutoML using `AutoMLConfig`.\n",
"3. Train the model on remote compute (AmlCompute).\n",
"4. Explore the results.\n",
"5. Test the best fitted model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"\n",
"from matplotlib import pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import os\n",
"\n",
"import azureml.core\n",
"from azureml.core.experiment import Experiment\n",
"from azureml.core.workspace import Workspace\n",
"from azureml.core.dataset import Dataset\n",
"from azureml.train.automl import AutoMLConfig"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# Choose a name for the experiment.\n",
"experiment_name = 'automl-regression-hardware'\n",
"\n",
"experiment = Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output['SDK version'] = azureml.core.VERSION\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace Name'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Experiment Name'] = experiment.name\n",
"pd.set_option('display.max_colwidth', -1)\n",
"outputDf = pd.DataFrame(data = output, index = [''])\n",
"outputDf.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create or Attach existing AmlCompute\n",
"You will need to create a compute target for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.\n",
"#### Creation of AmlCompute takes approximately 5 minutes.\n",
"If an AmlCompute target with that name already exists in your workspace, this code will skip the creation process.\n",
"As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read the Azure Machine Learning documentation on the default limits and how to request more quota."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import AmlCompute\n",
"from azureml.core.compute import ComputeTarget\n",
"\n",
"# Choose a name for your cluster.\n",
"amlcompute_cluster_name = \"automlcl\"\n",
"\n",
"found = False\n",
"# Check if this compute target already exists in the workspace.\n",
"cts = ws.compute_targets\n",
"if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':\n",
"    found = True\n",
"    print('Found existing compute target.')\n",
"    compute_target = cts[amlcompute_cluster_name]\n",
"\n",
"if not found:\n",
"    print('Creating a new compute target...')\n",
"    provisioning_config = AmlCompute.provisioning_configuration(vm_size = \"STANDARD_D2_V2\", # for GPU, use \"STANDARD_NC6\"\n",
"                                                                #vm_priority = 'lowpriority', # optional\n",
"                                                                max_nodes = 6)\n",
"\n",
"    # Create the cluster.\n",
"    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)\n",
"\n",
"print('Checking cluster status...')\n",
"# Can poll for a minimum number of nodes and for a specific timeout.\n",
"# If no min_node_count is provided, it will use the scale settings for the cluster.\n",
"compute_target.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)\n",
"\n",
"# For a more detailed view of current AmlCompute status, use get_status()."
]
},
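{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also inspect which VM sizes AmlCompute supports in your workspace's region, which is useful when picking `vm_size` above. This is a minimal sketch; it assumes the `AmlCompute.supported_vmsizes` helper of the installed SDK version and just prints the first few entries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: list a few VM sizes AmlCompute supports in this workspace's region.\n",
"# Assumes AmlCompute.supported_vmsizes is available in the installed SDK version.\n",
"for vm_size in AmlCompute.supported_vmsizes(workspace=ws)[:5]:\n",
"    print(vm_size)"
]
},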
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"\n",
"Create a run configuration for the remote run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import RunConfiguration\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"# Create a new RunConfig object.\n",
"conda_run_config = RunConfiguration(framework=\"python\")\n",
"\n",
"# Set the compute target to AmlCompute and run inside a Docker container.\n",
"conda_run_config.target = compute_target\n",
"conda_run_config.environment.docker.enabled = True\n",
"\n",
"cd = CondaDependencies.create(conda_packages=['numpy', 'py-xgboost<=0.80'])\n",
"conda_run_config.environment.python.conda_dependencies = cd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Data\n",
"\n",
"Load the hardware performance dataset into X and y. X contains the training features, which are inputs to the model. y contains the training labels, which are the expected output of the model. Splitting X and y with the same seed keeps the feature and label rows aligned across the 80/20 train/test split."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = \"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/machineData.csv\"\n",
"dataset = Dataset.Tabular.from_delimited_files(data)\n",
"X = dataset.drop_columns(columns=['ERP'])\n",
"y = dataset.keep_columns(columns=['ERP'], validate=True)\n",
"X_train, X_test = X.random_split(percentage=0.8, seed=223)\n",
"y_train, y_test = y.random_split(percentage=0.8, seed=223)\n",
"dataset.take(5).to_pandas_dataframe()"
]
},
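{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (a minimal sketch, not part of the original flow), you can materialize each split as a pandas DataFrame and confirm that the feature and label row counts match, since `random_split` was called on X and y with the same seed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: splitting X and y with the same seed should yield matching row counts.\n",
"print('train rows:', X_train.to_pandas_dataframe().shape[0], '==', y_train.to_pandas_dataframe().shape[0])\n",
"print('test rows: ', X_test.to_pandas_dataframe().shape[0], '==', y_test.to_pandas_dataframe().shape[0])"
]
},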
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Train\n",
"\n",
"Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.\n",
"\n",
"|Property|Description|\n",
"|-|-|\n",
"|**task**|classification or regression|\n",
"|**primary_metric**|This is the metric that you want to optimize. Regression supports the following primary metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|\n",
"|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|\n",
"|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|\n",
"|**n_cross_validations**|Number of cross validation splits.|\n",
"|**X**|(sparse) array-like, shape = [n_samples, n_features]|\n",
"|**y**|(sparse) array-like, shape = [n_samples, ], target values.|\n",
"\n",
"**_You can find more information about primary metrics_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#primary-metric)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### For even better results, increase \"iteration_timeout_minutes\" to 10+ minutes and increase \"iterations\" to a minimum of 30."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_settings = {\n",
"    \"iteration_timeout_minutes\": 5,\n",
"    \"iterations\": 10,\n",
"    \"n_cross_validations\": 5,\n",
"    \"primary_metric\": 'spearman_correlation',\n",
"    \"preprocess\": True,\n",
"    \"max_concurrent_iterations\": 5,\n",
"    \"verbosity\": logging.INFO,\n",
"}\n",
"\n",
"automl_config = AutoMLConfig(task = 'regression',\n",
"                             debug_log = 'automl_errors.log',\n",
"                             run_configuration=conda_run_config,\n",
"                             X = X_train,\n",
"                             y = y_train,\n",
"                             **automl_settings\n",
"                            )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run = experiment.submit(automl_config, show_output = False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Widget for Monitoring Runs\n",
"\n",
"The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n",
"\n",
"**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(remote_run).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Wait until the run finishes.\n",
"remote_run.wait_for_completion(show_output = True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieve All Child Runs\n",
"You can also use SDK methods to fetch all the child runs and see the individual metrics that we log."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"children = list(remote_run.get_children())\n",
"metricslist = {}\n",
"for run in children:\n",
"    properties = run.get_properties()\n",
"    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
"    metricslist[int(properties['iteration'])] = metrics\n",
"\n",
"rundata = pd.DataFrame(metricslist).sort_index(axis=1)\n",
"rundata"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieve the Best Model\n",
"Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for any logged metric or for a particular iteration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run, fitted_model = remote_run.get_output()\n",
"print(best_run)\n",
"print(fitted_model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Best Model Based on Any Other Metric\n",
"Show the run and the model that has the smallest `root_mean_squared_error` value (which turned out to be the same as the one with the largest `spearman_correlation` value):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lookup_metric = \"root_mean_squared_error\"\n",
"best_run, fitted_model = remote_run.get_output(metric = lookup_metric)\n",
"print(best_run)\n",
"print(fitted_model)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve the run and fitted model for a specific iteration.\n",
"iteration = 3\n",
"third_run, third_model = remote_run.get_output(iteration = iteration)\n",
"print(third_run)\n",
"print(third_model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Register the Fitted Model for Deployment\n",
"If neither metric nor iteration is specified in the `register_model` call, the iteration with the best primary metric is registered."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"description = 'AutoML Model'\n",
"tags = None\n",
"model = remote_run.register_model(description = description, tags = tags)\n",
"\n",
"print(remote_run.model_id) # This will be written to the script file later in the notebook."
]
},
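{
"cell_type": "markdown",
"metadata": {},
"source": [
"`register_model` returns a `Model` object. As a quick check (a minimal sketch), print the registry name and version, which is what the scoring script will look up at deployment time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect the registered model entry returned by register_model.\n",
"print(model.name, model.version)"
]
},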
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Scoring Script\n",
"The scoring script is required to generate the image for deployment. It contains the code that makes predictions on the input data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile score.py\n",
"import json\n",
"import numpy\n",
"import azureml.train.automl  # ensures AutoML dependencies are importable when the model is deserialized\n",
"from sklearn.externals import joblib\n",
"from azureml.core.model import Model\n",
"\n",
"def init():\n",
"    global model\n",
"    model_path = Model.get_model_path(model_name = '<<modelid>>') # this name is model.id of the model that we want to deploy\n",
"    # Deserialize the model file back into a sklearn model.\n",
"    model = joblib.load(model_path)\n",
"\n",
"def run(rawdata):\n",
"    try:\n",
"        data = json.loads(rawdata)['data']\n",
"        data = numpy.array(data)\n",
"        result = model.predict(data)\n",
"    except Exception as e:\n",
"        result = str(e)\n",
"        return json.dumps({\"error\": result})\n",
"    return json.dumps({\"result\": result.tolist()})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a YAML File for the Environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To ensure the fit results are consistent with the training results, the SDK dependency versions need to be the same as the environment that trains the model. Details about retrieving the versions can be found in notebook [12.auto-ml-retrieve-the-training-sdk-versions](12.auto-ml-retrieve-the-training-sdk-versions.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dependencies = remote_run.get_run_sdk_dependencies(iteration = 1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for p in ['azureml-train-automl', 'azureml-core']:\n",
"    print('{}\\t{}'.format(p, dependencies[p]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"myenv = CondaDependencies.create(conda_packages=['numpy','scikit-learn','py-xgboost==0.80'], pip_packages=['azureml-defaults','azureml-train-automl'])\n",
"\n",
"conda_env_file_name = 'myenv.yml'\n",
"myenv.save_to_file('.', conda_env_file_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Substitute the actual version number in the environment file.\n",
"# This is not strictly needed in this notebook because the model should have been generated using the current SDK version.\n",
"# However, we include this in case this code is used on an experiment from a previous SDK version.\n",
"\n",
"with open(conda_env_file_name, 'r') as cefr:\n",
"    content = cefr.read()\n",
"\n",
"with open(conda_env_file_name, 'w') as cefw:\n",
"    cefw.write(content.replace(azureml.core.VERSION, dependencies['azureml-train-automl']))\n",
"\n",
"# Substitute the actual model id in the script file.\n",
"\n",
"script_file_name = 'score.py'\n",
"\n",
"with open(script_file_name, 'r') as cefr:\n",
"    content = cefr.read()\n",
"\n",
"with open(script_file_name, 'w') as cefw:\n",
"    cefw.write(content.replace('<<modelid>>', remote_run.model_id))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Deploy the Model as a Web Service on Azure Container Instances"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.model import InferenceConfig\n",
"from azureml.core.webservice import AciWebservice\n",
"from azureml.core.webservice import Webservice\n",
"from azureml.core.model import Model\n",
"\n",
"inference_config = InferenceConfig(runtime = \"python\",\n",
"                                   entry_script = script_file_name,\n",
"                                   conda_file = conda_env_file_name)\n",
"\n",
"aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1,\n",
"                                               memory_gb = 1,\n",
"                                               tags = {'area': \"hardware\", 'type': \"automl_regression\"},\n",
"                                               description = 'sample service for AutoML regression')\n",
"\n",
"aci_service_name = 'automl-sample-hardware'\n",
"print(aci_service_name)\n",
"aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)\n",
"aci_service.wait_for_deployment(True)\n",
"print(aci_service.state)"
]
},
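{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the service deployed, you can send it a few rows of test data. This is a minimal sketch, assuming the deployment above succeeded; it builds the `{'data': [...]}` payload in the shape that `run(rawdata)` in `score.py` expects and calls the service through the SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Take a couple of test rows and wrap them in the {'data': [...]} shape score.py expects.\n",
"sample = X_test.take(2).to_pandas_dataframe()\n",
"# to_json converts numpy types to plain JSON values; round-trip it to build the payload.\n",
"input_payload = json.dumps({'data': json.loads(sample.to_json(orient='values'))})\n",
"\n",
"# Call the deployed service; the response is the JSON string returned by score.py.\n",
"print(aci_service.run(input_data=input_payload))"
]
},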
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Delete a Web Service\n",
"\n",
"Deletes the specified web service."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#aci_service.delete()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get Logs from a Deployed Web Service\n",
"\n",
"Gets logs from a deployed web service."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#aci_service.get_logs()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test\n",
"\n",
"Now that the model is trained, split the data the same way it was split for training (the difference here is that the data is split locally), then run the test data through the trained model to get the predicted values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Materialize the dataset splits as pandas DataFrames and flatten the labels to 1-d arrays.\n",
"X_test = X_test.to_pandas_dataframe()\n",
"y_test = y_test.to_pandas_dataframe()\n",
"y_test = np.array(y_test)\n",
"y_test = y_test[:, 0]\n",
"X_train = X_train.to_pandas_dataframe()\n",
"y_train = y_train.to_pandas_dataframe()\n",
"y_train = np.array(y_train)\n",
"y_train = y_train[:, 0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Predict on training and test set, and calculate residual values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_pred_train = fitted_model.predict(X_train)\n",
"y_residual_train = y_train - y_pred_train\n",
"\n",
"y_pred_test = fitted_model.predict(X_test)\n",
"y_residual_test = y_test - y_pred_test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculate metrics for the prediction\n",
"\n",
"Now visualize the residuals, and compare the actual (truth) values with the values predicted by the trained model that was returned."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"\n",
"# Set up a multi-plot chart.\n",
"f, (a0, a1) = plt.subplots(1, 2, gridspec_kw = {'width_ratios':[1, 1], 'wspace':0, 'hspace': 0})\n",
"f.suptitle('Regression Residual Values', fontsize = 18)\n",
"f.set_figheight(6)\n",
"f.set_figwidth(16)\n",
"\n",
"# Plot residual values of training set.\n",
"a0.axis([0, 360, -200, 200])\n",
"a0.plot(y_residual_train, 'bo', alpha = 0.5)\n",
"a0.plot([-10, 360], [0, 0], 'r-', lw = 3)\n",
"a0.text(16, 170, 'RMSE = {0:.2f}'.format(np.sqrt(mean_squared_error(y_train, y_pred_train))), fontsize = 12)\n",
"a0.text(16, 140, 'R2 score = {0:.2f}'.format(r2_score(y_train, y_pred_train)), fontsize = 12)\n",
"a0.set_xlabel('Training samples', fontsize = 12)\n",
"a0.set_ylabel('Residual Values', fontsize = 12)\n",
"\n",
"# Plot residual values of test set.\n",
"a1.axis([0, 90, -200, 200])\n",
"a1.plot(y_residual_test, 'bo', alpha = 0.5)\n",
"a1.plot([-10, 90], [0, 0], 'r-', lw = 3)\n",
"a1.text(5, 170, 'RMSE = {0:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred_test))), fontsize = 12)\n",
"a1.text(5, 140, 'R2 score = {0:.2f}'.format(r2_score(y_test, y_pred_test)), fontsize = 12)\n",
"a1.set_xlabel('Test samples', fontsize = 12)\n",
"a1.set_yticklabels([])\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib notebook\n",
"# Scatter plot of predicted vs. actual values on the test set.\n",
"test_pred = plt.scatter(y_test, y_pred_test, color='b')\n",
"test_test = plt.scatter(y_test, y_test, color='g')\n",
"plt.legend((test_pred, test_test), ('prediction', 'truth'), loc='upper left', fontsize=8)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Acknowledgements\n",
"This Predicting Hardware Performance Dataset is made available under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication License: https://creativecommons.org/publicdomain/zero/1.0/. Any rights in individual contents of the database are licensed under the same CC0 1.0 license. The dataset itself can be found here: https://www.kaggle.com/faizunnabi/comp-hardware-performance and https://archive.ics.uci.edu/ml/datasets/Computer+Hardware\n",
"\n",
"_**Citation Found Here**_\n"
]
}
],
"metadata": {
"authors": [
{
"name": "v-rasav"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}