Update 12-1-19

This commit is contained in:
rastala
2018-12-01 20:37:45 -05:00
parent 96523ec751
commit 74309f91f7
9 changed files with 729 additions and 1408 deletions

View File

@@ -13,30 +13,65 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"# 00. Installation and configuration\n", "# Configuration\n",
"This notebook configures your library of notebooks to connect to an Azure Machine Learning Workspace. In this case, a library contains all of the notebooks in the current folder and any nested folders. You can configure this notebook to use an existing workspace or create a new workspace.\n",
"\n", "\n",
"## What is an Azure ML Workspace and why do I need one?\n", "_**Setting up your Azure Machine Learning services workspace and configuring your notebook library**_\n",
"\n", "\n",
"An AML Workspace is an Azure resource that organizes and coordinates the actions of many other Azure resources to assist in executing and sharing machine learning workflows. In particular, an AML Workspace coordinates storage, databases, and compute resources providing added functionality for machine learning experimentation, operationalization, and the monitoring of operationalized models." "---\n",
"---\n",
"\n",
"## Table of Contents\n",
"\n",
"1. [Introduction](#Introduction)\n",
" 1. What is an Azure Machine Learning workspace\n",
"1. [Setup](#Setup)\n",
" 1. Azure subscription\n",
" 1. Azure ML SDK and other library installation\n",
" 1. Azure Container Instance registration\n",
"1. [Configure your Azure ML Workspace](#Configure%20your%20Azure%20ML%20workspace)\n",
" 1. Workspace parameters\n",
" 1. Access your workspace\n",
" 1. Create a new workspace\n",
" 1. Create compute resources\n",
"1. [Next steps](#Next%20steps)\n",
"\n",
"---\n",
"\n",
"## Introduction\n",
"\n",
"This notebook configures your library of notebooks to connect to an Azure Machine Learning (ML) workspace. In this case, a library contains all of the notebooks in the current folder and any nested folders. You can configure this notebook library to use an existing workspace or create a new workspace.\n",
"\n",
"Typically you will need to run this notebook only once per notebook library as all other notebooks will use connection information that is written here. If you want to redirect your notebook library to work with a different workspace, then you should re-run this notebook.\n",
"\n",
"In this notebook you will\n",
"* Learn about getting an Azure subscription\n",
"* Specify your workspace parameters\n",
"* Access or create your workspace\n",
"* Add a default compute cluster for your workspace\n",
"\n",
"### What is an Azure Machine Learning workspace\n",
"\n",
"An Azure ML Workspace is an Azure resource that organizes and coordinates the actions of many other Azure resources to assist in executing and sharing machine learning workflows. In particular, an Azure ML Workspace coordinates storage, databases, and compute resources providing added functionality for machine learning experimentation, deployment, inferencing, and the monitoring of deployed models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"This section describes activities required before you can access any Azure ML services functionality."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Azure Subscription\n",
"\n",
"In order to create an Azure ML Workspace, first you need access to an Azure subscription. An Azure subscription allows you to manage storage, compute, and other assets in the Azure cloud. You can [create a new subscription](https://azure.microsoft.com/en-us/free/) or access existing subscription information from the [Azure portal](https://portal.azure.com). Later in this notebook you will need information such as your subscription ID in order to create and access AML workspaces.\n",
"\n",
"### 2. Azure ML SDK and other library installation\n",
"\n",
"If you are running in your own environment, follow [SDK installation instructions](https://docs.microsoft.com/azure/machine-learning/service/how-to-configure-environment). If you are running in Azure Notebooks or another Microsoft managed environment, the SDK is already installed.\n",
"\n",
@@ -46,7 +81,7 @@
"(myenv) $ conda install -y matplotlib tqdm scikit-learn\n", "(myenv) $ conda install -y matplotlib tqdm scikit-learn\n",
"```\n", "```\n",
"\n", "\n",
"Once installation is complete, check the Azure ML SDK version:" "Once installation is complete, the following cell checks the Azure ML SDK version:"
] ]
}, },
{ {
@@ -61,15 +96,18 @@
"source": [ "source": [
"import azureml.core\n", "import azureml.core\n",
"\n", "\n",
"print(\"SDK Version:\", azureml.core.VERSION)" "print(\"This notebook was created using version AZUREML-SDK-VERSION of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are using an older version of the SDK than the one this notebook was created with, you should upgrade your SDK.\n",
"\n",
"### 3. Azure Container Instance registration\n",
"Azure Machine Learning uses of [Azure Container Instance (ACI)](https://azure.microsoft.com/services/container-instances) to deploy dev/test web services. An Azure subscription needs to be registered to use ACI. If you or the subscription owner have not yet registered ACI on your subscription, you will need to use the [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and execute the following commands. Note that if you ran through the AML [quickstart](https://docs.microsoft.com/en-us/azure/machine-learning/service/quickstart-get-started) you have already registered ACI. \n",
"\n", "\n",
"```shell\n", "```shell\n",
"# check to see if ACI is already registered\n", "# check to see if ACI is already registered\n",
@@ -78,21 +116,38 @@
"# if ACI is not registered, run this command.\n", "# if ACI is not registered, run this command.\n",
"# note you need to be the subscription owner in order to execute this command successfully.\n", "# note you need to be the subscription owner in order to execute this command successfully.\n",
"(myenv) $ az provider register -n Microsoft.ContainerInstance\n", "(myenv) $ az provider register -n Microsoft.ContainerInstance\n",
"```" "```\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure your Azure ML workspace\n",
"\n",
"### Workspace parameters\n",
"\n",
"To use an AML Workspace, you will need to import the Azure ML SDK and supply the following information:\n",
"* Your subscription id\n",
"* A resource group name\n",
"* (optional) The region that will host your workspace\n",
"* A name for your workspace\n",
"\n",
"You can get your subscription ID from the [Azure portal](https://portal.azure.com).\n",
"\n",
"You will also need access to a [_resource group_](https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-overview#resource-groups), which organizes Azure resources and provides a default region for the resources in a group. You can see what resource groups to which you have access, or create a new one in the [Azure portal](https://portal.azure.com). If you don't have a resource group, the create workspace command will create one for you using the name you provide.\n",
"\n",
"The region to host your workspace will be used if you are creating a new workspace. You do not need to specify this if you are using an existing workspace. You can find the list of supported regions [here](https://azure.microsoft.com/en-us/global-infrastructure/services/?products=machine-learning-service). You should pick a region that is close to your location or that contains your data.\n",
"\n",
"The name for your workspace is unique within the subscription and should be descriptive enough to discern among other AML Workspaces. The subscription may be used only by you, or it may be used by your department or your entire enterprise, so choose a name that makes sense for your situation.\n",
"\n",
"The following cell allows you to specify your workspace parameters. This cell uses the python method `os.getenv` to read values from environment variables which is useful for automation. If no environment variable exists, the parameters will be set to the specified default values. \n",
"\n", "\n",
"### Option 1: You have workspace already\n",
"If you ran the Azure Machine Learning [quickstart](https://docs.microsoft.com/en-us/azure/machine-learning/service/quickstart-get-started) in Azure Notebooks, you already have a configured workspace! You can go to your Azure Machine Learning Getting Started library, view *config.json* file, and copy-paste the values for subscription ID, resource group and workspace name below.\n", "If you ran the Azure Machine Learning [quickstart](https://docs.microsoft.com/en-us/azure/machine-learning/service/quickstart-get-started) in Azure Notebooks, you already have a configured workspace! You can go to your Azure Machine Learning Getting Started library, view *config.json* file, and copy-paste the values for subscription ID, resource group and workspace name below.\n",
"\n", "\n",
"If you have a workspace created another way, [these instructions](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-environment#create-workspace-configuration-file) describe how to get your subscription and workspace information.\n", "Replace the default values in the cell below with your workspace parameters"
"\n",
"If this cell succeeds, you're done configuring this library! Otherwise continue to follow the instructions in the rest of the notebook."
] ]
}, },
{ {
@@ -103,9 +158,19 @@
"source": [ "source": [
"import os\n", "import os\n",
"\n", "\n",
"subscription_id = os.environ.get(\"SUBSCRIPTION_ID\", \"<my-subscription-id>\")\n", "subscription_id = os.getenv(\"SUBSCRIPTION_ID\", default=\"<my-subscription-id>\")\n",
"resource_group = os.environ.get(\"RESOURCE_GROUP\", \"<my-resource-group>\")\n", "resource_group = os.getenv(\"RESOURCE_GROUP\", default=\"<my-resource-group>\")\n",
"workspace_name = os.environ.get(\"WORKSPACE_NAME\", \"<my-workspace-name>\")" "workspace_name = os.getenv(\"WORKSPACE_NAME\", default=\"<my-workspace-name>\")\n",
"workspace_region = os.getenv(\"WORKSPACE_REGION\", default=\"eastus2\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Access your workspace\n",
"\n",
"The following cell uses the Azure ML SDK to attempt to load the workspace specified by your parameters. If this cell succeeds, your notebook library will be configured to access the workspace from all notebooks using the `Workspace.from_config()` method. The cell can fail if the specified workspace doesn't exist or you don't have permissions to access it. "
]
},
{
@@ -117,66 +182,30 @@
"from azureml.core import Workspace\n", "from azureml.core import Workspace\n",
"\n", "\n",
"try:\n", "try:\n",
" ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)\n", " ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)\n",
" ws.write_config()\n", " # write the details of the workspace to a configuration file to the notebook library\n",
" print('Workspace configuration succeeded. You are all set!')\n", " ws.write_config()\n",
" print(\"Workspace configuration succeeded. Skip the workspace creation steps below\")\n",
"except:\n", "except:\n",
" print('Workspace not found. Run the cells below.')" " print(\"Workspace not accessible. Change your parameters or create a new workspace below\")"
]
},
{
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Option 2: You don't have workspace yet\n", "### Create a new workspace\n",
"\n", "\n",
"If you don't have an existing workspace and are the owner of the subscription or resource group, you can create a new workspace. If you don't have a resource group, the create workspace command will create one for you using the name you provide.\n",
"\n", "\n",
"#### Requirements\n", "**Note**: As with other Azure services, there are limits on certain resources (for example AmlCompute quota) associated with the Azure ML service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.\n",
"\n", "\n",
"Inside your Azure subscription, you will need access to a _resource group_, which organizes Azure resources and provides a default region for the resources in a group. You can see what resource groups to which you have access, or create a new one in the [Azure portal](https://portal.azure.com). If you don't have a resource group, the create workspace command will create one for you using the name you provide.\n", "This cell will create an Azure ML workspace for you in a subscription provided you have the correct permissions.\n",
"\n", "\n",
"To create or access an Azure ML Workspace, you will need to import the AML library and the following information:\n", "This will fail if:\n",
"* A name for your workspace\n", "* You do not have permission to create a workspace in the resource group\n",
"* Your subscription id\n", "* You do not have permission to create a resource group if it's non-existing.\n",
"* The resource group name\n", "* You are not a subscription owner or contributor and no Azure ML workspaces have ever been created in this subscription\n",
"\n",
"**Note**: As with other Azure services, there are limits on certain resources (for eg. AmlCompute quota) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Supported Azure Regions\n",
"Specify a region where your workspace will be located from the list of [Azure Machine Learning regions](https://linktoregions)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"subscription_id = os.environ.get(\"SUBSCRIPTION_ID\", \"<my-subscription-id>\")\n",
"resource_group = os.environ.get(\"RESOURCE_GROUP\", \"my-aml-resource-group\")\n",
"workspace_name = os.environ.get(\"WORKSPACE_NAME\", \"my-first-workspace\")\n",
"\n",
"workspace_region = os.environ.get(\"WORKSPACE_REGION\", \"eastus2\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create the workspace\n",
"This cell will create an AML workspace for you in a subscription provided you have the correct permissions.\n",
"\n",
"This will fail when:\n",
"1. You do not have permission to create a workspace in the resource group\n",
"2. You do not have permission to create a resource group if it's non-existing.\n",
"2. You are not a subscription owner or contributor and no Azure ML workspaces have ever been created in this subscription\n",
"\n", "\n",
"If workspace creation fails, please work with your IT admin to provide you with the appropriate permissions or to provision the required resources." "If workspace creation fails, please work with your IT admin to provide you with the appropriate permissions or to provision the required resources."
] ]
@@ -191,9 +220,9 @@
},
"outputs": [],
"source": [
"# import the Workspace class and check the azureml SDK version\n",
"from azureml.core import Workspace\n", "from azureml.core import Workspace\n",
"\n", "\n",
"# Create the workspace using the specified parameters\n",
"ws = Workspace.create(name = workspace_name,\n", "ws = Workspace.create(name = workspace_name,\n",
" subscription_id = subscription_id,\n", " subscription_id = subscription_id,\n",
" resource_group = resource_group, \n", " resource_group = resource_group, \n",
@@ -201,6 +230,8 @@
" create_resource_group = True,\n", " create_resource_group = True,\n",
" exist_ok = True)\n", " exist_ok = True)\n",
"ws.get_details()\n", "ws.get_details()\n",
"\n",
"# write the details of the workspace to a configuration file to the notebook library\n",
"ws.write_config()" "ws.write_config()"
] ]
}, },
@@ -208,9 +239,23 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Create compute resources for your training experiments\n", "### Create compute resources for your training experiments\n",
"\n", "\n",
"Many of the subsequent examples use Azure Machine Learning managed compute (AmlCompute) to train models at scale. To create a **CPU** cluster now, run the cell below. The autoscale settings mean that the cluster will scale down to 0 nodes when inactive and up to 4 nodes when busy." "Many of the sample notebooks use Azure ML managed compute (AmlCompute) to train models using a dynamically scalable pool of compute. In this section you will create default compute clusters for use by the other notebooks and any other operations you choose.\n",
"\n",
"To create a cluster, you need to specify a compute configuration that specifies the type of machine to be used and the scalability behaviors. Then you choose a name for the cluster that is unique within the workspace that can be used to address the cluster later.\n",
"\n",
"The cluster parameters are:\n",
"* vm_size - this describes the virtual machine type and size used in the cluster. All machines in the cluster are the same type. You can get the list of vm sizes available in your region by using the CLI command\n",
"\n",
"```shell\n",
"az vm list-skus -o tsv\n",
"```\n",
"* min_nodes - this sets the minimum size of the cluster. If you set the minimum to 0 the cluster will shut down all nodes while note in use. Setting this number to a value higher than 0 will allow for faster start-up times, but you will also be billed when the cluster is not in use.\n",
"* max_nodes - this sets the maximum size of the cluster. Setting this to a larger number allows for more concurrency and a greater distributed processing of scale-out jobs.\n",
"\n",
"\n",
"To create a **CPU** cluster now, run the cell below. The autoscale settings mean that the cluster will scale down to 0 nodes when inactive and up to 4 nodes when busy."
]
},
{
@@ -228,13 +273,20 @@
"# Verify that cluster does not exist already\n", "# Verify that cluster does not exist already\n",
"try:\n", "try:\n",
" cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)\n", " cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)\n",
" print('Found existing cluster, use it.')\n", " print(\"Found existing cpucluster\")\n",
"except ComputeTargetException:\n", "except ComputeTargetException:\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',\n", " print(\"Creating new cpucluster\")\n",
" \n",
" # Specify the configuration for the new cluster\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_D2_V2\",\n",
" min_nodes=0,\n",
" max_nodes=4)\n", " max_nodes=4)\n",
" cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)\n",
"\n", "\n",
"cpu_cluster.wait_for_completion(show_output=True)" " # Create the cluster with the specified name and configuration\n",
" cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)\n",
" \n",
" # Wait for the cluster to complete, show the output log\n",
" cpu_cluster.wait_for_completion(show_output=True)"
]
},
{
@@ -256,24 +308,35 @@
"# Choose a name for your GPU cluster\n", "# Choose a name for your GPU cluster\n",
"gpu_cluster_name = \"gpucluster\"\n", "gpu_cluster_name = \"gpucluster\"\n",
"\n", "\n",
"# Check if cluster exists already\n", "# Verify that cluster does not exist already\n",
"try:\n", "try:\n",
" gpu_cluster = ComputeTarget(workspace=ws, name=gpu_cluster_name)\n", " gpu_cluster = ComputeTarget(workspace=ws, name=gpu_cluster_name)\n",
" print('Found existing cluster, use it.')\n", " print(\"Found existing gpu cluster\")\n",
"except ComputeTargetException:\n", "except ComputeTargetException:\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',\n", " print(\"Creating new gpucluster\")\n",
" \n",
" # Specify the configuration for the new cluster\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_NC6\",\n",
" min_nodes=0,\n",
" max_nodes=4)\n", " max_nodes=4)\n",
" # Create the cluster with the specified name and configuration\n",
" gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, compute_config)\n", " gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, compute_config)\n",
"\n", "\n",
"gpu_cluster.wait_for_completion(show_output=True)" " # Wait for the cluster to complete, show the output log\n",
" gpu_cluster.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Success!\n", "---\n",
"Great, you are ready to move on to the rest of the sample notebooks. A good place to start is the [01.train-model tutorial](./tutorials/01.train-model.ipynb) to learn how to train and then deploy an image classification model." "\n",
"## Next steps\n",
"\n",
"In this notebook you configured this notebook library to connect easily to an Azure ML workspace. You can copy this notebook to your own libraries to connect them to you workspace, or use it to bootstrap new workspaces completely.\n",
"\n",
"If you came here from another notebook, you can return there and complete that exercise, or you can try out the [Tutorials](./tutorials) or jump into \"how-to\" notebooks and start creating and deploying models. A good place to start is the [train in notebook](./how-to-use-azureml/training/train-in-notebook) example that walks through a simplified but complete end to end machine learning process."
]
},
{
@@ -305,7 +368,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.6.7" "version": "3.6.5"
} }
}, },
"nbformat": 4, "nbformat": 4,

View File

@@ -284,13 +284,17 @@
"\n", "\n",
"import pandas as pd\n", "import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n", "from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import LabelEncoder\n",
"import os\n", "import os\n",
"from os.path import expanduser, join, dirname\n", "from os.path import expanduser, join, dirname\n",
"\n", "\n",
"def get_data():\n", "def get_data():\n",
" # Burning man 2016 data\n", " # Burning man 2016 data\n",
" df = pd.read_csv(\"/tmp/azureml_runs/data/data.tsv\", delimiter=\"\\t\", quotechar='\"')\n", " df = pd.read_csv(\"/tmp/azureml_runs/data/data.tsv\", delimiter=\"\\t\", quotechar='\"')\n",
" y = df[\"Label\"].values\n", " # get integer labels\n",
" le = LabelEncoder()\n",
" le.fit(df[\"Label\"].values)\n",
" y = le.transform(df[\"Label\"].values)\n",
" X = df.drop([\"Label\"], axis=1)\n", " X = df.drop([\"Label\"], axis=1)\n",
"\n", "\n",
" X_train, _, y_train, _ = train_test_split(X, y, test_size=0.1, random_state=42)\n", " X_train, _, y_train, _ = train_test_split(X, y, test_size=0.1, random_state=42)\n",
@@ -528,19 +532,26 @@
"source": [ "source": [
"import sklearn\n", "import sklearn\n",
"from sklearn.model_selection import train_test_split\n", "from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from pandas_ml import ConfusionMatrix\n", "from pandas_ml import ConfusionMatrix\n",
"\n", "\n",
"df = pd.read_csv(\"https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv\",\n", "df = pd.read_csv(\"https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv\",\n",
" delimiter=\"\\t\", quotechar='\"')\n", " delimiter=\"\\t\", quotechar='\"')\n",
"\n", "\n",
"y = df[\"Label\"].values\n", "# get integer labels\n",
"le = LabelEncoder()\n",
"le.fit(df[\"Label\"].values)\n",
"y = le.transform(df[\"Label\"].values)\n",
"X = df.drop([\"Label\"], axis=1)\n", "X = df.drop([\"Label\"], axis=1)\n",
"\n", "\n",
"_, X_test, _, y_test = train_test_split(X, y, test_size=0.1, random_state=42)\n", "_, X_test, _, y_test = train_test_split(X, y, test_size=0.1, random_state=42)\n",
"\n", "\n",
"ypred = fitted_model.predict(X_test.values)\n", "ypred = fitted_model.predict(X_test.values)\n",
"\n", "\n",
"cm = ConfusionMatrix(y_test, ypred)\n", "ypred_strings = le.inverse_transform(ypred)\n",
"ytest_strings = le.inverse_transform(y_test)\n",
"\n",
"cm = ConfusionMatrix(ytest_strings, ypred_strings)\n",
"\n", "\n",
"print(cm)\n", "print(cm)\n",
"\n", "\n",

View File

@@ -0,0 +1,568 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# AutoML 08b: Remote Execution with DataPrep\n",
"\n",
"This sample accesses a data file on a remote DSVM through Datastore using DataPrep. Advantages of using DataPrep are:\n",
"1. DataPrep supports reading from and writing to datastores.\n",
"2. DataPrep supports automatic file type and column type detection.\n",
"3. DataPrep makes passing data into AutoML really simple.\n",
"\n",
"More DataPrep documentation and examples can be found [here](https://github.com/Microsoft/AMLDataPrepDocs).\n",
"\n",
"Make sure you have executed the [00.configuration](00.configuration.ipynb) before running this notebook.\n",
"\n",
"In this notebook you would see\n",
"1. Storing data in DataStore.\n",
"2. Doing some basic data preparation using DataPrep and passing the prepared data (DataFlow) to AutoML for training (classficiation).\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Experiment\n",
"\n",
"As part of the setup you have already created a <b>Workspace</b>. For AutoML you would need to create an <b>Experiment</b>. An <b>Experiment</b> is a named object in a <b>Workspace</b>, which is used to run experiments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"import os\n",
"import random\n",
"import time\n",
"\n",
"from matplotlib import pyplot as plt\n",
"from matplotlib.pyplot import imshow\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn import datasets\n",
"\n",
"import azureml.core\n",
"from azureml.core.compute import DsvmCompute\n",
"from azureml.core.experiment import Experiment\n",
"from azureml.core.workspace import Workspace\n",
"from azureml.train.automl import AutoMLConfig\n",
"from azureml.train.automl.run import AutoMLRun"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# choose a name for experiment\n",
"experiment_name = 'automl-remote-datastore-file'\n",
"# project folder\n",
"project_folder = './sample_projects/automl-remote-dsvm-file'\n",
"\n",
"experiment=Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output['SDK version'] = azureml.core.VERSION\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Project Directory'] = project_folder\n",
"output['Experiment Name'] = experiment.name\n",
"pd.set_option('display.max_colwidth', -1)\n",
"pd.DataFrame(data=output, index=['']).T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Diagnostics\n",
"\n",
"Opt-in diagnostics for better experience, quality, and security of future releases"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.telemetry import set_diagnostics_collection\n",
"set_diagnostics_collection(send_diagnostics=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a Remote Linux DSVM\n",
"Note: If creation fails with a message about Marketplace purchase eligibilty, go to portal.azure.com, start creating DSVM there, and select \"Want to create programmatically\" to enable programmatic creation. Once you've enabled it, you can exit without actually creating VM.\n",
"\n",
"**Note**: By default SSH runs on port 22 and you don't need to specify it. But if for security reasons you can switch to a different port (such as 5022), you can append the port number to the address. [Read more](https://render.githubusercontent.com/documentation/sdk/ssh-issue.md) on this."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"compute_target_name = 'automl-dataprep'\n",
"\n",
"try:\n",
" while ws.compute_targets[compute_target_name].provisioning_state == 'Creating':\n",
" time.sleep(1)\n",
" \n",
" dsvm_compute = DsvmCompute(workspace=ws, name=compute_target_name)\n",
" print('found existing:', dsvm_compute.name)\n",
"except:\n",
" dsvm_config = DsvmCompute.provisioning_configuration(vm_size=\"Standard_D2_v2\")\n",
" dsvm_compute = DsvmCompute.create(ws, name=compute_target_name, provisioning_configuration=dsvm_config)\n",
" dsvm_compute.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Copy data file to local\n",
"\n",
"We will download a 1MB simple random sample of the Chicago Crime data into a local temporary directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tempfile\n",
"import requests\n",
"\n",
"temp_folder = tempfile.mkdtemp()\n",
"temp_tsv = os.path.join(temp_folder, 'crime0.csv')\n",
"\n",
"request = requests.get('https://dprepdata.blob.core.windows.net/demo/crime0-random.csv')\n",
"with open(temp_tsv, 'w', encoding='utf-8') as f:\n",
" f.write(request.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Upload data to the cloud"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's make the data available in your datastore. Datastore is a convenient construct associated with your workspace for you to reference different types of cloud storage locations (e.g. Azure Blob Containers, Azure File Shares, Azure Data Lake Stores, etc.). The benefit Datastore brings is you only need to register datastores once and you will be able to access them by name and will not need to expose secrets in your code. When you first create a workspace, a default datastore is registered for you which references the Azure Blob Container that was provisioned with the workspace. Let's upload the data we just got from the public location to the default datastore.\n",
"\n",
"The `csv` file is uploaded into a directory named `datasets` at the root of the datastore."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Workspace, Datastore\n",
"\n",
"ds = ws.get_default_datastore()\n",
"print(ds.datastore_type, ds.account_name, ds.container_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds.upload(src_dir=temp_folder, target_path='datasets', overwrite=True, show_progress=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Dataflow using DataPrep\n",
"Let's use DataPrep to read the `csv` file from the datastore we just uploaded to and get the data profile to make sure our data looks good. We will predict the type of the offense (`Primary Type`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep\n",
"\n",
"dflow = dprep.read_csv(path=ds.path('datasets/crime0.csv'))\n",
"dflow.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's also take a look at the first 5 rows of the data to give ourselves an idea of what the data looks like."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the first 5 rows, we see that there are some rows that have no value in the label column (`Primary Type`). Let's remove those rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.drop_nulls('Primary Type')\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we've removed those rows, let's split the dataflow into a features dataflow and a label dataflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = dflow.drop_columns(columns=['Primary Type', 'FBI Code'])\n",
"y = dflow.keep_columns(columns=['Primary Type'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instantiate AutoML <a class=\"anchor\" id=\"Instatiate-AutoML-Remote-DSVM\"></a>\n",
"\n",
"You can specify automl_settings as **kwargs** as well. Also note that you can use the get_data() symantic for local excutions too. \n",
"\n",
"<i>Note: For Remote DSVM and Batch AI you cannot pass Numpy arrays directly to AutoMLConfig.</i>\n",
"\n",
"|Property|Description|\n",
"|-|-|\n",
"|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|\n",
"|**iteration_timeout_minutes**|Time limit in minutes for each iteration|\n",
"|**iterations**|Number of iterations. In each iteration Auto ML trains a specific pipeline with the data|\n",
"|**n_cross_validations**|Number of cross validation splits|\n",
"|**max_concurrent_iterations**|Max number of iterations that would be executed in parallel. This should be less than the number of cores on the DSVM\n",
"|**preprocess**| *True/False* <br>Setting this to *True* enables Auto ML to perform preprocessing <br>on the input to handle *missing data*, and perform some common *feature extraction*|\n",
"|**enable_cache**|Setting this to *True* enables preprocess done once and reuse the same preprocessed data for all the iterations. Default value is True.|\n",
"|**max_cores_per_iteration**| Indicates how many cores on the compute target would be used to train a single pipeline.<br> Default is *1*, you can set it to *-1* to use all cores|"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import RunConfiguration\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"conda_run_config = RunConfiguration(framework=\"python\")\n",
"\n",
"conda_run_config.target = dsvm_compute\n",
"\n",
"cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]==0.1.0.1918169'], conda_packages=['numpy'], pin_sdk_version=False, pip_indexurl='https://azuremlsdktestpypi.azureedge.net/sdk-release/master/588E708E0DF342C4A80BD954289657CF')\n",
"conda_run_config.environment.python.conda_dependencies = cd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_settings = {\n",
" \"iteration_timeout_minutes\": 60,\n",
" \"iterations\": 4,\n",
" \"n_cross_validations\": 5,\n",
" \"primary_metric\": 'accuracy',\n",
" \"preprocess\": True,\n",
" \"max_cores_per_iteration\": 1,\n",
" \"verbosity\": logging.INFO\n",
"}\n",
"automl_config = AutoMLConfig(task='classification',\n",
" debug_log='automl_errors.log',\n",
" path=project_folder,\n",
" run_configuration=conda_run_config,\n",
" X=X,\n",
" y=y,\n",
" **automl_settings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training the Models <a class=\"anchor\" id=\"Training-the-model-Remote-DSVM\"></a>\n",
"\n",
"For remote runs the execution is asynchronous, so you will see the iterations get populated as they complete. You can interact with the widgets/models even when the experiment is running to retreive the best model up to that point. Once you are satisfied with the model you can cancel a particular iteration or the whole run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run = experiment.submit(automl_config, show_output=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exploring the Results <a class=\"anchor\" id=\"Exploring-the-Results-Remote-DSVM\"></a>\n",
"#### Widget for monitoring runs\n",
"\n",
"The widget will sit on \"loading\" until the first iteration completed, then you will see an auto-updating graph and table show up. It refreshed once per minute, so you should see the graph update as child runs complete.\n",
"\n",
"You can click on a pipeline to see run properties and output logs. Logs are also available on the DSVM under /tmp/azureml_run/{iterationid}/azureml-logs\n",
"\n",
"NOTE: The widget displays a link at the bottom. This links to a web-ui to explore the individual run details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(remote_run).show() "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Wait until the run finishes.\n",
"remote_run.wait_for_completion(show_output = True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Retrieve All Child Runs\n",
"You can also use sdk methods to fetch all the child runs and see individual metrics that we log. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"children = list(remote_run.get_children())\n",
"metricslist = {}\n",
"for run in children:\n",
" properties = run.get_properties()\n",
" metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)} \n",
" metricslist[int(properties['iteration'])] = metrics\n",
"\n",
"rundata = pd.DataFrame(metricslist).sort_index(1)\n",
"rundata"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Canceling Runs\n",
"You can cancel ongoing remote runs using the *cancel()* and *cancel_iteration()* functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Cancel the ongoing experiment and stop scheduling new iterations\n",
"# remote_run.cancel()\n",
"\n",
"# Cancel iteration 1 and move onto iteration 2\n",
"# remote_run.cancel_iteration(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pre-process cache cleanup\n",
"The preprocess data gets cache at user default file store. When the run is completed the cache can be cleaned by running below cell"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run.clean_preprocessor_cache()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve the Best Model\n",
"\n",
"Below we select the best pipeline from our iterations. The *get_output* method returns the best run and the fitted model. There are overloads on *get_output* that allow you to retrieve the best run and fitted model for *any* logged metric or a particular *iteration*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run, fitted_model = remote_run.get_output()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Best Model based on any other metric"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# lookup_metric = \"accuracy\"\n",
"# best_run, fitted_model = remote_run.get_output(metric=lookup_metric)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Model from a specific iteration"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# iteration = 1\n",
"# best_run, fitted_model = remote_run.get_output(iteration=iteration)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Testing the Best Fitted Model <a class=\"anchor\" id=\"Testing-the-Fitted-Model-Remote-DSVM\"></a>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.read_csv(path='https://dprepdata.blob.core.windows.net/demo/crime0-test.csv')\n",
"dflow.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pandas_ml import ConfusionMatrix\n",
"\n",
"y_test = dflow.keep_columns(columns=['Primary Type']).to_pandas_dataframe()\n",
"X_test = dflow.drop_columns(columns=['Primary Type', 'FBI Code']).to_pandas_dataframe()\n",
"\n",
"ypred = fitted_model.predict(X_test.values)\n",
"\n",
"cm = ConfusionMatrix(y_test['Primary Type'], ypred)\n",
"\n",
"print(cm)\n",
"\n",
"cm.plot()"
]
}
],
"metadata": {
"authors": [
{
"name": "savitam"
}
],
"kernelspec": {
"display_name": "Python [conda env:cli_dev]",
"language": "python",
"name": "conda-env-cli_dev-py"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,289 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 03. Train on Azure Container Instance\n",
"\n",
"* Create Workspace\n",
"* Create `train.py` in the project folder.\n",
"* Configure an ACI (Azure Container Instance) run\n",
"* Execute in ACI"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"Make sure you go through the [00. Installation and Configuration](00.configuration.ipynb) Notebook first if you haven't."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check core SDK version number\n",
"import azureml.core\n",
"\n",
"print(\"SDK version:\", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Workspace\n",
"\n",
"Initialize a workspace object from persisted configuration"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"create workspace"
]
},
"outputs": [],
"source": [
"from azureml.core import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create An Experiment\n",
"\n",
"**Experiment** is a logical container in an Azure ML Workspace. It hosts run records which can include run metrics and output artifacts from your experiments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Experiment\n",
"experiment_name = 'train-on-aci'\n",
"experiment = Experiment(workspace = ws, name = experiment_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Remote execution on ACI\n",
"\n",
"The training script `train.py` is already created for you. Let's have a look."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open('./train.py', 'r') as f:\n",
" print(f.read())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure for using ACI\n",
"Linux-based ACI is available in `West US`, `East US`, `West Europe`, `North Europe`, `West US 2`, `Southeast Asia`, `Australia East`, `East US 2`, and `Central US` regions. See details [here](https://docs.microsoft.com/en-us/azure/container-instances/container-instances-quotas#region-availability)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"configure run"
]
},
"outputs": [],
"source": [
"from azureml.core.runconfig import RunConfiguration\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"# create a new runconfig object\n",
"run_config = RunConfiguration()\n",
"\n",
"# signal that you want to use ACI to execute script.\n",
"run_config.target = \"containerinstance\"\n",
"\n",
"# ACI container group is only supported in certain regions, which can be different than the region the Workspace is in.\n",
"run_config.container_instance.region = 'eastus2'\n",
"\n",
"# set the ACI CPU and Memory \n",
"run_config.container_instance.cpu_cores = 1\n",
"run_config.container_instance.memory_gb = 2\n",
"\n",
"# enable Docker \n",
"run_config.environment.docker.enabled = True\n",
"\n",
"# set Docker base image to the default CPU-based image\n",
"run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE\n",
"\n",
"# use conda_dependencies.yml to create a conda environment in the Docker image for execution\n",
"run_config.environment.python.user_managed_dependencies = False\n",
"\n",
"# auto-prepare the Docker image when used for execution (if it is not already prepared)\n",
"run_config.auto_prepare_environment = True\n",
"\n",
"# specify CondaDependencies obj\n",
"run_config.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit the Experiment\n",
"Finally, run the training job on the ACI"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remote run",
"aci"
]
},
"outputs": [],
"source": [
"%%time \n",
"from azureml.core.script_run_config import ScriptRunConfig\n",
"\n",
"script_run_config = ScriptRunConfig(source_directory='./',\n",
" script='train.py',\n",
" run_config=run_config)\n",
"\n",
"run = experiment.submit(script_run_config)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"query history"
]
},
"outputs": [],
"source": [
"# Show run details\n",
"run"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remote run",
"aci"
]
},
"outputs": [],
"source": [
"%%time\n",
"# Shows output of the run on stdout.\n",
"run.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"get metrics"
]
},
"outputs": [],
"source": [
"# get all metris logged in the run\n",
"run.get_metrics()\n",
"metrics = run.get_metrics()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"print('When alpha is {1:0.2f}, we have min MSE {0:0.2f}.'.format(\n",
" min(metrics['mse']), \n",
" metrics['alpha'][np.argmin(metrics['mse'])]\n",
"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# show all the files stored within the run record\n",
"run.get_file_names()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can take a model produced here, register it and then deploy as a web service."
]
}
],
"metadata": {
"authors": [
{
"name": "roastala"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,44 +0,0 @@
# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from azureml.core.run import Run
from sklearn.externals import joblib
import os
import numpy as np
os.makedirs('./outputs', exist_ok=True)
X, y = load_diabetes(return_X_y=True)
run = Run.get_context()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)
data = {"train": {"X": X_train, "y": y_train},
        "test": {"X": X_test, "y": y_test}}
# list of numbers from 0.0 to 0.95 with a 0.05 interval
alphas = np.arange(0.0, 1.0, 0.05)
for alpha in alphas:
    # Use Ridge algorithm to create a regression model
    reg = Ridge(alpha=alpha)
    reg.fit(data["train"]["X"], data["train"]["y"])
    preds = reg.predict(data["test"]["X"])
    mse = mean_squared_error(preds, data["test"]["y"])
    run.log('alpha', alpha)
    run.log('mse', mse)
    model_file_name = 'ridge_{0:.2f}.pkl'.format(alpha)
    # save the model in the outputs folder so it automatically gets uploaded
    joblib.dump(value=reg, filename=os.path.join('./outputs/', model_file_name))
    print('alpha is {0:.2f}, and mse is {1:0.2f}'.format(alpha, mse))

View File

@@ -1,427 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial: Train a classification model with automated machine learning\n",
"\n",
"In this tutorial, you'll learn how to generate a machine learning model using automated machine learning (automated ML). Azure Machine Learning can perform algorithm selection and hyperparameter selection in an automated way for you. The final model can then be deployed following the workflow in the [Deploy a model](02.deploy-models.ipynb) tutorial.\n",
"\n",
"[flow diagram](./imgs/flow2.png)\n",
"\n",
"Similar to the [train models tutorial](01.train-models.ipynb), this tutorial classifies handwritten images of digits (0-9) from the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. But this time you don't to specify an algorithm or tune hyperparameters. The automated ML technique iterates over many combinations of algorithms and hyperparameters until it finds the best model based on your criterion.\n",
"\n",
"You'll learn how to:\n",
"\n",
"> * Set up your development environment\n",
"> * Access and examine the data\n",
"> * Train using an automated classifier locally with custom parameters\n",
"> * Explore the results\n",
"> * Review training results\n",
"> * Register the best model\n",
"\n",
"## Prerequisites\n",
"\n",
"Use [these instructions](https://aka.ms/aml-how-to-configure-environment) to: \n",
"* Create a workspace and its configuration file (**config.json**) \n",
"* Upload your **config.json** to the same folder as this notebook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Start a notebook\n",
"\n",
"To follow along, start a new notebook from the same directory as **config.json** and copy the code from the sections below.\n",
"\n",
"\n",
"## Set up your development environment\n",
"\n",
"All the setup for your development work can be accomplished in the Python notebook. Setup includes:\n",
"\n",
"* Import Python packages\n",
"* Configure a workspace to enable communication between your local computer and remote resources\n",
"* Create a directory to store training scripts\n",
"\n",
"### Import packages\n",
"Import Python packages you need in this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.core\n",
"import pandas as pd\n",
"from azureml.core.workspace import Workspace\n",
"from azureml.train.automl.run import AutoMLRun\n",
"import time\n",
"import logging\n",
"from sklearn import datasets\n",
"from matplotlib import pyplot as plt\n",
"from matplotlib.pyplot import imshow\n",
"import random\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure workspace\n",
"\n",
"Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **aml_config/config.json** and loads the details into an object named `ws`. `ws` is used throughout the rest of the code in this tutorial.\n",
"\n",
"Once you have a workspace object, specify a name for the experiment and create and register a local directory with the workspace. The history of all runs is recorded under the specified experiment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"# choose a name for the run history container in the workspace\n",
"experiment_name = 'automl-classifier'\n",
"# project folder\n",
"project_folder = './automl-classifier'\n",
"\n",
"import os\n",
"\n",
"output = {}\n",
"output['SDK version'] = azureml.core.VERSION\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Project Directory'] = project_folder\n",
"pd.set_option('display.max_colwidth', -1)\n",
"pd.DataFrame(data=output, index=['']).T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore data\n",
"\n",
"The initial training tutorial used a high-resolution version of the MNIST dataset (28x28 pixels). Since auto training requires many iterations, this tutorial uses a smaller resolution version of the images (8x8 pixels) to demonstrate the concepts while speeding up the time needed for each iteration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import datasets\n",
"\n",
"digits = datasets.load_digits()\n",
"\n",
"# Exclude the first 100 rows from training so that they can be used for test.\n",
"X_train = digits.data[100:,:]\n",
"y_train = digits.target[100:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Display some sample images\n",
"\n",
"Load the data into `numpy` arrays. Then use `matplotlib` to plot 30 random images from the dataset with their labels above them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"count = 0\n",
"sample_size = 30\n",
"plt.figure(figsize = (16, 6))\n",
"for i in np.random.permutation(X_train.shape[0])[:sample_size]:\n",
" count = count + 1\n",
" plt.subplot(1, sample_size, count)\n",
" plt.axhline('')\n",
" plt.axvline('')\n",
" plt.text(x = 2, y = -2, s = y_train[i], fontsize = 18)\n",
" plt.imshow(X_train[i].reshape(8, 8), cmap = plt.cm.Greys)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You now have the necessary packages and data ready for auto training for your model. \n",
"\n",
"## Auto train a model \n",
"\n",
"To auto train a model, first define settings for autogeneration and tuning and then run the automatic classifier.\n",
"\n",
"\n",
"### Define settings for autogeneration and tuning\n",
"\n",
"Define the experiment parameters and models settings for autogeneration and tuning. \n",
"\n",
"\n",
"|Property| Value in this tutorial |Description|\n",
"|----|----|---|\n",
"|**primary_metric**|AUC Weighted | Metric that you want to optimize.|\n",
"|**max_time_sec**|12,000|Time limit in seconds for each iteration|\n",
"|**iterations**|20|Number of iterations. In each iteration, the model trains with the data with a specific pipeline|\n",
"|**n_cross_validations**|3|Number of cross validation splits|\n",
"|**exit_score**|0.9985|*double* value indicating the target for *primary_metric*. Once the target is surpassed the run terminates|\n",
"|**blacklist_algos**|['kNN','LinearSVM']|*Array* of *strings* indicating algorithms to ignore.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"configure automl"
]
},
"outputs": [],
"source": [
"from azureml.train.automl import AutoMLConfig\n",
"\n",
"##Local compute \n",
"Automl_config = AutoMLConfig(task = 'classification',\n",
" primary_metric = 'AUC_weighted',\n",
" max_time_sec = 12000,\n",
" iterations = 20,\n",
" n_cross_validations = 3,\n",
" exit_score = 0.9985,\n",
" blacklist_algos = ['kNN','LinearSVM'],\n",
" X = X_train,\n",
" y = y_train,\n",
" path=project_folder)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run the automatic classifier\n",
"\n",
"Start the experiment to run locally. Define the compute target as local and set the output to true to view progress on the experiment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"local submitted run",
"automl"
]
},
"outputs": [],
"source": [
"from azureml.core.experiment import Experiment\n",
"experiment=Experiment(ws, experiment_name)\n",
"local_run = experiment.submit(Automl_config, show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore the results\n",
"\n",
"Explore the results of automatic training with a Jupyter widget or by examining the experiment history.\n",
"\n",
"### Jupyter widget\n",
"\n",
"Use the Jupyter notebook widget to see a graph and a table of all results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"use notebook widget"
]
},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(local_run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve all iterations\n",
"\n",
"View the experiment history and see individual metrics for each iteration run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"get metrics",
"query history"
]
},
"outputs": [],
"source": [
"children = list(local_run.get_children())\n",
"metricslist = {}\n",
"for run in children:\n",
" properties = run.get_properties()\n",
" metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}\n",
" metricslist[int(properties['iteration'])] = metrics\n",
"\n",
"import pandas as pd\n",
"rundata = pd.DataFrame(metricslist).sort_index(1)\n",
"rundata"
]
},
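{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, you can read the best iteration straight off this table. A minimal sketch, assuming the primary metric `AUC_weighted` appears as a row label in `rundata`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# find the iteration with the highest AUC_weighted value\n",
"# (assumes 'AUC_weighted' is a row label in rundata)\n",
"best_iteration = rundata.loc['AUC_weighted'].idxmax()\n",
"print(best_iteration, rundata.loc['AUC_weighted', best_iteration])"
]
},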
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Register the best model \n",
"\n",
"Use the `local_run` object to get the best model and register it into the workspace. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"query history",
"register model from history"
]
},
"outputs": [],
"source": [
"# find the run with the highest accuracy value.\n",
"best_run, fitted_model = local_run.get_output()\n",
"\n",
"# register model in workspace\n",
"description = 'Automated Machine Learning Model'\n",
"tags = None\n",
"local_run.register_model(description=description, tags=tags)\n",
"local_run.model_id # Use this id to deploy the model as a web service in Azure"
]
},
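{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you later need a handle to the registered model, for example from another notebook, you can look it up in the workspace. A minimal sketch, assuming the id printed above identifies the model registered in this workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.model import Model\n",
"\n",
"# assumption: local_run.model_id identifies the registered model in ws\n",
"model = Model(ws, id=local_run.model_id)\n",
"print(model.name, model.version)"
]
},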
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test the best model\n",
"\n",
"Use the model to predict a few random digits. Display the predicted and the image. Red font and inverse image (white on black) is used to highlight the misclassified samples.\n",
"\n",
"Since the model accuracy is high, you might have to run the following code a few times before you can see a misclassified sample."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# find 30 random samples from test set\n",
"n = 30\n",
"X_test = digits.data[:100, :]\n",
"y_test = digits.target[:100]\n",
"sample_indices = np.random.permutation(X_test.shape[0])[0:n]\n",
"test_samples = X_test[sample_indices]\n",
"\n",
"\n",
"# predict using the model\n",
"result = fitted_model.predict(test_samples)\n",
"\n",
"# compare actual value vs. the predicted values:\n",
"i = 0\n",
"plt.figure(figsize = (20, 1))\n",
"\n",
"for s in sample_indices:\n",
" plt.subplot(1, n, i + 1)\n",
" plt.axhline('')\n",
" plt.axvline('')\n",
" \n",
" # use different color for misclassified sample\n",
" font_color = 'red' if y_test[s] != result[i] else 'black'\n",
" clr_map = plt.cm.gray if y_test[s] != result[i] else plt.cm.Greys\n",
" \n",
" plt.text(x = 2, y = -2, s = result[i], fontsize = 18, color = font_color)\n",
" plt.imshow(X_test[s].reshape(8, 8), cmap = clr_map)\n",
" \n",
" i = i + 1\n",
"plt.show()"
]
},
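{
"cell_type": "markdown",
"metadata": {},
"source": [
"To quantify performance on the whole 100-row holdout rather than eyeballing 30 images, you can score every held-out sample. A short sketch using scikit-learn's `accuracy_score`; it assumes the `X_test` and `y_test` arrays defined in the cell above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score\n",
"\n",
"# score the fitted model on the full holdout set defined above\n",
"y_pred = fitted_model.predict(X_test)\n",
"print('holdout accuracy:', accuracy_score(y_test, y_pred))"
]
},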
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"In this Azure Machine Learning tutorial, you used Python to:\n",
"\n",
"> * Set up your development environment\n",
"> * Access and examine the data\n",
"> * Train using an automated classifier locally with custom parameters\n",
"> * Explore the results\n",
"> * Review training results\n",
"> * Register the best model\n",
"\n",
"Learn more about [how to configure settings for automatic training](https://aka.ms/aml-how-to-configure-auto) or [how to use automatic training on a remote resource](https://aka.ms/aml-how-to-auto-remote)."
]
}
],
"metadata": {
"authors": [
{
"name": "jeffshep"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
},
"msauthor": "sgilley"
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,561 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial: Use Azure DataPrep SDK to prepare data for machine learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Prepare data for use as a training data set in a machine learning model with the Azure DataPrep SDK. Perform various transformations to filter and combine two different NYC Taxi data sets. Learn some of the unique features of the DataPrep SDK: \n",
"\n",
"* Transform data from derived examples \n",
"* Infer field type from data \n",
"\n",
"This tutorial is part one of a two-part tutorial series."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial you:\n",
"* Load two datasets with different field names \n",
"* Cleanse the data \n",
"* Use smart transforms to predict your logic based on an example\n",
"* Use automated feature engineering to build dynamic fields \n",
"* Merge the two datasets to use for your machine learning training \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import packages\n",
"Begin by importing the Azure DataPrep SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data\n",
"Download two different NYC Taxi data sets into dataflow objects. These datasets contain slightly different fields. The method `auto_read_file()` automatically recognizes the input file type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset_root = \"https://dprepdata.blob.core.windows.net/demo\"\n",
"\n",
"green_path = \"/\".join([dataset_root, \"green-small/*\"])\n",
"yellow_path = \"/\".join([dataset_root, \"yellow-small/*\"])\n",
"\n",
"green_df = dprep.read_csv(path=green_path, header=dprep.PromoteHeadersMode.GROUPED)\n",
"yellow_df = dprep.auto_read_file(path=yellow_path)\n",
"\n",
"display(green_df.head(5))\n",
"display(yellow_df.head(5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data transformation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you populate some variables with shortcut transforms that will apply to all dataflows. The variable `drop_if_all_null` will be used to delete records where all fields are null. The variable `useful_columns` holds an array of column descriptions that are retained in each dataflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_columns = dprep.ColumnSelector(term=\".*\", use_regex=True)\n",
"drop_if_all_null = [all_columns, dprep.ColumnRelationship(dprep.ColumnRelationship.ALL)]\n",
"useful_columns = [\n",
" \"cost\", \"distance\"\"distance\", \"dropoff_datetime\", \"dropoff_latitude\", \"dropoff_longitude\",\n",
" \"passengers\", \"pickup_datetime\", \"pickup_latitude\", \"pickup_longitude\", \"store_forward\", \"vendor\"\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You first work with the green taxi data and get it into a valid shape that can be combined with the yellow taxi data. Create a temporary dataflow `tmp_df`, and call the `replace_na()`, `drop_nulls()`, and `keep_columns()` functions using the shortcut transform variables you created. Additionally, rename all the columns in the dataframe to match the names in `useful_columns`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = (green_df\n",
" .replace_na(columns=all_columns)\n",
" .drop_nulls(*drop_if_all_null)\n",
" .rename_columns(column_pairs={\n",
" \"VendorID\": \"vendor\",\n",
" \"lpep_pickup_datetime\": \"pickup_datetime\",\n",
" \"Lpep_dropoff_datetime\": \"dropoff_datetime\",\n",
" \"lpep_dropoff_datetime\": \"dropoff_datetime\",\n",
" \"Store_and_fwd_flag\": \"store_forward\",\n",
" \"store_and_fwd_flag\": \"store_forward\",\n",
" \"Pickup_longitude\": \"pickup_longitude\",\n",
" \"Pickup_latitude\": \"pickup_latitude\",\n",
" \"Dropoff_longitude\": \"dropoff_longitude\",\n",
" \"Dropoff_latitude\": \"dropoff_latitude\",\n",
" \"Passenger_count\": \"passengers\",\n",
" \"Fare_amount\": \"cost\",\n",
" \"Trip_distance\": \"distance\"\n",
" })\n",
" .keep_columns(columns=useful_columns))\n",
"tmp_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Overwrite the `green_df` variable with the transforms performed on `tmp_df` in the previous step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"green_df = tmp_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perform the same transformation steps to the yellow taxi data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = (yellow_df\n",
" .replace_na(columns=all_columns)\n",
" .drop_nulls(*drop_if_all_null)\n",
" .rename_columns(column_pairs={\n",
" \"vendor_name\": \"vendor\",\n",
" \"VendorID\": \"vendor\",\n",
" \"vendor_id\": \"vendor\",\n",
" \"Trip_Pickup_DateTime\": \"pickup_datetime\",\n",
" \"tpep_pickup_datetime\": \"pickup_datetime\",\n",
" \"Trip_Dropoff_DateTime\": \"dropoff_datetime\",\n",
" \"tpep_dropoff_datetime\": \"dropoff_datetime\",\n",
" \"store_and_forward\": \"store_forward\",\n",
" \"store_and_fwd_flag\": \"store_forward\",\n",
" \"Start_Lon\": \"pickup_longitude\",\n",
" \"Start_Lat\": \"pickup_latitude\",\n",
" \"End_Lon\": \"dropoff_longitude\",\n",
" \"End_Lat\": \"dropoff_latitude\",\n",
" \"Passenger_Count\": \"passengers\",\n",
" \"passenger_count\": \"passengers\",\n",
" \"Fare_Amt\": \"cost\",\n",
" \"fare_amount\": \"cost\",\n",
" \"Trip_Distance\": \"distance\",\n",
" \"trip_distance\": \"distance\"\n",
" })\n",
" .keep_columns(columns=useful_columns))\n",
"tmp_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, overwrite `yellow_df` with `tmp_df`, and then call the `append_rows()` function on the green taxi data to append the yellow taxi data, creating a new combined dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"yellow_df = tmp_df\n",
"combined_df = green_df.append_rows(other_activities=[yellow_df])"
]
},
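{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the merge, peek at the combined dataflow; rows from both sources should now share the renamed schema. (This uses the same `head()` call as elsewhere in this tutorial.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# confirm that both sources were appended into one schema\n",
"combined_df.head(5)"
]
},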
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Convert types and filter "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examine the pickup and drop-off coordinates summary statistics to see how the data is distributed. First define a `TypeConverter` object to change the lat/long fields to decimal type. Next, call the `keep_columns()` function to restrict output to only the lat/long fields, and then call `get_profile()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"decimal_type = dprep.TypeConverter(data_type=dprep.FieldType.DECIMAL)\n",
"combined_df = combined_df.set_column_types(type_conversions={\n",
" \"pickup_longitude\": decimal_type,\n",
" \"pickup_latitude\": decimal_type,\n",
" \"dropoff_longitude\": decimal_type,\n",
" \"dropoff_latitude\": decimal_type\n",
"})\n",
"combined_df.keep_columns(columns=[\n",
" \"pickup_longitude\", \"pickup_latitude\", \n",
" \"dropoff_longitude\", \"dropoff_latitude\"\n",
"]).get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the summary statistics output, you see that there are coordinates that are missing, and coordinates that are not in New York City. Filter out coordinates not in the city border by chaining column filter commands within the `filter()` function, and defining minimum and maximum bounds for each field. Then call `get_profile()` again to verify the transformation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = (combined_df\n",
" .drop_nulls(\n",
" columns=[\"pickup_longitude\", \"pickup_latitude\", \"dropoff_longitude\", \"dropoff_latitude\"],\n",
" column_relationship=dprep.ColumnRelationship(dprep.ColumnRelationship.ANY)\n",
" ) \n",
" .filter(dprep.f_and(\n",
" dprep.col(\"pickup_longitude\") <= -73.72,\n",
" dprep.col(\"pickup_longitude\") >= -74.09,\n",
" dprep.col(\"pickup_latitude\") <= 40.88,\n",
" dprep.col(\"pickup_latitude\") >= 40.53,\n",
" dprep.col(\"dropoff_longitude\") <= -73.72,\n",
" dprep.col(\"dropoff_longitude\") >= -74.09,\n",
" dprep.col(\"dropoff_latitude\") <= 40.88,\n",
" dprep.col(\"dropoff_latitude\") >= 40.53\n",
" )))\n",
"tmp_df.keep_columns(columns=[\n",
" \"pickup_longitude\", \"pickup_latitude\", \n",
" \"dropoff_longitude\", \"dropoff_latitude\"\n",
"]).get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Overwrite `combined_df` with the transformations you made to `tmp_df`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"combined_df = tmp_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Split and rename columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at the data profile for the `store_forward` column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"combined_df.keep_columns(columns='store_forward').get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the data profile output of `store_forward`, you see that the data is inconsistent and there are missing/null values. Replace these values using the `replace()` and `fill_nulls()` functions, and in both cases change to the string \"N\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"combined_df = combined_df.replace(columns=\"store_forward\", find=\"0\", replace_with=\"N\").fill_nulls(\"store_forward\", \"N\")"
]
},
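{
"cell_type": "markdown",
"metadata": {},
"source": [
"To confirm the replacement worked, profile the column again; the distinct values should now collapse to \"Y\" and \"N\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# re-profile store_forward to verify only 'Y'/'N' values remain\n",
"combined_df.keep_columns(columns='store_forward').get_profile()"
]
},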
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the pick up and drop off datetimes into respective date and time columns. Use `split_column_by_example()` to perform the split. In this case, the optional `example` parameter of `split_column_by_example()` is omitted. Therefore the function will automatically determine where to split based on the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = (combined_df\n",
" .split_column_by_example(source_column=\"pickup_datetime\")\n",
" .split_column_by_example(source_column=\"dropoff_datetime\"))\n",
"tmp_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rename the columns generated by `split_column_by_example()` into meaningful names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df_renamed = (tmp_df\n",
" .rename_columns(column_pairs={\n",
" \"pickup_datetime_1\": \"pickup_date\",\n",
" \"pickup_datetime_2\": \"pickup_time\",\n",
" \"dropoff_datetime_1\": \"dropoff_date\",\n",
" \"dropoff_datetime_2\": \"dropoff_time\"\n",
" }))\n",
"tmp_df_renamed.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Overwrite `combined_df` with the executed transformations, and then call `get_profile()` to see full summary statistics after all transformations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"combined_df = tmp_df_renamed\n",
"combined_df.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature engineering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the pickup and drop-off date further into day of week, day of month, and month. To get day of week, use the `derive_column_by_example()` function. This function takes as a parameter an array of example objects that define the input data, and the desired output. The function then automatically determines your desired transformation. For pickup and drop-off time columns, split into hour, minute, and second using the `split_column_by_example()` function with no example parameter.\n",
"\n",
"Once you have generated these new features, delete the original fields in favor of the newly generated features using `drop_columns()`. Rename all remaining fields to accurate descriptions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = (combined_df\n",
" .derive_column_by_example(\n",
" source_columns=\"pickup_date\", \n",
" new_column_name=\"pickup_weekday\", \n",
" example_data=[(\"2009-01-04\", \"Sunday\"), (\"2013-08-22\", \"Thursday\")]\n",
" )\n",
" .derive_column_by_example(\n",
" source_columns=\"dropoff_date\",\n",
" new_column_name=\"dropoff_weekday\",\n",
" example_data=[(\"2013-08-22\", \"Thursday\"), (\"2013-11-03\", \"Sunday\")]\n",
" )\n",
" \n",
" .split_column_by_example(source_column=\"pickup_time\")\n",
" .split_column_by_example(source_column=\"dropoff_time\")\n",
" # the following two split_column_by_example calls reference the generated column names from the above two calls\n",
" .split_column_by_example(source_column=\"pickup_time_1\")\n",
" .split_column_by_example(source_column=\"dropoff_time_1\")\n",
" .drop_columns(columns=[\n",
" \"pickup_date\", \"pickup_time\", \"dropoff_date\", \"dropoff_time\", \n",
" \"pickup_date_1\", \"dropoff_date_1\", \"pickup_time_1\", \"dropoff_time_1\"\n",
" ])\n",
" \n",
" .rename_columns(column_pairs={\n",
" \"pickup_date_2\": \"pickup_month\",\n",
" \"pickup_date_3\": \"pickup_monthday\",\n",
" \"pickup_time_1_1\": \"pickup_hour\",\n",
" \"pickup_time_1_2\": \"pickup_minute\",\n",
" \"pickup_time_2\": \"pickup_second\",\n",
" \"dropoff_date_2\": \"dropoff_month\",\n",
" \"dropoff_date_3\": \"dropoff_monthday\",\n",
" \"dropoff_time_1_1\": \"dropoff_hour\",\n",
" \"dropoff_time_1_2\": \"dropoff_minute\",\n",
" \"dropoff_time_2\": \"dropoff_second\"\n",
" }))\n",
"\n",
"tmp_df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the data above, you see that the pickup and drop-off date and time components produced from the derived transformations are correct. Drop the `pickup_datetime` and `dropoff_datetime` columns as they are no longer needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = tmp_df.drop_columns(columns=[\"pickup_datetime\", \"dropoff_datetime\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the type inference functionality to automatically check the data type of each field, and display the inference results using `inference_info()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type_infer = tmp_df.builders.set_column_types()\n",
"type_infer.learn()\n",
"type_infer.inference_info"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The inference results look correct based on the data, now apply the type conversions to the dataflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = type_infer.to_dataflow()\n",
"tmp_df.get_profile()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point, you have a fully transformed and prepared dataflow object to use in a machine learning model. The DataPrep SDK includes object serialization functionality, which is used as follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_prepared = tmp_df\n",
"package = dprep.Package([dflow_prepared])\n",
"package.save(\".\\dflow\")"
]
}
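,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In a follow-on notebook you can reopen the saved package and recover the prepared dataflow. A minimal sketch, assuming the save path used above and the `Package.open()` pattern from part two of this tutorial series."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# reopen the serialized package and retrieve the prepared dataflow\n",
"package_saved = dprep.Package.open(\"./dflow\")\n",
"dflow_prepared = package_saved.dataflows[0]\n",
"dflow_prepared.get_profile()"
]
}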
],
"metadata": {
"authors": [
{
"name": "cforbe"
}
],
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"msauthor": "trbye"
},
"nbformat": 4,
"nbformat_minor": 2
}