{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automated Machine Learning - AutoFeaturization (Part 1)\n",
"_**Autofeaturization of credit card fraudulent transactions dataset on remote compute**_\n",
"\n",
"## Contents\n",
"1. [Introduction](#Introduction)\n",
"1. [Setup](#Setup)\n",
"1. [Data](#Data)\n",
"1. [Autofeaturization](#Autofeaturization)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Autofeaturization is a new feature that lets you run an AutoML experiment that only featurizes the datasets. The featurized datasets and the transformer are stored in the experiment, from which they can later be retrieved and used to train models, either via AutoML or custom training.\n",
"\n",
"**To run Autofeaturization, pass in zero iterations and featurization as auto. This will featurize the datasets and terminate the experiment. Training will not occur.**\n",
"\n",
"*Limitations - Sparse data is not supported at the moment. A dataset with extensive categorical data might be featurized into sparse data, which is not allowed as input to AutoML. Efforts are underway to support sparse data.*\n",
"\n",
"In this example we use the credit card fraudulent transactions dataset to showcase how you can use AutoML for autofeaturization. The goal is to clean and featurize the training dataset.\n",
"\n",
"This notebook uses remote compute to complete the featurization.\n",
"\n",
"If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the [configuration](../../configuration.ipynb) notebook first, if you haven't already, to establish your connection to the AzureML Workspace.\n",
"\n",
"In the steps below, you will learn how to:\n",
"1. Create an autofeaturization experiment using an existing workspace.\n",
"2. View the featurized datasets and transformer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Setup\n",
"\n",
"As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"import pandas as pd\n",
"import azureml.core\n",
"from azureml.core.experiment import Experiment\n",
"from azureml.core.workspace import Workspace\n",
"from azureml.core.dataset import Dataset\n",
"from azureml.train.automl import AutoMLConfig"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This sample notebook may use features that are not available in previous versions of the Azure ML SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"This notebook was created using version 1.59.0 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"# choose a name for experiment\n",
"experiment_name = 'automl-autofeaturization-ccard-remote'\n",
"\n",
"experiment = Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output['Subscription ID'] = ws.subscription_id\n",
"output['Workspace'] = ws.name\n",
"output['Resource Group'] = ws.resource_group\n",
"output['Location'] = ws.location\n",
"output['Experiment Name'] = experiment.name\n",
"outputDf = pd.DataFrame(data=output, index=[''])\n",
"outputDf.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create or Attach existing AmlCompute\n",
"A compute target is required to execute the Automated ML run. In this tutorial, you create AmlCompute as your training compute resource.\n",
"\n",
"> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.\n",
"\n",
"#### Creation of AmlCompute takes approximately 5 minutes.\n",
"If an AmlCompute with that name already exists in your workspace, this code skips the creation process.\n",
"As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# Choose a name for your CPU cluster\n",
"cpu_cluster_name = \"cpu-cluster\"\n",
"\n",
"# Verify that cluster does not exist already\n",
"try:\n",
"    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)\n",
"    print('Found existing cluster, use it.')\n",
"except ComputeTargetException:\n",
"    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',\n",
"                                                           max_nodes=6)\n",
"    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)\n",
"\n",
"compute_target.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Data\n",
"\n",
"Load the credit card fraudulent transactions dataset from a CSV file, containing both training features and labels. The features are inputs to the model, while the training labels represent the expected output of the model. \n",
"\n",
"Here the autofeaturization run will featurize the training data passed in."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Training Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"training_data = \"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard_train.csv\"\n",
"training_dataset = Dataset.Tabular.from_delimited_files(training_data) # Tabular dataset\n",
"\n",
"label_column_name = 'Class' # output label"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Autofeaturization\n",
"\n",
"Instantiate an AutoMLConfig object. This defines the settings and data used to run the autofeaturization experiment.\n",
"\n",
"|Property|Description|\n",
"|-|-|\n",
"|**task**|classification or regression|\n",
"|**training_data**|Input training dataset, containing both features and label column.|\n",
"|**iterations**|For an autofeaturization run, iterations will be 0.|\n",
"|**featurization**|For an autofeaturization run, featurization will be 'auto'.|\n",
"|**label_column_name**|The name of the label column.|\n",
"\n",
"**_You can find more information about primary metrics_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#primary-metric)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"automl_config = AutoMLConfig(\n",
"    task='classification',\n",
"    debug_log='automl_errors.log',\n",
"    iterations=0,  # autofeaturization run can be triggered by setting iterations to 0\n",
"    compute_target=compute_target,\n",
"    training_data=training_dataset,\n",
"    label_column_name=label_column_name,\n",
"    featurization='auto',\n",
"    verbosity=logging.INFO\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Call the `submit` method on the experiment object and pass the run configuration. Depending on the data, this can run for a while. With `show_output=True`, validation errors and the current status are shown and the execution is synchronous. The cell below uses `show_output=False`, so the run is monitored and awaited in later cells."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run = experiment.submit(automl_config, show_output = False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transformer and Featurized Datasets\n",
"Once the run completes, the featurized datasets are stored under `Outputs + logs` on the details page of the remote run. The featurized dataset is stored under `/outputs/featurization/data` and the transformer is saved under `/outputs/featurization/pipeline`.\n",
"\n",
"The sections below show how to refer to the data saved in your run and retrieve it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Widget for Monitoring Runs\n",
"\n",
"The widget will first report a \"loading\" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.\n",
"\n",
"**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(remote_run).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run.wait_for_completion(show_output=False)"
]
},
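{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to view the saved structure from code is to list the files attached to the completed run. The cell below is a minimal sketch: it assumes `remote_run.get_file_names()` returns the artifact paths of the parent run and that those paths contain the `outputs/featurization/` prefix described above. Adjust the filter if your run stores the files differently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List the files that the featurization step saved with the run.\n",
"# The 'outputs/featurization/' filter is an assumption based on the paths quoted above.\n",
"featurization_files = [name for name in remote_run.get_file_names()\n",
"                       if 'outputs/featurization/' in name]\n",
"\n",
"for name in featurization_files:\n",
"    print(name)"
]
},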
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automated Machine Learning - AutoFeaturization (Part 2)\n",
"_**Training using a custom model with the featurized data from Autofeaturization run of credit card fraudulent transactions dataset**_\n",
"\n",
"## Contents\n",
"1. [Introduction](#Introduction)\n",
"1. [Data Setup](#Data-Setup)\n",
"1. [Autofeaturization Data](#Autofeaturization-Data)\n",
"1. [Train](#Train)\n",
"1. [Analyze results](#Analyze-results)\n",
"1. [Test the fitted model](#Test-the-fitted-model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Introduction\n",
"\n",
"Here we use the featurized dataset saved in the above run to showcase how you can perform custom training, using the transformer from an autofeaturization run to transform validation / test datasets.\n",
"\n",
"The goal is to use the featurized data and the transformer from the autofeaturization run to run a custom training experiment independently.\n",
"\n",
"In the steps below, you will learn how to:\n",
"1. Read the transformer from a completed autofeaturization run and transform data.\n",
"2. Pull featurized data from a completed autofeaturization run.\n",
"3. Run a custom training experiment with the above data.\n",
"4. Check results."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Data Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will load the featurized training data and also load the transformer from the above autofeaturized run. This transformer can then be used to transform the test data to check the accuracy of the custom model after training."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Test Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the test dataset from CSV and split it into X and y columns, which are featurized with the transformer later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_data = \"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard_test.csv\"\n",
"\n",
"test_dataset = pd.read_csv(test_data)\n",
"label_column_name = 'Class'\n",
"\n",
"# drop() keeps the original column order; columns.difference() would sort the columns alphabetically\n",
"X_test_data = test_dataset.drop(columns=[label_column_name])\n",
"y_test_data = test_dataset[label_column_name].values\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load data_transformer from the above remote run artifact"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### (Method 1)\n",
"\n",
"Method 1 allows you to read the transformer from the remote storage."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import mlflow\n",
"mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())\n",
"\n",
"# Set uri to fetch data transformer from remote parent run.\n",
"artifact_path = \"/outputs/featurization/pipeline/\"\n",
"uri = \"runs:/\" + remote_run.id + artifact_path\n",
"\n",
"print(uri)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### (Method 2)\n",
"\n",
"Method 2 downloads the transformer to a local directory, from which it can then be loaded to transform the data. Uncomment to use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' import pathlib\n",
"\n",
"# Download the transformer to the local directory\n",
"transformers_file_path = \"/outputs/featurization/pipeline/\"\n",
"local_path = \"./transformer\"\n",
"remote_run.download_files(prefix=transformers_file_path, output_directory=local_path, batch_size=500)\n",
"\n",
"path = pathlib.Path(\"transformer\") \n",
"path = str(path.absolute()) + transformers_file_path\n",
"str_uri = \"file:///\" + path\n",
"\n",
"print(str_uri) '''"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transform Data"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** Not all datasets produce a y_transformer. The dataset used in this notebook requires one because the y column data is categorical.\n",
"\n",
"We will download the mlflow transformer model and use it to transform the test data, which can then be used for further experimentation below. To run the commented code, make sure the environment requirement is satisfied. You can create the environment from the `conda.yaml` file under `/outputs/featurization/pipeline/` and run the given code in it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' from azureml.automl.core.shared.constants import Transformers\n",
"\n",
"transformers = mlflow.sklearn.load_model(uri) # Using method 1\n",
"data_transformers = transformers.get_transformers()\n",
"x_transformer = data_transformers[Transformers.X_TRANSFORMER]\n",
"y_transformer = data_transformers[Transformers.Y_TRANSFORMER]\n",
"\n",
"X_test = x_transformer.transform(X_test_data)\n",
"y_test = y_transformer.transform(y_test_data) '''"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following cell to see the featurization summary of the X transformer. Uncomment to use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"''' X_data_summary = x_transformer.get_featurization_summary(is_user_friendly=False)\n",
"\n",
"summary_df = pd.DataFrame.from_records(X_data_summary)\n",
"summary_df '''"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Datastore\n",
"\n",
"The datastore below holds the featurized datasets, so we load it to access the data. Check the path and file names against the structure saved under `Outputs + logs` in your experiment, as seen in Autofeaturization Part 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.datastore import Datastore\n",
"\n",
"ds = Datastore.get(ws, \"workspaceartifactstore\")\n",
"experiment_loc = \"ExperimentRun/dcid.\" + remote_run.id\n",
"\n",
"remote_data_path = \"/outputs/featurization/data/\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Autofeaturization Data\n",
"\n",
"We will load the training data from the previously completed Autofeaturization experiment. The resulting featurized dataframe can be passed into the custom model for training. Here we download the file from the experiment storage to a local path and then read the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_data_file_path = \"full_training_dataset.df.parquet\"\n",
"local_data_path = \"./data/\" + train_data_file_path\n",
"\n",
"remote_run.download_file(remote_data_path + train_data_file_path, local_data_path)\n",
"\n",
"full_training_data = pd.read_parquet(local_data_path)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Another way to load the data is to open the above autofeaturization experiment and look up the featurized dataset ids under `Output datasets`. To use this approach, uncomment the code below and replace the id with your own."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# train_data = Dataset.get_by_id(ws, 'cb4418ee-bac4-45ac-b055-600653bdf83a') # replace the featurized full_training_dataset id\n",
"# full_training_data = train_data.to_pandas_dataframe()"
]
},
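{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before splitting the data, it can help to confirm what was loaded. The cell below is a small sketch that prints the shape of the featurized dataframe and checks for the `automl_y` and `automl_weights` columns, the two column names this notebook relies on in the next section. It assumes `full_training_data` was loaded by one of the two approaches above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick look at the featurized dataframe loaded above.\n",
"print('Rows, columns:', full_training_data.shape)\n",
"\n",
"# automl_y and automl_weights are the column names used in the next section.\n",
"for column in ('automl_y', 'automl_weights'):\n",
"    print(column, 'present:', column in full_training_data.columns)"
]
},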
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We separate the y column and the weights column from the featurized training dataset. The remaining columns are the training features."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y_COLUMN = \"automl_y\"\n",
"SW_COLUMN = \"automl_weights\"\n",
"\n",
"# drop() keeps the featurized column order; columns.difference() would sort the columns alphabetically\n",
"X_train = full_training_data.drop(columns=[Y_COLUMN, SW_COLUMN])\n",
"y_train = full_training_data[Y_COLUMN].values\n",
"sample_weight = full_training_data[SW_COLUMN].values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Train"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we pass our training data to the LightGBM classifier; any custom model can be used with your data. Let us first install lightgbm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install lightgbm"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import lightgbm as lgb\n",
"\n",
"# max_depth <= 0 means no depth limit in LightGBM; -1 is the conventional value\n",
"model = lgb.LGBMClassifier(learning_rate=0.08, max_depth=-1, random_state=42)\n",
"model.fit(X_train, y_train, sample_weight=sample_weight)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Once training is done, the test data transformed with the transformer downloaded above can be used to calculate the accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Training accuracy {:.4f}'.format(model.score(X_train, y_train)))\n",
"\n",
"# Uncomment below to test the model on test data \n",
"# print('Testing accuracy {:.4f}'.format(model.score(X_test, y_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Analyze results\n",
"\n",
"### Retrieve the Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Test the fitted model\n",
"\n",
"Now that the model is trained, run the transformed test data through it to get the predicted values. The test data was split into features and labels in the same way as the training data; the difference is that the split was done locally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uncomment below to test the model on test data\n",
"# y_pred = model.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Experiment Complete!"
]
}
],
"metadata": {
"authors": [
{
"name": "bhavanatumma"
}
],
"interpreter": {
"hash": "adb464b67752e4577e3dc163235ced27038d19b7d88def00d75d1975bde5d9ab"
},
"kernelspec": {
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}