{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright (c) Microsoft Corporation. All rights reserved.\n", "\n", "Licensed under the MIT License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/work-with-data/datasets-tutorial/train-with-datasets/train-with-datasets.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Train with Azure Machine Learning datasets\n", "Datasets are categorized into TabularDataset and FileDataset based on how users consume them in training. \n", "* A TabularDataset represents data in a tabular format by parsing the provided file or list of files. TabularDataset can be created from csv, tsv, parquet files, SQL query results etc. For the complete list, please visit our [documentation](https://aka.ms/tabulardataset-api-reference). It provides you with the ability to materialize the data into a pandas DataFrame.\n", "* A FileDataset references single or multiple files in your datastores or public urls. This provides you with the ability to download or mount the files to your compute. The files can be of any format, which enables a wider range of machine learning scenarios including deep learning.\n", "\n", "In this tutorial, you will learn how to train with Azure Machine Learning datasets:\n", "\n", "☑ Use datasets directly in your training script\n", "\n", "☑ Use datasets to mount files to a remote compute" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) first if you haven't already established your connection to the AzureML Workspace." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check core SDK version number\n", "import azureml.core\n", "\n", "print('SDK version:', azureml.core.VERSION)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initialize Workspace\n", "\n", "Initialize a workspace object from persisted configuration." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.core import Workspace\n", "\n", "ws = Workspace.from_config()\n", "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Experiment\n", "\n", "**Experiment** is a logical container in an Azure ML Workspace. It hosts run records which can include run metrics and output artifacts from your experiments." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "experiment_name = 'train-with-datasets'\n", "\n", "from azureml.core import Experiment\n", "exp = Experiment(workspace=ws, name=experiment_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create or Attach existing compute resource\n", "By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as your training environment. 
The code below creates the compute cluster for you if it doesn't already exist in your workspace.\n", "\n", "**Creation of compute takes approximately 5 minutes.** If an AmlCompute with that name is already in your workspace, the code will skip the creation process." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.core.compute import AmlCompute\n", "from azureml.core.compute import ComputeTarget\n", "import os\n", "\n", "# choose a name for your cluster\n", "compute_name = os.environ.get('AML_COMPUTE_CLUSTER_NAME', 'cpu-cluster')\n", "compute_min_nodes = os.environ.get('AML_COMPUTE_CLUSTER_MIN_NODES', 0)\n", "compute_max_nodes = os.environ.get('AML_COMPUTE_CLUSTER_MAX_NODES', 4)\n", "\n", "# This example uses a CPU VM. To use a GPU VM, set the SKU to STANDARD_NC6\n", "vm_size = os.environ.get('AML_COMPUTE_CLUSTER_SKU', 'STANDARD_D2_V2')\n", "\n", "\n", "if compute_name in ws.compute_targets:\n", "    compute_target = ws.compute_targets[compute_name]\n", "    if compute_target and isinstance(compute_target, AmlCompute):\n", "        print('found existing compute target: ' + compute_name)\n", "else:\n", "    print('creating a new compute target...')\n", "    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,\n", "                                                                min_nodes=compute_min_nodes,\n", "                                                                max_nodes=compute_max_nodes)\n", "\n", "    # create the cluster\n", "    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)\n", "\n", "    # you can poll for a minimum number of nodes and for a specific timeout;\n", "    # if no min node count is provided, it will use the scale settings for the cluster\n", "    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n", "\n", "    # For a more detailed view of the current AmlCompute status, use get_status()\n", "    print(compute_target.get_status().serialize())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You now have the necessary packages and compute resources to train a model in the cloud.\n", "## Use datasets directly in training\n", "\n", "### Create a TabularDataset\n", "By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred. \n", "\n", "Every workspace comes with a default [datastore](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data) (and you can register more), which is backed by the Azure blob storage account associated with the workspace. We can use it to transfer data from the local machine to the cloud and create datasets from it. We will now upload the [Iris data](./train-dataset/Iris.csv) to the default datastore (blob) within your workspace." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "datastore = ws.get_default_datastore()\n", "datastore.upload_files(files = ['./train-dataset/iris.csv'],\n", "                       target_path = 'train-dataset/tabular/',\n", "                       overwrite = True,\n", "                       show_progress = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we will create an unregistered TabularDataset pointing to the path in the datastore. You can also create a dataset from multiple paths. 
[learn more](https://aka.ms/azureml/howto/createdatasets) \n", "\n", "[TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a pandas or Spark DataFrame. You can create a TabularDataset object from .csv, .tsv, and .parquet files, and from SQL query results. For a complete list, see the [TabularDatasetFactory](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py) class." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "dataset-remarks-tabular-sample" ] }, "outputs": [], "source": [ "from azureml.core import Dataset\n", "dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'train-dataset/tabular/iris.csv')])\n", "\n", "# preview the first 3 rows of the dataset\n", "dataset.take(3).to_pandas_dataframe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a training script\n", "\n", "To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train_iris.py` in the script_folder. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "script_folder = os.path.join(os.getcwd(), 'train-dataset')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile $script_folder/train_iris.py\n", "\n", "import os\n", "\n", "from azureml.core import Dataset, Run\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.externals import joblib\n", "\n", "run = Run.get_context()\n", "# get the input dataset by name\n", "dataset = run.input_datasets['iris']\n", "\n", "df = dataset.to_pandas_dataframe()\n", "\n", "x_col = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']\n", "y_col = ['species']\n", "x_df = df.loc[:, x_col]\n", "y_df = df.loc[:, y_col]\n", "\n", "# divide X, y into train and test data\n", "x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=223)\n", "\n", "data = {'train': {'X': x_train, 'y': y_train},\n", "        'test': {'X': x_test, 'y': y_test}}\n", "\n", "clf = DecisionTreeClassifier().fit(data['train']['X'], data['train']['y'])\n", "model_file_name = 'decision_tree.pkl'\n", "\n", "print('Accuracy of Decision Tree classifier on training set: {:.2f}'.format(clf.score(x_train, y_train)))\n", "print('Accuracy of Decision Tree classifier on test set: {:.2f}'.format(clf.score(x_test, y_test)))\n", "\n", "# save the trained model to the outputs folder; this folder is automatically uploaded to the run record\n", "os.makedirs('./outputs', exist_ok=True)\n", "joblib.dump(value=clf, filename=os.path.join('outputs', model_file_name))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure and use datasets as the input to Estimator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can ask the system to build a conda environment based on your dependency specification. Once the environment is built, and as long as you don't change your dependencies, it will be reused in subsequent runs."
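, "\n", "\n", "If you also want to reuse the environment by name across experiments or notebooks, you can optionally register it in the workspace. The snippet below is a minimal sketch rather than a required step of this tutorial; it assumes the `conda_env` object created in the next cell and the workspace object `ws` from above:\n", "\n", "```python\n", "# optional: register the environment so it is versioned and can be retrieved by name later\n", "registered_env = conda_env.register(workspace=ws)\n", "\n", "# later, fetch the same environment by the name it was created with ('conda-env')\n", "fetched_env = Environment.get(workspace=ws, name='conda-env')\n", "```"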
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.core import Environment\n", "from azureml.core.conda_dependencies import CondaDependencies\n", "\n", "conda_env = Environment('conda-env')\n", "conda_env.python.conda_dependencies = CondaDependencies.create(pip_packages=['azureml-sdk',\n", "                                                                             'azureml-dataprep[pandas,fuse]',\n", "                                                                             'scikit-learn'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An estimator object is used to submit the run. Azure Machine Learning has pre-configured estimators for common machine learning frameworks, as well as a generic Estimator. Create a generic estimator by specifying:\n", "\n", "* The name of the estimator object, `est`\n", "* The directory that contains your scripts. All the files in this directory are uploaded to the cluster nodes for execution. \n", "* The training script name, `train_iris.py`\n", "* The input dataset for training\n", "* The compute target. In this case, you will use the AmlCompute you created\n", "* The environment definition for the experiment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.train.estimator import Estimator\n", "\n", "est = Estimator(source_directory=script_folder,\n", "                entry_script='train_iris.py',\n", "                # pass the dataset object as an input with the name 'iris'\n", "                inputs=[dataset.as_named_input('iris')],\n", "                compute_target=compute_target,\n", "                environment_definition=conda_env)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit job to run\n", "Submit the estimator to the Azure ML experiment to kick off the execution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "run = exp.submit(est)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.widgets import RunDetails\n", "\n", "# monitor the run\n", "RunDetails(run).show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use datasets to mount files to a remote compute\n", "\n", "You can use the `Dataset` object to mount or download the files it references. When you mount a file system, you attach that file system to a directory (mount point) and make it available to the system. Because mounting loads the files only at the time of processing, it is usually faster than downloading them.
\n", "Note: mounting is only available for Linux-based compute (DSVM/VM, AmlCompute, HDInsight)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upload data files into datastore\n", "We will first load the diabetes data from `scikit-learn` and save it to the train-dataset folder." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_diabetes\n", "import numpy as np\n", "\n", "training_data = load_diabetes()\n", "np.save(file='train-dataset/features.npy', arr=training_data['data'])\n", "np.save(file='train-dataset/labels.npy', arr=training_data['target'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's upload the two files to the default datastore under a path named `diabetes`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "datastore.upload_files(['train-dataset/features.npy', 'train-dataset/labels.npy'], target_path='diabetes', overwrite=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a FileDataset\n", "\n", "[FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) references single or multiple files in your datastores or public URLs. Using this method, you can download or mount the files to your compute as a FileDataset object. The files can be in any format, which enables a wider range of machine learning scenarios, including deep learning." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.core import Dataset\n", "\n", "dataset = Dataset.File.from_files(path = [(datastore, 'diabetes/')])\n", "\n", "# see a list of files referenced by the dataset\n", "dataset.to_path()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a training script\n", "\n", "To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train_diabetes.py` in the script_folder.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile $script_folder/train_diabetes.py\n", "\n", "import os\n", "import glob\n", "\n", "from sklearn.linear_model import Ridge\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.model_selection import train_test_split\n", "from azureml.core.run import Run\n", "from sklearn.externals import joblib\n", "\n", "import numpy as np\n", "\n", "os.makedirs('./outputs', exist_ok=True)\n", "\n", "run = Run.get_context()\n", "base_path = run.input_datasets['diabetes']\n", "\n", "X = np.load(glob.glob(os.path.join(base_path, '**/features.npy'), recursive=True)[0])\n", "y = np.load(glob.glob(os.path.join(base_path, '**/labels.npy'), recursive=True)[0])\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.2, random_state=0)\n", "data = {'train': {'X': X_train, 'y': y_train},\n", " 'test': {'X': X_test, 'y': y_test}}\n", "\n", "# list of numbers from 0.0 to 1.0 with a 0.05 interval\n", "alphas = np.arange(0.0, 1.0, 0.05)\n", "\n", "for alpha in alphas:\n", " # use Ridge algorithm to create a regression model\n", " reg = Ridge(alpha=alpha)\n", " reg.fit(data['train']['X'], data['train']['y'])\n", "\n", " preds = reg.predict(data['test']['X'])\n", " mse = mean_squared_error(preds, data['test']['y'])\n", " run.log('alpha', alpha)\n", " run.log('mse', mse)\n", "\n", " model_file_name = 'ridge_{0:.2f}.pkl'.format(alpha)\n", " with open(model_file_name, 'wb') as file:\n", " joblib.dump(value=reg, filename='outputs/' + model_file_name)\n", "\n", " print('alpha is {0:.2f}, and mse is {1:0.2f}'.format(alpha, mse))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure & Run" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.core import ScriptRunConfig\n", "\n", "src = ScriptRunConfig(source_directory=script_folder, \n", " script='train_diabetes.py', \n", " # to mount the dataset on the remote compute and pass the mounted path as an argument to the training script\n", " arguments =[dataset.as_named_input('diabetes').as_mount()])\n", "\n", "src.run_config.framework = 'python'\n", "src.run_config.environment = conda_env\n", "src.run_config.target = compute_target.name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "run = exp.submit(config=src)\n", "\n", "# monitor the run\n", "RunDetails(run).show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Display run results\n", "You now have a model trained on a remote cluster. Retrieve all the metrics logged during the run, including the accuracy of the model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(run.get_metrics())\n", "metrics = run.get_metrics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Register datasets\n", "Use the register() method to register datasets to your workspace so they can be shared with others, reused across various experiments, and referred to by name in your training script." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset = dataset.register(workspace = ws,\n", "                           name = 'diabetes dataset',\n", "                           description='training dataset',\n", "                           create_new_version=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Register models with datasets\n", "The last step in the training script wrote the model files to a directory named `outputs` on the VM of the cluster where the job ran. `outputs` is a special directory: all content in it is automatically uploaded to your workspace and appears in the run record of the experiment. Hence, the model files are now also available in your workspace.\n", "\n", "You can register models with datasets for reproducibility and auditing purposes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# find the index where the MSE is smallest\n", "indices = list(range(0, len(metrics['mse'])))\n", "min_mse_index = min(indices, key=lambda x: metrics['mse'][x])\n", "\n", "print('When alpha is {1:0.2f}, we have min MSE {0:0.2f}.'.format(\n", "    metrics['mse'][min_mse_index],\n", "    metrics['alpha'][min_mse_index]\n", "))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# find the best model\n", "best_alpha = metrics['alpha'][min_mse_index]\n", "model_file_name = 'ridge_{0:.2f}.pkl'.format(best_alpha)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# register the best model with the input dataset\n", "model = run.register_model(model_name='sklearn_diabetes', model_path=os.path.join('outputs', model_file_name),\n", "                           datasets=[('training data', dataset)])" ] } ], "metadata": { "authors": [ { "name": "sihhu" } ], "category": "tutorial", "compute": [ "Remote" ], "datasets": [ "Iris", "Diabetes" ], "deployment": [ "None" ], "exclude_from_index": false, "framework": [ "Azure ML" ], "friendly_name": "Train with Datasets (Tabular and File)", "index_order": 1, "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "star_tag": [ "featured" ], "tags": [ "Dataset", "Estimator", "ScriptRun" ], "task": "Train" }, "nbformat": 4, "nbformat_minor": 2 }