Files
MachineLearningNotebooks/tutorials/image-classification-mnist-data/img-classification-part1-training.ipynb

695 lines
28 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial #1: Train an image classification model with Azure Machine Learning\n",
"\n",
"In this tutorial, you train a machine learning model on remote compute resources. You'll use the training and deployment workflow for Azure Machine Learning service (preview) in a Python Jupyter notebook. You can then use the notebook as a template to train your own machine learning model with your own data. This tutorial is **part one of a two-part tutorial series**. \n",
"\n",
"This tutorial trains a simple logistic regression using the [MNIST](https://docs.microsoft.com/azure/open-datasets/dataset-mnist) dataset and [scikit-learn](http://scikit-learn.org) with Azure Machine Learning. MNIST is a popular dataset consisting of 70,000 grayscale images. Each image is a handwritten digit of 28x28 pixels, representing a number from 0 to 9. The goal is to create a multi-class classifier to identify the digit a given image represents. \n",
"\n",
"Learn how to:\n",
"\n",
"> * Set up your development environment\n",
"> * Access and examine the data\n",
"> * Train a simple logistic regression model on a remote cluster\n",
"> * Review training results, find and register the best model\n",
"\n",
"You'll learn how to select a model and deploy it in [part two of this tutorial](deploy-models.ipynb) later. \n",
"\n",
"## Prerequisites\n",
"\n",
"See prerequisites in the [Azure Machine Learning documentation](https://docs.microsoft.com/azure/machine-learning/service/tutorial-train-models-with-aml#prerequisites).\n",
"\n",
"On the computer running this notebook, conda install matplotlib, numpy, scikit-learn=1.5.1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up your development environment\n",
"\n",
"All the setup for your development work can be accomplished in a Python notebook. Setup includes:\n",
"\n",
"* Importing Python packages\n",
"* Connecting to a workspace to enable communication between your local computer and remote resources\n",
"* Creating an experiment to track all your runs\n",
"* Creating a remote compute target to use for training\n",
"\n",
"### Import packages\n",
"\n",
"Import Python packages you need in this session. Also display the Azure Machine Learning SDK version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"check version"
]
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import azureml.core\n",
"from azureml.core import Workspace\n",
"\n",
"# check core SDK version number\n",
"print(\"Azure ML SDK Version: \", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Connect to workspace\n",
"\n",
"Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **config.json** and loads the details into an object named `ws`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"load workspace"
]
},
"outputs": [],
"source": [
"# load workspace configuration from the config.json file in the current folder.\n",
"ws = Workspace.from_config()\n",
"print(ws.name, ws.location, ws.resource_group, sep='\\t')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create experiment\n",
"\n",
"Create an experiment to track the runs in your workspace. A workspace can have muliple experiments. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"create experiment"
]
},
"outputs": [],
"source": [
"experiment_name = 'Tutorial-sklearn-mnist'\n",
"\n",
"from azureml.core import Experiment\n",
"exp = Experiment(workspace=ws, name=experiment_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create or Attach existing compute resource\n",
"\n",
"> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.\n",
"\n",
"By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as your training environment. You will submit Python code to run on this VM later in the tutorial. \n",
"The code below creates the compute clusters for you if they don't already exist in your workspace.\n",
"\n",
"**Creation of compute takes approximately 5 minutes.** If the AmlCompute with that name is already in your workspace the code will skip the creation process."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"create mlc",
"amlcompute"
]
},
"outputs": [],
"source": [
"from azureml.core.compute import AmlCompute\n",
"from azureml.core.compute import ComputeTarget\n",
"import os\n",
"\n",
"# choose a name for your cluster\n",
"compute_name = os.environ.get(\"AML_COMPUTE_CLUSTER_NAME\", \"cpu-cluster\")\n",
"compute_min_nodes = os.environ.get(\"AML_COMPUTE_CLUSTER_MIN_NODES\", 0)\n",
"compute_max_nodes = os.environ.get(\"AML_COMPUTE_CLUSTER_MAX_NODES\", 4)\n",
"\n",
"# This example uses CPU VM. For using GPU VM, set SKU to Standard_NC6s_v3\n",
"vm_size = os.environ.get(\"AML_COMPUTE_CLUSTER_SKU\", \"STANDARD_D2_V2\")\n",
"\n",
"\n",
"if compute_name in ws.compute_targets:\n",
" compute_target = ws.compute_targets[compute_name]\n",
" if compute_target and type(compute_target) is AmlCompute:\n",
" print(\"found compute target: \" + compute_name)\n",
"else:\n",
" print(\"creating new compute target...\")\n",
" provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,\n",
" min_nodes = compute_min_nodes, \n",
" max_nodes = compute_max_nodes)\n",
"\n",
" # create the cluster\n",
" compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)\n",
" \n",
" # can poll for a minimum number of nodes and for a specific timeout. \n",
" # if no min node count is provided it will use the scale settings for the cluster\n",
" compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
" \n",
" # For a more detailed view of current AmlCompute status, use get_status()\n",
" print(compute_target.get_status().serialize())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You now have the necessary packages and compute resources to train a model in the cloud. \n",
"\n",
"## Explore data\n",
"\n",
"Before you train a model, you need to understand the data that you are using to train it. In this section you learn how to:\n",
"\n",
"* Download the MNIST dataset\n",
"* Display some sample images\n",
"\n",
"### Download the MNIST dataset\n",
"\n",
"Use Azure Open Datasets to get the raw MNIST data files. [Azure Open Datasets](https://docs.microsoft.com/azure/open-datasets/overview-what-are-open-datasets) are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. Each dataset has a corrseponding class, `MNIST` in this case, to retrieve the data in different ways.\n",
"\n",
"This code retrieves the data as a `FileDataset` object, which is a subclass of `Dataset`. A `FileDataset` references single or multiple files of any format in your datastores or public urls. The class provides you with the ability to download or mount the files to your compute by creating a reference to the data source location. Additionally, you register the Dataset to your workspace for easy retrieval during training.\n",
"\n",
"Follow the [how-to](https://aka.ms/azureml/howto/createdatasets) to learn more about Datasets and their usage in the SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Dataset\n",
"from azureml.opendatasets import MNIST\n",
"\n",
"data_folder = os.path.join(os.getcwd(), 'data')\n",
"os.makedirs(data_folder, exist_ok=True)\n",
"\n",
"mnist_file_dataset = MNIST.get_file_dataset()\n",
"mnist_file_dataset.download(data_folder, overwrite=True)\n",
"\n",
"mnist_file_dataset = mnist_file_dataset.register(workspace=ws,\n",
" name='mnist_opendataset',\n",
" description='training and test dataset',\n",
" create_new_version=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Display some sample images\n",
"\n",
"Load the compressed files into `numpy` arrays. Then use `matplotlib` to plot 30 random images from the dataset with their labels above them. Note this step requires a `load_data` function that's included in an `utils.py` file. This file is included in the sample folder. Please make sure it is placed in the same folder as this notebook. The `load_data` function simply parses the compresse files into numpy arrays."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# make sure utils.py is in the same directory as this code\n",
"from utils import load_data\n",
"import glob\n",
"\n",
"\n",
"# note we also shrink the intensity values (X) from 0-255 to 0-1. This helps the model converge faster.\n",
"X_train = load_data(glob.glob(os.path.join(data_folder,\"**/train-images-idx3-ubyte.gz\"), recursive=True)[0], False) / 255.0\n",
"X_test = load_data(glob.glob(os.path.join(data_folder,\"**/t10k-images-idx3-ubyte.gz\"), recursive=True)[0], False) / 255.0\n",
"y_train = load_data(glob.glob(os.path.join(data_folder,\"**/train-labels-idx1-ubyte.gz\"), recursive=True)[0], True).reshape(-1)\n",
"y_test = load_data(glob.glob(os.path.join(data_folder,\"**/t10k-labels-idx1-ubyte.gz\"), recursive=True)[0], True).reshape(-1)\n",
"\n",
"\n",
"# now let's show some randomly chosen images from the traininng set.\n",
"count = 0\n",
"sample_size = 30\n",
"plt.figure(figsize = (16, 6))\n",
"for i in np.random.permutation(X_train.shape[0])[:sample_size]:\n",
" count = count + 1\n",
" plt.subplot(1, sample_size, count)\n",
" plt.axhline('')\n",
" plt.axvline('')\n",
" plt.text(x=10, y=-10, s=y_train[i], fontsize=18)\n",
" plt.imshow(X_train[i].reshape(28, 28), cmap=plt.cm.Greys)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train on a remote cluster\n",
"\n",
"For this task, you submit the job to run on the remote training cluster you set up earlier. To submit a job you:\n",
"* Create a directory\n",
"* Create a training script\n",
"* Create a script run configuration\n",
"* Submit the job \n",
"\n",
"### Create a directory\n",
"\n",
"Create a directory to deliver the necessary code from your computer to the remote resource."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"script_folder = os.path.join(os.getcwd(), \"sklearn-mnist\")\n",
"os.makedirs(script_folder, exist_ok=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a training script\n",
"\n",
"To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train.py` in the directory you just created. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile $script_folder/train.py\n",
"\n",
"import argparse\n",
"import os\n",
"import numpy as np\n",
"import glob\n",
"\n",
"from sklearn.linear_model import LogisticRegression\n",
"import joblib\n",
"\n",
"from azureml.core import Run\n",
"from utils import load_data\n",
"\n",
"# let user feed in 2 parameters, the dataset to mount or download, and the regularization rate of the logistic regression model\n",
"parser = argparse.ArgumentParser()\n",
"parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')\n",
"parser.add_argument('--regularization', type=float, dest='reg', default=0.01, help='regularization rate')\n",
"args = parser.parse_args()\n",
"\n",
"data_folder = args.data_folder\n",
"print('Data folder:', data_folder)\n",
"\n",
"# load train and test set into numpy arrays\n",
"# note we scale the pixel intensity values to 0-1 (by dividing it with 255.0) so the model can converge faster.\n",
"X_train = load_data(glob.glob(os.path.join(data_folder, '**/train-images-idx3-ubyte.gz'), recursive=True)[0], False) / 255.0\n",
"X_test = load_data(glob.glob(os.path.join(data_folder, '**/t10k-images-idx3-ubyte.gz'), recursive=True)[0], False) / 255.0\n",
"y_train = load_data(glob.glob(os.path.join(data_folder, '**/train-labels-idx1-ubyte.gz'), recursive=True)[0], True).reshape(-1)\n",
"y_test = load_data(glob.glob(os.path.join(data_folder, '**/t10k-labels-idx1-ubyte.gz'), recursive=True)[0], True).reshape(-1)\n",
"\n",
"print(X_train.shape, y_train.shape, X_test.shape, y_test.shape, sep = '\\n')\n",
"\n",
"# get hold of the current run\n",
"run = Run.get_context()\n",
"\n",
"print('Train a logistic regression model with regularization rate of', args.reg)\n",
"clf = LogisticRegression(C=1.0/args.reg, solver=\"liblinear\", multi_class=\"auto\", random_state=42)\n",
"clf.fit(X_train, y_train)\n",
"\n",
"print('Predict the test set')\n",
"y_hat = clf.predict(X_test)\n",
"\n",
"# calculate accuracy on the prediction\n",
"acc = np.average(y_hat == y_test)\n",
"print('Accuracy is', acc)\n",
"\n",
"run.log('regularization rate', np.float(args.reg))\n",
"run.log('accuracy', np.float(acc))\n",
"\n",
"os.makedirs('outputs', exist_ok=True)\n",
"# note file saved in the outputs folder is automatically uploaded into experiment record\n",
"joblib.dump(value=clf, filename='outputs/sklearn_mnist_model.pkl')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how the script gets data and saves models:\n",
"\n",
"+ The training script reads an argument to find the directory containing the data. When you submit the job later, you point to the dataset for this argument:\n",
"`parser.add_argument('--data-folder', type=str, dest='data_folder', help='data directory mounting point')`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"+ The training script saves your model into a directory named outputs. <br/>\n",
"`joblib.dump(value=clf, filename='outputs/sklearn_mnist_model.pkl')`<br/>\n",
"Anything written in this directory is automatically uploaded into your workspace. You'll access your model from this directory later in the tutorial."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The file `utils.py` is referenced from the training script to load the dataset correctly. Copy this script into the script folder so that it can be accessed along with the training script on the remote resource."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import shutil\n",
"shutil.copy('utils.py', script_folder)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure the training job\n",
"\n",
"Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on. Configure the ScriptRunConfig by specifying:\n",
"\n",
"* The directory that contains your scripts. All the files in this directory are uploaded into the cluster nodes for execution. \n",
"* The compute target. In this case you will use the AmlCompute you created\n",
"* The training script name, train.py\n",
"* An environment that contains the libraries needed to run the script\n",
"* Arguments required from the training script. \n",
"\n",
"In this tutorial, the target is AmlCompute. All files in the script folder are uploaded into the cluster nodes for execution. The data_folder is set to use the dataset.\n",
"\n",
"First, create the environment that contains: the scikit-learn library, azureml-dataset-runtime required for accessing the dataset, and azureml-defaults which contains the dependencies for logging metrics. The azureml-defaults also contains the dependencies required for deploying the model as a web service later in the part 2 of the tutorial.\n",
"\n",
"Once the environment is defined, register it with the Workspace to re-use it in part 2 of the tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.environment import Environment\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"# to install required packages\n",
"env = Environment('tutorial-env')\n",
"cd = CondaDependencies.create(pip_packages=['azureml-dataset-runtime[pandas,fuse]', 'azureml-defaults'], conda_packages = ['scikit-learn==1.5.1', 'numpy==1.23.5'])\n",
"\n",
"env.python.conda_dependencies = cd\n",
"\n",
"# Register environment to re-use later\n",
"env.register(workspace = ws)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, create the ScriptRunConfig by specifying the training script, compute target and environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"configure estimator"
]
},
"outputs": [],
"source": [
"from azureml.core import ScriptRunConfig\n",
"\n",
"args = ['--data-folder', mnist_file_dataset.as_mount(), '--regularization', 0.5]\n",
"\n",
"src = ScriptRunConfig(source_directory=script_folder,\n",
" script='train.py', \n",
" arguments=args,\n",
" compute_target=compute_target,\n",
" environment=env)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit the job to the cluster\n",
"\n",
"Run the experiment by submitting the ScriptRunConfig object. And you can navigate to Azure portal to monitor the run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remote run",
"amlcompute",
"scikit-learn"
]
},
"outputs": [],
"source": [
"run = exp.submit(config=src)\n",
"run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the call is asynchronous, it returns a **Preparing** or **Running** state as soon as the job is started.\n",
"\n",
"## Monitor a remote run\n",
"\n",
"In total, the first run takes **approximately 10 minutes**. But for subsequent runs, as long as the dependencies in the Azure ML environment don't change, the same image is reused and hence the container start up time is much faster.\n",
"\n",
"Here is what's happening while you wait:\n",
"\n",
"- **Image creation**: A Docker image is created matching the Python environment specified by the Azure ML environment. The image is built and stored in the ACR (Azure Container Registry) associated with your workspace. Image creation and uploading takes **about 5 minutes**. \n",
"\n",
" This stage happens once for each Python environment since the container is cached for subsequent runs. During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs.\n",
"\n",
"- **Scaling**: If the remote cluster requires more nodes to execute the run than currently available, additional nodes are added automatically. Scaling typically takes **about 5 minutes.**\n",
"\n",
"- **Running**: In this stage, the necessary scripts and files are sent to the compute target, then data stores are mounted/copied, then the entry_script is run. While the job is running, stdout and the files in the ./logs directory are streamed to the run history. You can monitor the run's progress using these logs.\n",
"\n",
"- **Post-Processing**: The ./outputs directory of the run is copied over to the run history in your workspace so you can access these results.\n",
"\n",
"\n",
"You can check the progress of a running job in multiple ways. This tutorial uses a Jupyter widget as well as a `wait_for_completion` method. \n",
"\n",
"### Jupyter widget\n",
"\n",
"Watch the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"use notebook widget"
]
},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"RunDetails(run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By the way, if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get log results upon completion\n",
"\n",
"Model training happens in the background. You can use `wait_for_completion` to block and wait until the model has completed training before running more code. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remote run",
"amlcompute",
"scikit-learn"
]
},
"outputs": [],
"source": [
"# specify show_output to True for a verbose log\n",
"run.wait_for_completion(show_output=True) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Display run results\n",
"\n",
"You now have a model trained on a remote cluster. Retrieve all the metrics logged during the run, including the accuracy of the model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"get metrics"
]
},
"outputs": [],
"source": [
"print(run.get_metrics())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the next tutorial you will explore this model in more detail.\n",
"\n",
"## Register model\n",
"\n",
"The last step in the training script wrote the file `outputs/sklearn_mnist_model.pkl` in a directory named `outputs` in the VM of the cluster where the job is executed. `outputs` is a special directory in that all content in this directory is automatically uploaded to your workspace. This content appears in the run record in the experiment under your workspace. Hence, the model file is now also available in your workspace.\n",
"\n",
"You can see files associated with that run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"query history"
]
},
"outputs": [],
"source": [
"print(run.get_file_names())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Register the model in the workspace so that you (or other collaborators) can later query, examine, and deploy this model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"register model from history"
]
},
"outputs": [],
"source": [
"# register model \n",
"model = run.register_model(model_name='sklearn_mnist', model_path='outputs/sklearn_mnist_model.pkl')\n",
"print(model.name, model.id, model.version, sep='\\t')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"In this Azure Machine Learning tutorial, you used Python to:\n",
"\n",
"> * Set up your development environment\n",
"> * Access and examine the data\n",
"> * Train multiple models on a remote cluster using the popular scikit-learn machine learning library\n",
"> * Review training details and register the best model\n",
"\n",
"You are ready to deploy this registered model using the instructions in the next part of the tutorial series:\n",
"\n",
"> [Tutorial 2 - Deploy models](img-classification-part2-deploy.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/tutorials/img-classification-part1-training.png)"
]
}
],
"metadata": {
"authors": [
{
"name": "maxluk"
}
],
"kernelspec": {
"display_name": "Python 3.8 - AzureML",
"language": "python",
"name": "python38-azureml"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
},
"msauthor": "roastala",
"network_required": false
},
"nbformat": 4,
"nbformat_minor": 2
}