mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-20 09:37:04 -05:00
567 lines
20 KiB
Plaintext
567 lines
20 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
|
|
"\n",
|
|
"Licensed under the MIT License."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Tensorboard Integration with Run History\n",
|
|
"\n",
|
|
"1. Run a Tensorflow job locally and view its TB output live.\n",
|
|
"2. The same, for a DSVM.\n",
|
|
"3. And once more, with an AmlCompute cluster.\n",
|
|
"4. Finally, we'll collect all of these historical runs together into a single Tensorboard graph."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Prerequisites\n",
|
|
"* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning\n",
|
|
"* Go through the [configuration notebook](../../../configuration.ipynb) notebook to:\n",
|
|
" * install the AML SDK\n",
|
|
" * create a workspace and its configuration file (`config.json`)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Check core SDK version number\n",
|
|
"import azureml.core\n",
|
|
"\n",
|
|
"print(\"SDK version:\", azureml.core.VERSION)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Install the Azure ML TensorBoard package."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!pip install azureml-tensorboard"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Diagnostics\n",
|
|
"Opt-in diagnostics for better experience, quality, and security of future releases."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": [
|
|
"Diagnostics"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.telemetry import set_diagnostics_collection\n",
|
|
"\n",
|
|
"set_diagnostics_collection(send_diagnostics=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Initialize Workspace\n",
|
|
"\n",
|
|
"Initialize a workspace object from persisted configuration."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core import Workspace\n",
|
|
"\n",
|
|
"ws = Workspace.from_config()\n",
|
|
"print('Workspace name: ' + ws.name, \n",
|
|
" 'Azure region: ' + ws.location, \n",
|
|
" 'Subscription id: ' + ws.subscription_id, \n",
|
|
" 'Resource group: ' + ws.resource_group, sep='\\n')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Set experiment name and create project\n",
|
|
"Choose a name for your run history container in the workspace, and create a folder for the project."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from os import path, makedirs\n",
|
|
"experiment_name = 'tensorboard-demo'\n",
|
|
"\n",
|
|
"# experiment folder\n",
|
|
"exp_dir = './sample_projects/' + experiment_name\n",
|
|
"\n",
|
|
"if not path.exists(exp_dir):\n",
|
|
" makedirs(exp_dir)\n",
|
|
"\n",
|
|
"# runs we started in this session, for the finale\n",
|
|
"runs = []"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Download Tensorflow Tensorboard demo code\n",
|
|
"\n",
|
|
"Tensorflow's repository has an MNIST demo with extensive Tensorboard instrumentation. We'll use it here for our purposes.\n",
|
|
"\n",
|
|
"Note that we don't need to make any code changes at all - the code works without modification from the Tensorflow repository."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import requests\n",
|
|
"import os\n",
|
|
"\n",
|
|
"tf_code = requests.get(\"https://raw.githubusercontent.com/tensorflow/tensorflow/r1.8/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py\")\n",
|
|
"with open(os.path.join(exp_dir, \"mnist_with_summaries.py\"), \"w\") as file:\n",
|
|
" file.write(tf_code.text)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Configure and run locally\n",
|
|
"\n",
|
|
"We'll start by running this locally. While it might not initially seem that useful to use this for a local run - why not just run TB against the files generated locally? - even in this case there is some value to using this feature. Your local run will be registered in the run history, and your Tensorboard logs will be uploaded to the artifact store associated with this run. Later, you'll be able to restore the logs from any run, regardless of where it happened.\n",
|
|
"\n",
|
|
"Note that for this run, you will need to install Tensorflow on your local machine by yourself. Further, the Tensorboard module (that is, the one included with Tensorflow) must be accessible to this notebook's kernel, as the local machine is what runs Tensorboard."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.runconfig import RunConfiguration\n",
|
|
"\n",
|
|
"# Create a run configuration.\n",
|
|
"run_config = RunConfiguration()\n",
|
|
"run_config.environment.python.user_managed_dependencies = True\n",
|
|
"\n",
|
|
"# You can choose a specific Python environment by pointing to a Python path \n",
|
|
"#run_config.environment.python.interpreter_path = '/home/ninghai/miniconda3/envs/sdk2/bin/python'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core import Experiment\n",
|
|
"from azureml.core.script_run_config import ScriptRunConfig\n",
|
|
"\n",
|
|
"logs_dir = os.path.join(os.curdir, \"logs\")\n",
|
|
"data_dir = os.path.abspath(os.path.join(os.curdir, \"mnist_data\"))\n",
|
|
"\n",
|
|
"if not path.exists(data_dir):\n",
|
|
" makedirs(data_dir)\n",
|
|
"\n",
|
|
"os.environ[\"TEST_TMPDIR\"] = data_dir\n",
|
|
"\n",
|
|
"# Writing logs to ./logs results in their being uploaded to Artifact Service,\n",
|
|
"# and thus, made accessible to our Tensorboard instance.\n",
|
|
"arguments_list = [\"--log_dir\", logs_dir]\n",
|
|
"\n",
|
|
"# Create an experiment\n",
|
|
"exp = Experiment(ws, experiment_name)\n",
|
|
"\n",
|
|
"# If you would like the run to go for longer, add --max_steps 5000 to the arguments list:\n",
|
|
"# arguments_list += [\"--max_steps\", \"5000\"]\n",
|
|
"\n",
|
|
"script = ScriptRunConfig(exp_dir,\n",
|
|
" script=\"mnist_with_summaries.py\",\n",
|
|
" run_config=run_config,\n",
|
|
" arguments=arguments_list)\n",
|
|
"\n",
|
|
"run = exp.submit(script)\n",
|
|
"# You can also wait for the run to complete\n",
|
|
"# run.wait_for_completion(show_output=True)\n",
|
|
"runs.append(run)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Start Tensorboard\n",
|
|
"\n",
|
|
"Now, while the run is in progress, we just need to start Tensorboard with the run as its target, and it will begin streaming logs."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.tensorboard import Tensorboard\n",
|
|
"\n",
|
|
"# The Tensorboard constructor takes an array of runs, so be sure and pass it in as a single-element array here\n",
|
|
"tb = Tensorboard([run])\n",
|
|
"\n",
|
|
"# If successful, start() returns a string with the URI of the instance.\n",
|
|
"tb.start()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Stop Tensorboard\n",
|
|
"\n",
|
|
"When you're done, make sure to call the `stop()` method of the Tensorboard object, or it will stay running even after your job completes."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"tb.stop()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Now, with a DSVM\n",
|
|
"\n",
|
|
"Tensorboard uploading works with all compute targets. Here we demonstrate it from a DSVM.\n",
|
|
"Note that the Tensorboard instance itself will be run by the notebook kernel. Again, this means this notebook's kernel must have access to the Tensorboard module.\n",
|
|
"\n",
|
|
"If you are unfamiliar with DSVM configuration, check [Train in a remote VM](../../training/train-on-remote-vm/train-on-remote-vm.ipynb) for a more detailed breakdown.\n",
|
|
"\n",
|
|
"**Note**: To streamline the compute that Azure Machine Learning creates, we are making updates to support creating only single to multi-node `AmlCompute`. The `DSVMCompute` class will be deprecated in a later release, but the DSVM can be created using the below single line command and then attached(like any VM) using the sample code below. Also note, that we only support Linux VMs for remote execution from AML and the commands below will spin a Linux VM only.\n",
|
|
"\n",
|
|
"```shell\n",
|
|
"# create a DSVM in your resource group\n",
|
|
"# note you need to be at least a contributor to the resource group in order to execute this command successfully.\n",
|
|
"(myenv) $ az vm create --resource-group <resource_group_name> --name <some_vm_name> --image microsoft-dsvm:linux-data-science-vm-ubuntu:linuxdsvmubuntu:latest --admin-username <username> --admin-password <password> --generate-ssh-keys --authentication-type password\n",
|
|
"```\n",
|
|
"You can also use [this url](https://portal.azure.com/#create/microsoft-dsvm.linux-data-science-vm-ubuntulinuxdsvmubuntu) to create the VM using the Azure Portal."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.compute import ComputeTarget, RemoteCompute\n",
|
|
"from azureml.core.compute_target import ComputeTargetException\n",
|
|
"\n",
|
|
"username = os.getenv('AZUREML_DSVM_USERNAME', default='<my_username>')\n",
|
|
"address = os.getenv('AZUREML_DSVM_ADDRESS', default='<ip_address_or_fqdn>')\n",
|
|
"\n",
|
|
"compute_target_name = 'cpudsvm'\n",
|
|
"# if you want to connect using SSH key instead of username/password you can provide parameters private_key_file and private_key_passphrase \n",
|
|
"try:\n",
|
|
" attached_dsvm_compute = RemoteCompute(workspace=ws, name=compute_target_name)\n",
|
|
" print('found existing:', attached_dsvm_compute.name)\n",
|
|
"except ComputeTargetException:\n",
|
|
" config = RemoteCompute.attach_configuration(username=username,\n",
|
|
" address=address,\n",
|
|
" ssh_port=22,\n",
|
|
" private_key_file='./.ssh/id_rsa')\n",
|
|
" attached_dsvm_compute = ComputeTarget.attach(ws, compute_target_name, config)\n",
|
|
" \n",
|
|
" attached_dsvm_compute.wait_for_completion(show_output=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Submit run using TensorFlow estimator\n",
|
|
"\n",
|
|
"Instead of manually configuring the DSVM environment, we can use the TensorFlow estimator and everything is set up automatically."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.train.dnn import TensorFlow\n",
|
|
"\n",
|
|
"script_params = {\"--log_dir\": \"./logs\"}\n",
|
|
"\n",
|
|
"# If you want the run to go longer, set --max-steps to a higher number.\n",
|
|
"# script_params[\"--max_steps\"] = \"5000\"\n",
|
|
"\n",
|
|
"tf_estimator = TensorFlow(source_directory=exp_dir,\n",
|
|
" compute_target=attached_dsvm_compute,\n",
|
|
" entry_script='mnist_with_summaries.py',\n",
|
|
" script_params=script_params)\n",
|
|
"\n",
|
|
"run = exp.submit(tf_estimator)\n",
|
|
"\n",
|
|
"runs.append(run)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Start Tensorboard with this run\n",
|
|
"\n",
|
|
"Just like before."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# The Tensorboard constructor takes an array of runs, so be sure and pass it in as a single-element array here\n",
|
|
"tb = Tensorboard([run])\n",
|
|
"\n",
|
|
"# If successful, start() returns a string with the URI of the instance.\n",
|
|
"tb.start()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Stop Tensorboard\n",
|
|
"\n",
|
|
"When you're done, make sure to call the `stop()` method of the Tensorboard object, or it will stay running even after your job completes."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"tb.stop()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Once more, with an AmlCompute cluster\n",
|
|
"\n",
|
|
"Just to prove we can, let's create an AmlCompute CPU cluster, and run our demo there, as well."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
|
|
"\n",
|
|
"# choose a name for your cluster\n",
|
|
"cluster_name = \"cpucluster\"\n",
|
|
"\n",
|
|
"cts = ws.compute_targets\n",
|
|
"found = False\n",
|
|
"if cluster_name in cts and cts[cluster_name].type == 'AmlCompute':\n",
|
|
" found = True\n",
|
|
" print('Found existing compute target.')\n",
|
|
" compute_target = cts[cluster_name]\n",
|
|
"if not found:\n",
|
|
" print('Creating a new compute target...')\n",
|
|
" compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', \n",
|
|
" max_nodes=4)\n",
|
|
"\n",
|
|
" # create the cluster\n",
|
|
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
|
|
"\n",
|
|
"compute_target.wait_for_completion(show_output=True, min_node_count=None)\n",
|
|
"\n",
|
|
"# use get_status() to get a detailed status for the current cluster. \n",
|
|
"# print(compute_target.get_status().serialize())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Submit run using TensorFlow estimator\n",
|
|
"\n",
|
|
"Again, we can use the TensorFlow estimator and everything is set up automatically."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"script_params = {\"--log_dir\": \"./logs\"}\n",
|
|
"\n",
|
|
"# If you want the run to go longer, set --max-steps to a higher number.\n",
|
|
"# script_params[\"--max_steps\"] = \"5000\"\n",
|
|
"\n",
|
|
"tf_estimator = TensorFlow(source_directory=exp_dir,\n",
|
|
" compute_target=compute_target,\n",
|
|
" entry_script='mnist_with_summaries.py',\n",
|
|
" script_params=script_params)\n",
|
|
"\n",
|
|
"run = exp.submit(tf_estimator)\n",
|
|
"\n",
|
|
"runs.append(run)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Start Tensorboard with this run\n",
|
|
"\n",
|
|
"Once more..."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# The Tensorboard constructor takes an array of runs, so be sure and pass it in as a single-element array here\n",
|
|
"tb = Tensorboard([run])\n",
|
|
"\n",
|
|
"# If successful, start() returns a string with the URI of the instance.\n",
|
|
"tb.start()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Stop Tensorboard\n",
|
|
"\n",
|
|
"When you're done, make sure to call the `stop()` method of the Tensorboard object, or it will stay running even after your job completes."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"tb.stop()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Finale\n",
|
|
"\n",
|
|
"If you've paid close attention, you'll have noticed that we've been saving the run objects in an array as we went along. We can start a Tensorboard instance that combines all of these run objects into a single process. This way, you can compare historical runs. You can even do this with live runs; if you made some of those previous runs longer via the `--max_steps` parameter, they might still be running, and you'll see them live in this instance as well."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# The Tensorboard constructor takes an array of runs...\n",
|
|
"# and it turns out that we have been building one of those all along.\n",
|
|
"tb = Tensorboard(runs)\n",
|
|
"\n",
|
|
"# If successful, start() returns a string with the URI of the instance.\n",
|
|
"tb.start()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Stop Tensorboard\n",
|
|
"\n",
|
|
"As you might already know, make sure to call the `stop()` method of the Tensorboard object, or it will stay running (until you kill the kernel associated with this notebook, at least)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"tb.stop()"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "roastala"
|
|
}
|
|
],
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.6"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
} |