mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-21 10:05:09 -05:00
331 lines
9.6 KiB
Plaintext
331 lines
9.6 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
|
|
"\n",
|
|
"Licensed under the MIT License."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# 05. Train in Spark\n",
|
|
"* Create Workspace\n",
|
|
"* Create Experiment\n",
|
|
"* Copy relevant files to the script folder\n",
|
|
"* Configure and Run"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Prerequisites\n",
|
|
"Make sure you go through the [00. Installation and Configuration](00.configuration.ipynb) Notebook first if you haven't."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Check core SDK version number\n",
|
|
"import azureml.core\n",
|
|
"\n",
|
|
"print(\"SDK version:\", azureml.core.VERSION)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Initialize Workspace\n",
|
|
"\n",
|
|
"Initialize a workspace object from persisted configuration."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core import Workspace\n",
|
|
"\n",
|
|
"ws = Workspace.from_config()\n",
|
|
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Create Experiment\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"experiment_name = 'train-on-spark'\n",
|
|
"\n",
|
|
"from azureml.core import Experiment\n",
|
|
"exp = Experiment(workspace=ws, name=experiment_name)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## View `train-spark.py`\n",
|
|
"\n",
|
|
"For convenience, we created a training script for you. It is printed below as a text, but you can also run `%pfile ./train-spark.py` in a cell to show the file."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"with open('train-spark.py', 'r') as training_script:\n",
|
|
" print(training_script.read())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Configure & Run"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Configure an ACI run\n",
|
|
"Before you try running on an actual Spark cluster, you can use a Docker image with Spark already baked in, and run it in ACI(Azure Container Registry)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.runconfig import RunConfiguration\n",
|
|
"from azureml.core.conda_dependencies import CondaDependencies\n",
|
|
"\n",
|
|
"# use pyspark framework\n",
|
|
"aci_run_config = RunConfiguration(framework=\"pyspark\")\n",
|
|
"\n",
|
|
"# use ACI to run the Spark job\n",
|
|
"aci_run_config.target = 'containerinstance'\n",
|
|
"aci_run_config.container_instance.region = 'eastus2'\n",
|
|
"aci_run_config.container_instance.cpu_cores = 1\n",
|
|
"aci_run_config.container_instance.memory_gb = 2\n",
|
|
"\n",
|
|
"# specify base Docker image to use\n",
|
|
"aci_run_config.environment.docker.enabled = True\n",
|
|
"aci_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_MMLSPARK_CPU_IMAGE\n",
|
|
"\n",
|
|
"# specify CondaDependencies\n",
|
|
"cd = CondaDependencies()\n",
|
|
"cd.add_conda_package('numpy')\n",
|
|
"aci_run_config.environment.python.conda_dependencies = cd"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Submit script to ACI to run"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core import ScriptRunConfig\n",
|
|
"\n",
|
|
"script_run_config = ScriptRunConfig(source_directory = '.',\n",
|
|
" script= 'train-spark.py',\n",
|
|
" run_config = aci_run_config)\n",
|
|
"run = exp.submit(script_run_config)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"run"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"run.wait_for_completion(show_output=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Note** you can also create a new VM, or attach an existing VM, and use Docker-based execution to run the Spark job. Please see the `04.train-in-vm` for example on how to configure and run in Docker mode in a VM."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Attach an HDI cluster\n",
|
|
"Now we can use a real Spark cluster, HDInsight for Spark, to run this job. To use HDI commpute target:\n",
|
|
" 1. Create a Spark for HDI cluster in Azure. Here are some [quick instructions](https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql). Make sure you use the Ubuntu flavor, NOT CentOS.\n",
|
|
" 2. Enter the IP address, username and password below"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.compute import HDInsightCompute\n",
|
|
"from azureml.exceptions import ComputeTargetException\n",
|
|
"\n",
|
|
"try:\n",
|
|
" # if you want to connect using SSH key instead of username/password you can provide parameters private_key_file and private_key_passphrase\n",
|
|
" hdi_compute = HDInsightCompute.attach(workspace=ws, \n",
|
|
" name=\"myhdi\", \n",
|
|
" address=\"<myhdi-ssh>.azurehdinsight.net\", \n",
|
|
" ssh_port=22, \n",
|
|
" username='<ssh-username>', \n",
|
|
" password='<ssh-pwd>')\n",
|
|
"\n",
|
|
"except ComputeTargetException as e:\n",
|
|
" print(\"Caught = {}\".format(e.message))\n",
|
|
" \n",
|
|
" \n",
|
|
"hdi_compute.wait_for_completion(show_output=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Configure HDI run"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core.runconfig import RunConfiguration\n",
|
|
"from azureml.core.conda_dependencies import CondaDependencies\n",
|
|
"\n",
|
|
"\n",
|
|
"# use pyspark framework\n",
|
|
"hdi_run_config = RunConfiguration(framework=\"pyspark\")\n",
|
|
"\n",
|
|
"# Set compute target to the HDI cluster\n",
|
|
"hdi_run_config.target = hdi_compute.name\n",
|
|
"\n",
|
|
"# specify CondaDependencies object to ask system installing numpy\n",
|
|
"cd = CondaDependencies()\n",
|
|
"cd.add_conda_package('numpy')\n",
|
|
"hdi_run_config.environment.python.conda_dependencies = cd"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Submit the script to HDI"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core import ScriptRunConfig\n",
|
|
"\n",
|
|
"script_run_config = ScriptRunConfig(source_directory = '.',\n",
|
|
" script= 'train-spark.py',\n",
|
|
" run_config = hdi_run_config)\n",
|
|
"run = exp.submit(config=script_run_config)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# get the URL of the run history web page\n",
|
|
"run"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# get all metris logged in the run\n",
|
|
"metrics = run.get_metrics()\n",
|
|
"print(metrics)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "aashishb"
|
|
}
|
|
],
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.6"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
} |