{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright (c) Microsoft Corporation. All rights reserved.\n", "\n", "Licensed under the MIT License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/training/train-in-spark/train-in-spark.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 05. Train in Spark\n", "* Create Workspace\n", "* Create Experiment\n", "* Copy relevant files to the script folder\n", "* Configure and Run" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration](../../../configuration.ipynb) Notebook first if you haven't already to establish your connection to the AzureML Workspace." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check core SDK version number\n", "import azureml.core\n", "\n", "print(\"SDK version:\", azureml.core.VERSION)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initialize Workspace\n", "\n", "Initialize a workspace object from persisted configuration." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.core import Workspace\n", "\n", "ws = Workspace.from_config()\n", "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Experiment\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "experiment_name = 'train-on-spark'\n", "\n", "from azureml.core import Experiment\n", "exp = Experiment(workspace=ws, name=experiment_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## View `train-spark.py`\n", "\n", "For convenience, we created a training script for you. It is printed below as a text, but you can also run `%pfile ./train-spark.py` in a cell to show the file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('train-spark.py', 'r') as training_script:\n", " print(training_script.read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure & Run" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note** You can use Docker-based execution to run the Spark job in local computer or a remote VM. Please see the `train-in-remote-vm` notebook for example on how to configure and run in Docker mode in a VM. Make sure you choose a Docker image that has Spark installed, such as `microsoft/mmlspark:0.12`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Attach an HDI cluster\n", "Here we will use a actual Spark cluster, HDInsight for Spark, to run this job. To use HDI commpute target:\n", " 1. Create a Spark for HDI cluster in Azure. Here are some [quick instructions](https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql). Make sure you use the Ubuntu flavor, NOT CentOS.\n", " 2. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Attach an HDI cluster\n", "Here we will use an actual Spark cluster, HDInsight for Spark, to run this job. To use an HDI compute target:\n", " 1. Create an HDInsight Spark cluster in Azure. Here are some [quick instructions](https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql). Make sure you use the Ubuntu flavor, NOT CentOS.\n", " 2. Enter the cluster's SSH address, username, and password below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.core.compute import ComputeTarget, HDInsightCompute\n", "from azureml.exceptions import ComputeTargetException\n", "import os\n", "\n", "try:\n", "    # To connect using an SSH key instead of a username/password, provide the parameters private_key_file and private_key_passphrase\n", "    attach_config = HDInsightCompute.attach_configuration(address=os.environ.get('hdiservername', '-ssh.azurehdinsight.net'),\n", "                                                          ssh_port=22,\n", "                                                          username=os.environ.get('hdiusername', ''),\n", "                                                          password=os.environ.get('hdipassword', ''))\n", "    hdi_compute = ComputeTarget.attach(workspace=ws,\n", "                                       name='myhdi',\n", "                                       attach_configuration=attach_config)\n", "    hdi_compute.wait_for_completion(show_output=True)\n", "\n", "except ComputeTargetException as e:\n", "    print(\"Caught = {}\".format(e.message))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure HDI run" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Configure an execution using the HDInsight cluster with a conda environment that has `numpy`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.core.runconfig import RunConfiguration\n", "from azureml.core.conda_dependencies import CondaDependencies\n", "\n", "# Use the PySpark framework\n", "hdi_run_config = RunConfiguration(framework=\"pyspark\")\n", "\n", "# Set the compute target to the HDI cluster\n", "hdi_run_config.target = hdi_compute.name\n", "\n", "# Specify a CondaDependencies object so the system installs numpy\n", "cd = CondaDependencies()\n", "cd.add_conda_package('numpy')\n", "hdi_run_config.environment.python.conda_dependencies = cd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit the script to HDI" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.core import ScriptRunConfig\n", "\n", "script_run_config = ScriptRunConfig(source_directory='.',\n", "                                    script='train-spark.py',\n", "                                    run_config=hdi_run_config)\n", "run = exp.submit(config=script_run_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Monitor the run using a Jupyter widget." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.widgets import RunDetails\n", "RunDetails(run).show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run)." ] },
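{ "cell_type": "markdown", "metadata": {}, "source": [ "The widget above monitors the run asynchronously. If you instead want to block the notebook until the run finishes, for example before fetching metrics, the cell below is a minimal sketch using `wait_for_completion`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Block until the run completes, streaming log output into the notebook\n", "run.wait_for_completion(show_output=True)" ] },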
{ "cell_type": "markdown", "metadata": {}, "source": [ "After the run has successfully finished, you can check the logged metrics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get all metrics logged in the run\n", "metrics = run.get_metrics()\n", "print(metrics)" ] } ], "metadata": { "authors": [ { "name": "aashishb" } ], "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" }, "friendly_name": "Training in Spark", "exclude_from_index": false, "index_order": 1, "category": "training", "task": "Submitting a run on a Spark cluster", "datasets": [ "None" ], "compute": [ "HDI cluster" ], "deployment": [ "None" ], "framework": [ "PySpark" ], "tags": [ "None" ] }, "nbformat": 4, "nbformat_minor": 2 }