mirror of https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-20 09:37:04 -05:00

Compare commits: azureml-sd...shbijlan-u (3 commits)
Commits:
- 279a1ba2c0
- 8233533dcd
- 89f23e6d50
.vscode/settings.json (vendored, new file, 1 line added)

@@ -0,0 +1 @@
+{}
@@ -19,20 +19,21 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "# Using Databricks as a Compute Target from Azure Machine Learning Pipeline\r\n",
 "To use Databricks as a compute target from [Azure Machine Learning Pipeline](https://aka.ms/pl-concept), a [DatabricksStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.databricks_step.databricksstep?view=azure-ml-py) is used. This notebook demonstrates the use of DatabricksStep in Azure Machine Learning Pipeline.\r\n",
 "\r\n",
 "The notebook will show:\r\n",
 "1. Running an arbitrary Databricks notebook that the customer has in Databricks workspace\r\n",
 "2. Running an arbitrary Python script that the customer has in DBFS\r\n",
 "3. Running an arbitrary Python script that is available on local computer (will upload to DBFS, and then run in Databricks) \r\n",
 "4. Running a JAR job that the customer has in DBFS.\r\n",
+"5. How to get run context in a Databricks interactive cluster\r\n",
 "\r\n",
 "## Before you begin:\r\n",
 "\r\n",
 "1. **Create an Azure Databricks workspace** in the same subscription where you have your Azure Machine Learning workspace. You will need details of this workspace later on to define DatabricksStep. [Click here](https://ms.portal.azure.com/#blade/HubsExtension/Resources/resourceType/Microsoft.Databricks%2Fworkspaces) for more information.\r\n",
 "2. **Create PAT (access token)**: Manually create a Databricks access token at the Azure Databricks portal. See [this](https://docs.databricks.com/api/latest/authentication.html#generate-a-token) for more information.\r\n",
 "3. **Add demo notebook to ADB**: This notebook has a sample you can use as is. Launch Azure Databricks attached to your Azure Machine Learning workspace and add a new notebook. \r\n",
 "4. **Create/attach a Blob storage** for use from ADB"
 ]
 },
@@ -48,33 +49,33 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "```python\r\n",
 "# direct access\r\n",
 "dbutils.widgets.get(\"myparam\")\r\n",
 "p = getArgument(\"myparam\")\r\n",
 "print (\"Param -\\'myparam':\")\r\n",
 "print (p)\r\n",
 "\r\n",
 "dbutils.widgets.get(\"input\")\r\n",
 "i = getArgument(\"input\")\r\n",
 "print (\"Param -\\'input':\")\r\n",
 "print (i)\r\n",
 "\r\n",
 "dbutils.widgets.get(\"output\")\r\n",
 "o = getArgument(\"output\")\r\n",
 "print (\"Param -\\'output':\")\r\n",
 "print (o)\r\n",
 "\r\n",
 "n = i + \"/testdata.txt\"\r\n",
 "df = spark.read.csv(n)\r\n",
 "\r\n",
 "display (df)\r\n",
 "\r\n",
 "data = [('value1', 'value2')]\r\n",
 "df2 = spark.createDataFrame(data)\r\n",
 "\r\n",
 "z = o + \"/output.txt\"\r\n",
 "df2.write.csv(z)\r\n",
 "```"
 ]
 },
@@ -91,18 +92,18 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "import os\r\n",
 "import azureml.core\r\n",
 "from azureml.core.runconfig import JarLibrary\r\n",
 "from azureml.core.compute import ComputeTarget, DatabricksCompute\r\n",
 "from azureml.exceptions import ComputeTargetException\r\n",
 "from azureml.core import Workspace, Experiment\r\n",
 "from azureml.pipeline.core import Pipeline, PipelineData\r\n",
 "from azureml.pipeline.steps import DatabricksStep\r\n",
 "from azureml.core.datastore import Datastore\r\n",
 "from azureml.data.data_reference import DataReference\r\n",
 "\r\n",
 "# Check core SDK version number\r\n",
 "print(\"SDK version:\", azureml.core.VERSION)"
 ]
 },
@@ -121,7 +122,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "ws = Workspace.from_config()\r\n",
 "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')"
 ]
 },
@@ -149,29 +150,29 @@
 },
 "outputs": [],
 "source": [
 "# Replace with your account info before running.\r\n",
 " \r\n",
 "db_compute_name=os.getenv(\"DATABRICKS_COMPUTE_NAME\", \"<my-databricks-compute-name>\") # Databricks compute name\r\n",
 "db_resource_group=os.getenv(\"DATABRICKS_RESOURCE_GROUP\", \"<my-db-resource-group>\") # Databricks resource group\r\n",
 "db_workspace_name=os.getenv(\"DATABRICKS_WORKSPACE_NAME\", \"<my-db-workspace-name>\") # Databricks workspace name\r\n",
 "db_access_token=os.getenv(\"DATABRICKS_ACCESS_TOKEN\", \"<my-access-token>\") # Databricks access token\r\n",
 " \r\n",
 "try:\r\n",
 "    databricks_compute = DatabricksCompute(workspace=ws, name=db_compute_name)\r\n",
 "    print('Compute target {} already exists'.format(db_compute_name))\r\n",
 "except ComputeTargetException:\r\n",
 "    print('Compute not found, will use below parameters to attach new one')\r\n",
 "    print('db_compute_name {}'.format(db_compute_name))\r\n",
 "    print('db_resource_group {}'.format(db_resource_group))\r\n",
 "    print('db_workspace_name {}'.format(db_workspace_name))\r\n",
 "    print('db_access_token {}'.format(db_access_token))\r\n",
 " \r\n",
 "    config = DatabricksCompute.attach_configuration(\r\n",
 "        resource_group = db_resource_group,\r\n",
 "        workspace_name = db_workspace_name,\r\n",
 "        access_token= db_access_token)\r\n",
 "    databricks_compute=ComputeTarget.attach(ws, db_compute_name, config)\r\n",
 "    databricks_compute.wait_for_completion(True)\r\n"
 ]
 },
 {
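A quick way to confirm the attach succeeded is to list the workspace's compute targets. This is a small aside (not part of the diffed notebook), assuming the `ws` object from the cells above:

```python
# List the compute targets attached to the workspace and confirm that the
# Databricks compute shows up under the name used above.
for name, target in ws.compute_targets.items():
    print(name, target.type)
```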
@@ -303,20 +304,20 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "from azureml.pipeline.core import PipelineParameter\r\n",
 "\r\n",
 "# Use the default blob storage\r\n",
 "def_blob_store = Datastore(ws, \"workspaceblobstore\")\r\n",
 "print('Datastore {} will be used'.format(def_blob_store.name))\r\n",
 "\r\n",
 "pipeline_param = PipelineParameter(name=\"my_pipeline_param\", default_value=\"pipeline_param1\")\r\n",
 "\r\n",
 "# We are uploading a sample file in the local directory to be used as a datasource\r\n",
 "def_blob_store.upload_files(files=[\"./testdata.txt\"], target_path=\"dbtest\", overwrite=False)\r\n",
 "\r\n",
 "step_1_input = DataReference(datastore=def_blob_store, path_on_datastore=\"dbtest\",\r\n",
 "                             data_reference_name=\"input\")\r\n",
 "\r\n",
 "step_1_output = PipelineData(\"output\", datastore=def_blob_store)"
 ]
 },
@@ -412,7 +413,7 @@
 "metadata": {},
 "source": [
 "### 1. Running the demo notebook already added to the Databricks workspace\n",
-"Create a notebook in the Azure Databricks workspace, and provide the path to that notebook as the value associated with the environment variable \"DATABRICKS_NOTEBOOK_PATH\". This will then set the variable\u00c2\u00a0notebook_path\u00c2\u00a0when you run the code cell below:\n",
+"Create a notebook in the Azure Databricks workspace, and provide the path to that notebook as the value associated with the environment variable \"DATABRICKS_NOTEBOOK_PATH\". This will then set the variable notebook_path when you run the code cell below:\n",
 "\n",
 "your notebook's path in Azure Databricks UI by hovering over to notebook's title. A typical path of notebook looks like this `/Users/example@databricks.com/example`. See [Databricks Workspace](https://docs.azuredatabricks.net/user-guide/workspace.html) to learn about the folder structure.\n",
 "\n",
@@ -425,19 +426,19 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "notebook_path=os.getenv(\"DATABRICKS_NOTEBOOK_PATH\", \"<my-databricks-notebook-path>\") # Databricks notebook path\r\n",
 "\r\n",
 "dbNbStep = DatabricksStep(\r\n",
 "    name=\"DBNotebookInWS\",\r\n",
 "    inputs=[step_1_input],\r\n",
 "    outputs=[step_1_output],\r\n",
 "    num_workers=1,\r\n",
 "    notebook_path=notebook_path,\r\n",
 "    notebook_params={'myparam': 'testparam', \r\n",
 "                     'myparam2': pipeline_param},\r\n",
 "    run_name='DB_Notebook_demo',\r\n",
 "    compute_target=databricks_compute,\r\n",
 "    allow_reuse=True\r\n",
 ")"
 ]
 },
@@ -456,9 +457,9 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "steps = [dbNbStep]\r\n",
 "pipeline = Pipeline(workspace=ws, steps=steps)\r\n",
 "pipeline_run = Experiment(ws, 'DB_Notebook_demo').submit(pipeline)\r\n",
 "pipeline_run.wait_for_completion()"
 ]
 },
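The cell above submits the pipeline with the default value of `my_pipeline_param`. As a small aside (not part of the diffed notebook, assuming the `ws` and `pipeline` objects built in the cells above), the default can be overridden per run through `pipeline_parameters`:

```python
from azureml.core import Experiment

# Override the PipelineParameter default ("pipeline_param1") for this run only.
experiment = Experiment(ws, 'DB_Notebook_demo')
pipeline_run = experiment.submit(pipeline,
                                 pipeline_parameters={'my_pipeline_param': 'another_value'})
pipeline_run.wait_for_completion()
```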
@@ -475,7 +476,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "from azureml.widgets import RunDetails\r\n",
 "RunDetails(pipeline_run).show()"
 ]
 },
@@ -503,17 +504,17 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "python_script_path = os.getenv(\"DATABRICKS_PYTHON_SCRIPT_PATH\", \"<my-databricks-python-script-path>\") # Databricks python script path\r\n",
 "\r\n",
 "dbPythonInDbfsStep = DatabricksStep(\r\n",
 "    name=\"DBPythonInDBFS\",\r\n",
 "    inputs=[step_1_input],\r\n",
 "    num_workers=1,\r\n",
 "    python_script_path=python_script_path,\r\n",
 "    python_script_params={'arg1', pipeline_param, 'arg2'},\r\n",
 "    run_name='DB_Python_demo',\r\n",
 "    compute_target=databricks_compute,\r\n",
 "    allow_reuse=True\r\n",
 ")"
 ]
 },
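The `python_script_params` above are handed to the script that lives in DBFS. As a hedged sketch (hypothetical script content, not part of the diff), assuming the step forwards these values as ordinary command-line arguments, the DBFS-side script could read them like this:

```python
import sys

def main(argv):
    # argv[1:] would carry 'arg1', 'arg2' and the resolved value of the pipeline
    # parameter; the step above uses a set literal, so order is not guaranteed.
    for value in argv[1:]:
        print('received param:', value)

if __name__ == '__main__':
    main(sys.argv)
```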
@@ -530,9 +531,9 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "steps = [dbPythonInDbfsStep]\r\n",
 "pipeline = Pipeline(workspace=ws, steps=steps)\r\n",
 "pipeline_run = Experiment(ws, 'DB_Python_demo').submit(pipeline)\r\n",
 "pipeline_run.wait_for_completion()"
 ]
 },
@@ -549,7 +550,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "from azureml.widgets import RunDetails\r\n",
 "RunDetails(pipeline_run).show()"
 ]
 },
@@ -573,18 +574,18 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "python_script_name = \"train-db-local.py\"\r\n",
 "source_directory = \"./databricks_train\"\r\n",
 "\r\n",
 "dbPythonInLocalMachineStep = DatabricksStep(\r\n",
 "    name=\"DBPythonInLocalMachine\",\r\n",
 "    inputs=[step_1_input],\r\n",
 "    num_workers=1,\r\n",
 "    python_script_name=python_script_name,\r\n",
 "    source_directory=source_directory,\r\n",
 "    run_name='DB_Python_Local_demo',\r\n",
 "    compute_target=databricks_compute,\r\n",
 "    allow_reuse=True\r\n",
 ")"
 ]
 },
@@ -601,9 +602,9 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "steps = [dbPythonInLocalMachineStep]\r\n",
 "pipeline = Pipeline(workspace=ws, steps=steps)\r\n",
 "pipeline_run = Experiment(ws, 'DB_Python_Local_demo').submit(pipeline)\r\n",
 "pipeline_run.wait_for_completion()"
 ]
 },
@@ -620,7 +621,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "from azureml.widgets import RunDetails\r\n",
 "RunDetails(pipeline_run).show()"
 ]
 },
@@ -646,19 +647,19 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "main_jar_class_name = \"com.microsoft.aeva.Main\"\r\n",
 "jar_library_dbfs_path = os.getenv(\"DATABRICKS_JAR_LIB_PATH\", \"<my-databricks-jar-lib-path>\") # Databricks jar library path\r\n",
 "\r\n",
 "dbJarInDbfsStep = DatabricksStep(\r\n",
 "    name=\"DBJarInDBFS\",\r\n",
 "    inputs=[step_1_input],\r\n",
 "    num_workers=1,\r\n",
 "    main_class_name=main_jar_class_name,\r\n",
 "    jar_params={'arg1', pipeline_param, 'arg2'},\r\n",
 "    run_name='DB_JAR_demo',\r\n",
 "    jar_libraries=[JarLibrary(jar_library_dbfs_path)],\r\n",
 "    compute_target=databricks_compute,\r\n",
 "    allow_reuse=True\r\n",
 ")"
 ]
 },
@@ -675,9 +676,9 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "steps = [dbJarInDbfsStep]\r\n",
 "pipeline = Pipeline(workspace=ws, steps=steps)\r\n",
 "pipeline_run = Experiment(ws, 'DB_JAR_demo').submit(pipeline)\r\n",
 "pipeline_run.wait_for_completion()"
 ]
 },
@@ -694,19 +695,19 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "from azureml.widgets import RunDetails\r\n",
 "RunDetails(pipeline_run).show()"
 ]
 },
 {
+"cell_type": "markdown",
+"metadata": {},
 "source": [
 "### 5. Running demo notebook already added to the Databricks workspace using existing cluster\n",
 "First you need register DBFS datastore and make sure path_on_datastore does exist in databricks file system, you can browser the files by refering [this](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html).\n",
 "\n",
 "Find existing_cluster_id by opeing Azure Databricks UI with Clusters page and in url you will find a string connected with '-' right after \"clusters/\"."
-],
+]
-"cell_type": "markdown",
-"metadata": {}
 },
 {
 "cell_type": "code",
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"try:\n",
|
"try:\r\n",
|
||||||
" dbfs_ds = Datastore.get(workspace=ws, datastore_name='dbfs_datastore')\n",
|
" dbfs_ds = Datastore.get(workspace=ws, datastore_name='dbfs_datastore')\r\n",
|
||||||
" print('DBFS Datastore already exists')\n",
|
" print('DBFS Datastore already exists')\r\n",
|
||||||
"except Exception as ex:\n",
|
"except Exception as ex:\r\n",
|
||||||
" dbfs_ds = Datastore.register_dbfs(ws, datastore_name='dbfs_datastore')\n",
|
" dbfs_ds = Datastore.register_dbfs(ws, datastore_name='dbfs_datastore')\r\n",
|
||||||
"\n",
|
"\r\n",
|
||||||
"step_1_input = DataReference(datastore=dbfs_ds, path_on_datastore=\"FileStore\", data_reference_name=\"input\")\n",
|
"step_1_input = DataReference(datastore=dbfs_ds, path_on_datastore=\"FileStore\", data_reference_name=\"input\")\r\n",
|
||||||
"step_1_output = PipelineData(\"output\", datastore=dbfs_ds)"
|
"step_1_output = PipelineData(\"output\", datastore=dbfs_ds)"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
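The DataReference above points at the "FileStore" folder in DBFS, and the earlier markdown notes that this path must already exist. As a small aside (not part of the diffed notebook): from inside a Databricks notebook, where `dbutils` is predefined, the path can be checked like this:

```python
# Run inside a Databricks notebook; lists the DBFS folder referenced above.
for entry in dbutils.fs.ls("dbfs:/FileStore"):
    print(entry.path)
```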
@@ -730,26 +731,26 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "dbNbWithExistingClusterStep = DatabricksStep(\r\n",
 "    name=\"DBFSReferenceWithExisting\",\r\n",
 "    inputs=[step_1_input],\r\n",
 "    outputs=[step_1_output],\r\n",
 "    notebook_path=notebook_path,\r\n",
 "    notebook_params={'myparam': 'testparam', \r\n",
 "                     'myparam2': pipeline_param},\r\n",
 "    run_name='DBFS_Reference_With_Existing',\r\n",
 "    compute_target=databricks_compute,\r\n",
 "    existing_cluster_id=\"your existing cluster id\",\r\n",
 "    allow_reuse=True\r\n",
 ")"
 ]
 },
 {
+"cell_type": "markdown",
+"metadata": {},
 "source": [
 "#### Build and submit the Experiment"
-],
+]
-"cell_type": "markdown",
-"metadata": {}
 },
 {
 "cell_type": "code",
@@ -757,18 +758,18 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "steps = [dbNbWithExistingClusterStep]\r\n",
 "pipeline = Pipeline(workspace=ws, steps=steps)\r\n",
 "pipeline_run = Experiment(ws, 'DBFS_Reference_With_Existing').submit(pipeline)\r\n",
 "pipeline_run.wait_for_completion()"
 ]
 },
 {
+"cell_type": "markdown",
+"metadata": {},
 "source": [
 "#### View Run Details"
-],
+]
-"cell_type": "markdown",
-"metadata": {}
 },
 {
 "cell_type": "code",
@@ -776,19 +777,19 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "from azureml.widgets import RunDetails\r\n",
 "RunDetails(pipeline_run).show()"
 ]
 },
 {
+"cell_type": "markdown",
+"metadata": {},
 "source": [
-"### 6. Running a Python script in Databricks that currenlty is in local computer with existing cluster\n",
+"### 6. Running a Python script in Databricks that is currently in local computer with existing cluster\r\n",
 "When you access azure blob or data lake storage from an existing (interactive) cluster, you need to ensure the Spark configuration is set up correctly to access this storage and this set up may require the cluster to be restarted.\r\n",
 "\r\n",
 "If you set permit_cluster_restart to True, AML will check if the spark configuration needs to be updated and restart the cluster for you if required. This will ensure that the storage can be correctly accessed from the Databricks cluster."
-],
+]
-"cell_type": "markdown",
-"metadata": {}
 },
 {
 "cell_type": "code",
@@ -796,28 +797,28 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "step_1_input = DataReference(datastore=def_blob_store, path_on_datastore=\"dbtest\",\r\n",
 "                             data_reference_name=\"input\")\r\n",
 "\r\n",
 "dbPythonInLocalWithExistingStep = DatabricksStep(\r\n",
 "    name=\"DBPythonInLocalMachineWithExisting\",\r\n",
 "    inputs=[step_1_input],\r\n",
 "    python_script_name=python_script_name,\r\n",
 "    source_directory=source_directory,\r\n",
 "    run_name='DB_Python_Local_existing_demo',\r\n",
 "    compute_target=databricks_compute,\r\n",
 "    existing_cluster_id=\"your existing cluster id\",\r\n",
 "    allow_reuse=False,\r\n",
 "    permit_cluster_restart=True\r\n",
 ")"
 ]
 },
 {
+"cell_type": "markdown",
+"metadata": {},
 "source": [
 "#### Build and submit the Experiment"
-],
+]
-"cell_type": "markdown",
-"metadata": {}
 },
 {
 "cell_type": "code",
@@ -825,18 +826,18 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "steps = [dbPythonInLocalWithExistingStep]\r\n",
 "pipeline = Pipeline(workspace=ws, steps=steps)\r\n",
 "pipeline_run = Experiment(ws, 'DB_Python_Local_existing_demo').submit(pipeline)\r\n",
 "pipeline_run.wait_for_completion()"
 ]
 },
 {
+"cell_type": "markdown",
+"metadata": {},
 "source": [
 "#### View Run Details"
-],
+]
-"cell_type": "markdown",
-"metadata": {}
 },
 {
 "cell_type": "code",
@@ -844,17 +845,70 @@
 "metadata": {},
 "outputs": [],
 "source": [
 "from azureml.widgets import RunDetails\r\n",
 "RunDetails(pipeline_run).show()"
 ]
 },
 {
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### How to get run context in a Databricks interactive cluster\r\n",
+"\r\n",
+"Users are used to being able to use Run.get_context() to retrieve the parent_run_id for a given run_id. In DatabricksStep, however, a little more work is required to achieve this.\r\n",
+"\r\n",
+"The solution is to parse the script arguments and set corresponding environment variables to access the run context from within Databricks.\r\n",
+"Note that this workaround is not required for job clusters. \r\n",
+"\r\n",
+"Here is a code sample:"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"```python\r\n",
+"from azureml.core import Run\r\n",
+"import argparse\r\n",
+"import os\r\n",
+"\r\n",
+"\r\n",
+"def populate_environ():\r\n",
+"    parser = argparse.ArgumentParser(description='Process arguments passed to script')\r\n",
+"    parser.add_argument('--AZUREML_SCRIPT_DIRECTORY_NAME')\r\n",
+"    parser.add_argument('--AZUREML_RUN_TOKEN')\r\n",
+"    parser.add_argument('--AZUREML_RUN_TOKEN_EXPIRY')\r\n",
+"    parser.add_argument('--AZUREML_RUN_ID')\r\n",
+"    parser.add_argument('--AZUREML_ARM_SUBSCRIPTION')\r\n",
+"    parser.add_argument('--AZUREML_ARM_RESOURCEGROUP')\r\n",
+"    parser.add_argument('--AZUREML_ARM_WORKSPACE_NAME')\r\n",
+"    parser.add_argument('--AZUREML_ARM_PROJECT_NAME')\r\n",
+"    parser.add_argument('--AZUREML_SERVICE_ENDPOINT')\r\n",
+"\r\n",
+"    args = parser.parse_args()\r\n",
+"    os.environ['AZUREML_SCRIPT_DIRECTORY_NAME'] = args.AZUREML_SCRIPT_DIRECTORY_NAME\r\n",
+"    os.environ['AZUREML_RUN_TOKEN'] = args.AZUREML_RUN_TOKEN\r\n",
+"    os.environ['AZUREML_RUN_TOKEN_EXPIRY'] = args.AZUREML_RUN_TOKEN_EXPIRY\r\n",
+"    os.environ['AZUREML_RUN_ID'] = args.AZUREML_RUN_ID\r\n",
+"    os.environ['AZUREML_ARM_SUBSCRIPTION'] = args.AZUREML_ARM_SUBSCRIPTION\r\n",
+"    os.environ['AZUREML_ARM_RESOURCEGROUP'] = args.AZUREML_ARM_RESOURCEGROUP\r\n",
+"    os.environ['AZUREML_ARM_WORKSPACE_NAME'] = args.AZUREML_ARM_WORKSPACE_NAME\r\n",
+"    os.environ['AZUREML_ARM_PROJECT_NAME'] = args.AZUREML_ARM_PROJECT_NAME\r\n",
+"    os.environ['AZUREML_SERVICE_ENDPOINT'] = args.AZUREML_SERVICE_ENDPOINT\r\n",
+"\r\n",
+"populate_environ()\r\n",
+"run = Run.get_context(allow_offline=False)\r\n",
+"print(run._run_dto[\"parent_run_id\"])\r\n",
+"```"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
 "source": [
 "# Next: ADLA as a Compute Target\n",
 "To use ADLA as a compute target from Azure Machine Learning Pipeline, a AdlaStep is used. This [notebook](https://aka.ms/pl-adla) demonstrates the use of AdlaStep in Azure Machine Learning Pipeline."
-],
+]
-"cell_type": "markdown",
-"metadata": {}
 }
 ],
 "metadata": {
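The closing cell points to the separate ADLA notebook. As a rough, hedged sketch only (the linked notebook is the authoritative example; the compute name 'adla_compute' and the U-SQL script path here are assumptions), an AdlaStep is wired into a pipeline in much the same way as a DatabricksStep:

```python
from azureml.core.compute import AdlaCompute
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import AdlaStep

# Assumes an Azure Data Lake Analytics account already attached to the workspace
# under the name 'adla_compute', and a hypothetical U-SQL script in ./adla_scripts.
adla_compute = AdlaCompute(ws, 'adla_compute')
adla_step = AdlaStep(
    name="adla_script_step",
    script_name="sample_script.usql",
    source_directory="./adla_scripts",
    inputs=[step_1_input],
    compute_target=adla_compute)
pipeline = Pipeline(workspace=ws, steps=[adla_step])
```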