{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved. \n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# AML Pipeline with AdlaStep\n",
    "This notebook demonstrates the use of AdlaStep in an AML Pipeline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## AML and Pipeline SDK-specific imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import azureml.core\n",
    "from azureml.core import Workspace, Experiment\n",
    "from azureml.exceptions import ComputeTargetException\n",
    "from azureml.core.datastore import Datastore\n",
    "from azureml.data.data_reference import DataReference\n",
    "from azureml.pipeline.core import Pipeline, PipelineData\n",
    "from azureml.pipeline.steps import AdlaStep\n",
    "\n",
    "# Check core SDK version number\n",
    "print(\"SDK version:\", azureml.core.VERSION)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialize Workspace\n",
    "\n",
    "Initialize a workspace object from persisted configuration. Make sure the config file is present at .\\config.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "create workspace"
    ]
   },
   "outputs": [],
   "source": [
    "ws = Workspace.from_config()\n",
    "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "script_folder = '.'\n",
    "experiment_name = \"adla_101_experiment\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Register Datastore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "datastore_name = 'MyAdlsDatastore'\n",
    "subscription_id = os.getenv(\"ADL_SUBSCRIPTION_62\", \"<my-subscription-id>\")  # subscription id of ADLS account\n",
    "resource_group = os.getenv(\"ADL_RESOURCE_GROUP_62\", \"<my-resource-group>\")  # resource group of ADLS account\n",
    "store_name = os.getenv(\"ADL_STORENAME_62\", \"<my-datastore-name>\")  # ADLS account name\n",
    "tenant_id = os.getenv(\"ADL_TENANT_62\", \"<my-tenant-id>\")  # tenant id of service principal\n",
    "client_id = os.getenv(\"ADL_CLIENTID_62\", \"<my-client-id>\")  # client id of service principal\n",
    "client_secret = os.getenv(\"ADL_CLIENT_62_SECRET\", \"<my-client-secret>\")  # the secret of service principal\n",
    "\n",
    "try:\n",
    "    adls_datastore = Datastore.get(ws, datastore_name)\n",
    "    print(\"found datastore with name: %s\" % datastore_name)\n",
    "except Exception:\n",
    "    adls_datastore = Datastore.register_azure_data_lake(\n",
    "        workspace=ws,\n",
    "        datastore_name=datastore_name,\n",
    "        subscription_id=subscription_id,  # subscription id of ADLS account\n",
    "        resource_group=resource_group,  # resource group of ADLS account\n",
    "        store_name=store_name,  # ADLS account name\n",
    "        tenant_id=tenant_id,  # tenant id of service principal\n",
    "        client_id=client_id,  # client id of service principal\n",
    "        client_secret=client_secret)  # the secret of service principal\n",
    "    print(\"registered datastore with name: %s\" % datastore_name)"
   ]
  },
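  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optionally, you can confirm the registration by listing the datastores attached to the workspace (a small sanity-check sketch using the workspace's `datastores` property):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List every datastore registered in the workspace along with its type,\n",
    "# to confirm that 'MyAdlsDatastore' shows up as an Azure Data Lake datastore\n",
    "for name, ds in ws.datastores.items():\n",
    "    print(name, ds.datastore_type)"
   ]
  },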
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create DataReferences and PipelineData\n",
    "\n",
    "In the code cell below, replace `datastorename` with the name of your datastore. Copy the file `testdata.txt` (located in the pipeline folder that this notebook is in) to the `testdata` path on the datastore; a possible upload command is sketched below."
   ]
  },
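  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of that copy step, assuming the Azure CLI `az dls` commands for ADLS Gen1 are available and `<my-adls-account-name>` stands for the ADLS account name you registered above:\n",
    "\n",
    "```\n",
    "az dls fs upload --account <my-adls-account-name> --source-path testdata.txt --destination-path /testdata/testdata.txt\n",
    "```"
   ]
  },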
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "datastorename = \"MyAdlsDatastore\"\n",
    "\n",
    "adls_datastore = Datastore(workspace=ws, name=datastorename)\n",
    "script_input = DataReference(\n",
    "    datastore=adls_datastore,\n",
    "    data_reference_name=\"script_input\",\n",
    "    path_on_datastore=\"testdata/testdata.txt\")\n",
    "\n",
    "script_output = PipelineData(\"script_output\", datastore=adls_datastore)\n",
    "\n",
    "print(\"Created Pipeline Data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup Data Lake Account\n",
    "\n",
    "ADLA can only consume data from the default data store associated with the ADLA account. In the Azure portal, check that the default data store of the ADLA account you intend to use matches the datastore registered above, then set `adla_compute_name` in the code cell below to the name of that ADLA account."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adla_compute_name = 'testadl'  # Replace this with the name of your ADLA account\n",
    "\n",
    "from azureml.core.compute import ComputeTarget, AdlaCompute\n",
    "\n",
    "def get_or_create_adla_compute(workspace, compute_name):\n",
    "    try:\n",
    "        return AdlaCompute(workspace, compute_name)\n",
    "    except ComputeTargetException as e:\n",
    "        if 'ComputeTargetNotFound' in e.message:\n",
    "            print('adla compute not found, creating...')\n",
    "            provisioning_config = AdlaCompute.provisioning_configuration()\n",
    "            adla_compute = ComputeTarget.create(workspace, compute_name, provisioning_config)\n",
    "            adla_compute.wait_for_completion()\n",
    "            return adla_compute\n",
    "        else:\n",
    "            raise e\n",
    "\n",
    "adla_compute = get_or_create_adla_compute(ws, adla_compute_name)\n",
    "\n",
    "# CLI:\n",
    "# Create: az ml computetarget setup adla -n <name>\n",
    "# BYOC:   az ml computetarget attach adla -n <name> -i <resource-id>"
   ]
  },
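  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you would rather attach an existing ADLA account (the \"BYOC\" case from the CLI comment above) instead of creating one, a minimal sketch using `AdlaCompute.attach_configuration` is shown below. The resource group and account name placeholders are assumptions you need to fill in; the cell is left commented out so the notebook still runs end to end:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: attach an existing ADLA account as a compute target.\n",
    "# Replace the placeholders with your ADLA account's resource group and name.\n",
    "# attach_config = AdlaCompute.attach_configuration(\n",
    "#     resource_group='<my-adla-resource-group>',\n",
    "#     account_name='<my-adla-account-name>')\n",
    "# adla_compute = ComputeTarget.attach(ws, adla_compute_name, attach_config)\n",
    "# adla_compute.wait_for_completion(show_output=True)"
   ]
  },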
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the above code cell completes, run the cell below to check your ADLA compute status:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"ADLA compute state: {}\".format(adla_compute.provisioning_state))\n",
    "print(\"ADLA compute provisioning errors: {}\".format(adla_compute.provisioning_errors))\n",
    "print(\"Using ADLA compute: {}\".format(adla_compute.cluster_resource_id))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create an AdlaStep"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**AdlaStep** is used to run a U-SQL script using Azure Data Lake Analytics.\n",
    "\n",
    "- **name:** Name of the module\n",
    "- **script_name:** Name of the U-SQL script\n",
    "- **inputs:** List of input port bindings\n",
    "- **outputs:** List of output port bindings\n",
    "- **compute_target:** The ADLA compute to use for this job\n",
    "- **params:** Dictionary of name-value pairs to pass to the U-SQL job *(optional)*\n",
    "- **degree_of_parallelism:** The degree of parallelism to use for this job *(optional)*\n",
    "- **priority:** The priority value to use for the current job *(optional)*\n",
    "- **runtime_version:** The runtime version of the Data Lake Analytics engine *(optional)*\n",
    "- **root_folder:** Folder that contains the script, assemblies, etc. *(optional)*\n",
    "- **hash_paths:** List of paths to hash to detect a change (the script file is always hashed) *(optional)*\n",
    "\n",
    "### Remarks\n",
    "\n",
    "You can use `@@name@@` syntax in your script to refer to inputs, outputs, resources, and params.\n",
    "\n",
    "* If `name` is the name of an input or output port binding, any occurrences of `@@name@@` in the script\n",
    "are replaced with the actual data path of the corresponding port binding.\n",
    "* If `name` is the name of a resource input port binding, any occurrences of `@@name@@` in the script\n",
    "are replaced with the local path of the resource after it is downloaded to the script directory on a worker node.\n",
    "* If `name` matches any key in the `params` dict, any occurrences of `@@name@@` will be replaced with\n",
    "the corresponding value in the dict.\n",
    "\n",
    "#### Sample script\n",
    "\n",
    "```\n",
    "@resourcereader =\n",
    "    EXTRACT query string\n",
    "    FROM \"@@script_input@@\"\n",
    "    USING Extractors.Csv();\n",
    "\n",
    "OUTPUT @resourcereader\n",
    "TO \"@@script_output@@\"\n",
    "USING Outputters.Csv();\n",
    "```"
   ]
  },
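  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because `@@name@@` substitution is a plain text replacement, `params` gives you a simple way to parameterize the script. A sketch (the `max_rows` parameter is purely illustrative and not used by this pipeline):\n",
    "\n",
    "```python\n",
    "adla_step = AdlaStep(\n",
    "    name='adla_script_step',\n",
    "    script_name='test_adla_script.usql',\n",
    "    inputs=[script_input],\n",
    "    outputs=[script_output],\n",
    "    params={'max_rows': '100'},  # available in the script as @@max_rows@@\n",
    "    compute_target=adla_compute)\n",
    "```\n",
    "\n",
    "Every occurrence of `@@max_rows@@` in the U-SQL script text, whether in an expression or inside a string literal, would then be replaced with `100` before the job is submitted."
   ]
  },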
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adla_step = AdlaStep(\n",
    "    name='adla_script_step',\n",
    "    script_name='test_adla_script.usql',\n",
    "    inputs=[script_input],\n",
    "    outputs=[script_output],\n",
    "    compute_target=adla_compute)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build and Submit the Experiment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = Pipeline(\n",
    "    description=\"adla_102\",\n",
    "    workspace=ws,\n",
    "    steps=[adla_step],\n",
    "    default_source_directory=script_folder)\n",
    "\n",
    "pipeline_run = Experiment(ws, experiment_name).submit(pipeline)\n",
    "pipeline_run.wait_for_completion()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View Run Details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.widgets import RunDetails\n",
    "RunDetails(pipeline_run).show()"
   ]
  },
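  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the run completes you can fetch the U-SQL output locally. A minimal sketch, assuming the step and output names used above and that the registered ADLS datastore supports direct download through the SDK (left commented out so it does not fail if the run is still in progress):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: locate the step run by name and download its 'script_output' data\n",
    "# step_run = pipeline_run.find_step_run('adla_script_step')[0]\n",
    "# port_data = step_run.get_output_data('script_output')\n",
    "# port_data.download(local_path='script_output')"
   ]
  },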
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Examine the run\n",
    "You can iterate through the step run objects and examine the job log, stdout, and stderr of each step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "step_runs = pipeline_run.get_children()\n",
    "for step_run in step_runs:\n",
    "    status = step_run.get_status()\n",
    "    print('node', step_run.name, 'status:', status)\n",
    "    if status == \"Failed\":\n",
    "        joblog = step_run.get_job_log()\n",
    "        print('job log:', joblog)\n",
    "        stdout_log = step_run.get_stdout_log()\n",
    "        print('stdout log:', stdout_log)\n",
    "        stderr_log = step_run.get_stderr_log()\n",
    "        print('stderr log:', stderr_log)\n",
    "        with open(\"logs-\" + step_run.name + \".txt\", \"w\") as f:\n",
    "            f.write(joblog)\n",
    "            print(\"Job log written to logs-\" + step_run.name + \".txt\")\n",
    "    if status == \"Finished\":\n",
    "        stdout_log = step_run.get_stdout_log()\n",
    "        print('stdout log:', stdout_log)"
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "diray"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "python36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}