Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Interactive Spark Session on Synapse Spark Pool

### Install package

In [None]:
!pip install -U "azureml-synapse"

For JupyterLab, please additionally run:

In [None]:
!jupyter lab build --minimize=False

## Please restart the kernel, then refresh the web page, before starting a Spark session.

## 0. How to use Spark magic for an interactive Spark experience

In [None]:
# show help
%synapse ?

## 1. Start Synapse Session

In [None]:
import os
synapse_compute_name = os.getenv("SYNAPSE_COMPUTE_NAME", "")
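
Because the cell above falls back to an empty string, a missing variable would otherwise go unnoticed until the session is started. As a minimal sketch, a fail-fast check could look like the following. `resolve_compute_name` is a hypothetical helper, not part of `azureml-synapse`, and it assumes the compute name is supplied through the `SYNAPSE_COMPUTE_NAME` environment variable.

```python
import os


def resolve_compute_name(env_var="SYNAPSE_COMPUTE_NAME"):
    """Return the compute name stored in env_var, or raise if it is unset or blank.

    Hypothetical helper for illustration; not part of azureml-synapse.
    """
    # Treat a whitespace-only value the same as a missing one.
    name = os.getenv(env_var, "").strip()
    if not name:
        raise ValueError(f"Environment variable {env_var} is not set or is empty.")
    return name
```

Assigning `synapse_compute_name = resolve_compute_name()` would then raise a `ValueError` early when the variable is missing, instead of passing an empty name to `%synapse start`.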

In [None]:
# Use the Synapse compute linked to the Compute Instance's workspace, with an AML environment.
# Conda dependencies specified in the environment are installed before the Spark session starts.

%synapse start -c $synapse_compute_name -e AzureML-Minimal

In [None]:
# Use Synapse compute from another workspace via its config file.

# %synapse start -c -f config.json

In [None]:
# Use Synapse compute from another workspace via subscription_id, resource_group and workspace_name.

# %synapse start -c -s -r -w 

In [None]:
# Start a Spark session with an AML environment.
# %synapse start -c -s -r -w -e AzureML-Minimal

## 2. Data preparation

Synapse Spark supports three types of datastore, and you can load data in two ways, as shown in the examples that follow the table.


| Datastore Type | Data Access |
|--------------------|-------------------------------|
| Blob | Credential |
| Adlsgen1 | Credential & Credential-less |
| Adlsgen2 | Credential & Credential-less |

### Example 1: Data loading by HDFS path

**Read data from Blob**

```python
# Set up an access key or a SAS token

sc._jsc.hadoopConfiguration().set("fs.azure.account.key..blob.core.windows.net", "")
sc._jsc.hadoopConfiguration().set("fs.azure.sas...blob.core.windows.net", "sas token")

df = spark.read.parquet("wasbs://@.blob.core.windows.net/")
```

**Read data from Adlsgen1**

```python
# Set up a service principal that has access to the data.
# If no data credential is set up, the user identity is used for access control.

sc._jsc.hadoopConfiguration().set("fs.adl.account..oauth2.access.token.provider.type","ClientCredential")
sc._jsc.hadoopConfiguration().set("fs.adl.account..oauth2.client.id", "")
sc._jsc.hadoopConfiguration().set("fs.adl.account..oauth2.credential", "")
sc._jsc.hadoopConfiguration().set("fs.adl.account..oauth2.refresh.url", "https://login.microsoftonline.com//oauth2/token")

df = spark.read.csv("adl://.azuredatalakestore.net/")
```
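
The four settings above share one per-account key prefix, so they can be built in one place. The following is a minimal sketch, assuming the elided segments in the snippet are the storage account name (in the keys) and the tenant ID (in the refresh URL). `adls_gen1_oauth_settings` and `apply_hadoop_settings` are hypothetical helper names, and all argument values are placeholders supplied by the caller.

```python
def adls_gen1_oauth_settings(account, client_id, client_secret, tenant_id):
    """Build the per-account Hadoop settings for service principal access to ADLS Gen1.

    Hypothetical helper; assumes `account` is the storage account name and
    `tenant_id` is the Azure AD tenant that issued the service principal.
    """
    prefix = f"fs.adl.account.{account}.oauth2"
    return {
        f"{prefix}.access.token.provider.type": "ClientCredential",
        f"{prefix}.client.id": client_id,
        f"{prefix}.credential": client_secret,
        f"{prefix}.refresh.url": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def apply_hadoop_settings(spark_context, settings):
    """Copy each key/value pair into the Spark context's Hadoop configuration."""
    conf = spark_context._jsc.hadoopConfiguration()
    for key, value in settings.items():
        conf.set(key, value)
```

Inside a Spark session, `apply_hadoop_settings(sc, adls_gen1_oauth_settings(...))` would have the same effect as the four explicit `set` calls.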

**Read data from Adlsgen2**

```python
# Set up a service principal that has access to the data.
# If no data credential is set up, the user identity is used for access control.

sc._jsc.hadoopConfiguration().set("fs.azure.account.auth.type..dfs.core.windows.net","OAuth")
sc._jsc.hadoopConfiguration().set("fs.azure.account.oauth.provider.type..dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
sc._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.id..dfs.core.windows.net", "")
sc._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.secret..dfs.core.windows.net", "")
sc._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.endpoint..dfs.core.windows.net", "https://login.microsoftonline.com//oauth2/token")

df = spark.read.csv("abfss://@.dfs.core.windows.net/")
```
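
The three snippets above each use a different URI scheme. As a rough sketch, the paths could be assembled as follows, assuming the segment before `@` is the container (or file system) name and the segment after it is the storage account name. The function names are hypothetical and are not part of any Azure or Spark library.

```python
def blob_path(container, account, path=""):
    """Return a wasbs:// URI for a path inside a Blob container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def adls_gen1_path(account, path=""):
    """Return an adl:// URI for a path inside an ADLS Gen1 account."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


def adls_gen2_path(filesystem, account, path=""):
    """Return an abfss:// URI for a path inside an ADLS Gen2 file system."""
    return f"abfss://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Example with the public sample file read in the next cell.
print(blob_path("demo", "dprepdata", "Titanic.csv"))
```

The resulting string can be passed directly to `spark.read.csv(...)` or `spark.read.parquet(...)`.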

In [None]:
%%synapse

from pyspark.sql.functions import col, desc

df = spark.read.option("header", "true").csv("wasbs://demo@dprepdata.blob.core.windows.net/Titanic.csv")
df.filter(col('Survived') == 1).groupBy('Age').count().orderBy(desc('count')).show(10)

### Example 2: Data loading by AML Dataset

You can create a tabular dataset by following the [guidance](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets) and call `to_spark_dataframe()` to load the data.

```text
%%synapse

import azureml.core
print(azureml.core.VERSION)

from azureml.core import Workspace, Dataset
ws = Workspace.get(name='', subscription_id='', resource_group='')
ds = Dataset.get_by_name(ws, "")
df = ds.to_spark_dataframe()

# You can apply further transformations to the Spark DataFrame
```

## 3. Session Metadata
After the session starts, you can check its metadata and find links to the Synapse portal.

In [None]:
%synapse meta

## 4. Stop Session
When the current session reaches a timeout or dead status, or fails for any other reason, you must explicitly stop it before starting a new one.

In [None]:
%synapse stop