Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It gives you the freedom to query data on your terms, using either serverless or dedicated resources, at scale. Azure Synapse brings these worlds together with a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs. A core offering within Azure Synapse Analytics is serverless Apache Spark pools enhanced for big data workloads.

The Synapse integration in Azure ML (AML) is for customers who want to use Apache Spark in Azure Synapse Analytics to prepare data at scale in Azure ML before training their ML models. It lets customers work through the end-to-end ML lifecycle, including large-scale data preparation, model training, and deployment, within an Azure ML workspace, without resorting to suboptimal tools for machine learning or switching between multiple tools for data preparation and model training. Performing all ML tasks within Azure ML reduces the time customers need to iterate on a machine learning project, which typically includes multiple rounds of data preparation and training.

In the public preview, the following capabilities are provided:
- Link Azure Synapse Analytics workspace to Azure Machine Learning workspace (via ARM, UI or SDK)
- Attach Apache Spark pools powered by Azure Synapse Analytics as Azure Machine Learning compute targets (via ARM, UI or SDK)
- Launch Apache Spark sessions in notebooks and perform interactive data exploration and preparation. This interactive experience leverages Apache Spark magic and customers will have session-level Conda support to install packages.
- Productionize ML pipelines by leveraging Apache Spark pools to pre-process big data
# Using Synapse in Azure Machine Learning
## Create Synapse resources
Follow the documents below to create a Synapse workspace. The script `resource-setup.sh` is also available to create the resources for you.
- Create from [Portal](https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-workspace)
- Create from [CLI](https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-workspace-cli)

Follow the documents below to create a Synapse Spark pool.
- Create from [Portal](https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-apache-spark-pool-portal)
- Create from [CLI](https://docs.microsoft.com/en-us/cli/azure/ext/synapse/synapse/spark/pool?view=azure-cli-latest)
## Link Synapse Workspace
Make sure you are the owner of the Synapse workspace so that you can link it to AML.
You can run `resource-setup.py` to link the Synapse workspace and attach the compute.
```python
from azureml.core import LinkedService, SynapseWorkspaceLinkedServiceConfiguration, Workspace

# Load the AML workspace from the local config file
ws = Workspace.from_config()

# Identify the Synapse workspace to link
synapse_link_config = SynapseWorkspaceLinkedServiceConfiguration(
    subscription_id="<subscription id>",
    resource_group="<resource group>",
    name="<synapse workspace name>",
)

# Register the link in the AML workspace
linked_service = LinkedService.register(
    workspace=ws,
    name="<link name>",
    linked_service_config=synapse_link_config,
)
```
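
The three values passed to the link configuration above together identify one Synapse workspace in Azure Resource Manager. As a small pre-flight sketch, the values can be checked for leftover template placeholders before the SDK is called. `SynapseWorkspaceRef` is a hypothetical helper introduced here for illustration and is not part of the azureml SDK; the sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynapseWorkspaceRef:
    """Hypothetical helper (not part of the azureml SDK) holding the link settings."""

    subscription_id: str
    resource_group: str
    name: str

    def __post_init__(self):
        # Reject empty values and template placeholders such as "<resource group>".
        for field_name, value in vars(self).items():
            if not isinstance(value, str) or not value.strip() or "<" in value or ">" in value:
                raise ValueError(f"{field_name} is not filled in: {value!r}")

    @property
    def resource_id(self) -> str:
        # Standard ARM resource ID layout for a Synapse workspace.
        return (
            f"/subscriptions/{self.subscription_id}"
            f"/resourceGroups/{self.resource_group}"
            f"/providers/Microsoft.Synapse/workspaces/{self.name}"
        )


# Made-up example values; replace them with your own.
ref = SynapseWorkspaceRef(
    subscription_id="00000000-0000-0000-0000-000000000000",
    resource_group="example-rg",
    name="example-synapse",
)
print(ref.resource_id)
```

The validated fields can then be passed to `SynapseWorkspaceLinkedServiceConfiguration` as `subscription_id`, `resource_group`, and `name`.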
## Attach Synapse Spark pool as Azure ML compute
```python
from azureml.core.compute import ComputeTarget, SynapseCompute

spark_pool_name = "<spark pool name>"
attached_synapse_name = "<attached compute name>"

# Describe the Spark pool to attach through the linked service
attach_config = SynapseCompute.attach_configuration(
    linked_service,
    type="SynapseSpark",
    pool_name=spark_pool_name,
)

# Attach the pool as a compute target and wait for the operation to finish
synapse_compute = ComputeTarget.attach(
    workspace=ws,
    name=attached_synapse_name,
    attach_configuration=attach_config,
)
synapse_compute.wait_for_completion()
```
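
When the attach snippet above is rerun, for example from a setup script, it can be convenient to reuse a compute target that is already attached. The following is a minimal sketch of that get-or-attach pattern, written against plain callables so it runs without the SDK. `get_or_attach` is a hypothetical helper, and the commented wiring assumes that looking up a missing compute target raises `ComputeTargetException`, as in the commonly used AML compute examples.

```python
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def get_or_attach(
    lookup: Callable[[], T],
    attach: Callable[[], T],
    not_found: Type[BaseException],
) -> T:
    """Return the existing target if lookup succeeds, otherwise attach a new one."""
    try:
        return lookup()
    except not_found:
        # Only the "not found" error triggers an attach; other errors propagate.
        return attach()


# Assumed wiring with the SDK objects from the snippet above (not executed here):
#   from azureml.core.compute_target import ComputeTargetException
#   synapse_compute = get_or_attach(
#       lookup=lambda: ComputeTarget(workspace=ws, name=attached_synapse_name),
#       attach=lambda: ComputeTarget.attach(
#           workspace=ws,
#           name=attached_synapse_name,
#           attach_configuration=attach_config,
#       ),
#       not_found=ComputeTargetException,
#   )
#   synapse_compute.wait_for_completion()
```

Passing the exception type explicitly keeps the helper independent of the SDK and avoids swallowing unrelated failures such as authentication errors.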
## Set up permissions
Grant the Spark admin role to the system-assigned identity of the linked service so that users can submit experiment runs or pipeline runs from the AML workspace to the Synapse Spark pool.

Grant the Spark admin role to a specific user so that the user can start Spark sessions on the Synapse Spark pool.

You can get the system-assigned identity information by running:
```python
print(linked_service.system_assigned_identity_principal_id)
```
- Launch Synapse Studio for the Synapse workspace and grant the linked service MSI the "Synapse Apache Spark administrator" role.
- In the Azure portal, grant the linked service MSI the "Storage Blob Data Contributor" role on the primary ADLS Gen2 account of the Synapse workspace to use the library management feature.
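
As an alternative to clicking through the portal for the storage role, the same assignment can be scripted with the Azure CLI. The sketch below only builds the `az role assignment create` command line so it can be reviewed before being run. The helper names and sample values are hypothetical, and the principal ID is assumed to be the value printed from `linked_service.system_assigned_identity_principal_id`.

```python
import shlex
from typing import List


def storage_account_scope(subscription_id: str, resource_group: str, account_name: str) -> str:
    # Standard ARM resource ID layout for a storage account.
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account_name}"
    )


def blob_contributor_command(principal_id: str, scope: str) -> List[str]:
    # Argument list for the Azure CLI; the MSI is addressed by its object ID.
    return [
        "az", "role", "assignment", "create",
        "--assignee-object-id", principal_id,
        "--assignee-principal-type", "ServicePrincipal",
        "--role", "Storage Blob Data Contributor",
        "--scope", scope,
    ]


# Made-up example values; replace them with your own.
scope = storage_account_scope(
    "00000000-0000-0000-0000-000000000000", "example-rg", "examplestorage"
)
command = blob_contributor_command("11111111-1111-1111-1111-111111111111", scope)

# Print a shell-safe command line to review and run manually.
print(shlex.join(command))
```

Building the command as a list rather than a single string keeps the role name, which contains spaces, as one argument if the list is later passed to `subprocess.run`.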