mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-19 17:17:04 -05:00
215 lines
6.3 KiB
Plaintext
215 lines
6.3 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Reading from and Writing to Datastores"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
|
|
"Licensed under the MIT License."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"A datastore is a reference that points to an Azure storage service like a blob container for example. It belongs to a workspace and a workspace can have many datastores.\n",
|
|
"\n",
|
|
"A data path points to a path on the underlying Azure storage service the datastore references. For example, given a datastore named `blob` that points to an Azure blob container, a data path can point to `/test/data/titanic.csv` in the blob container."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Read data from Datastore"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Data Prep supports reading data from a `Datastore` or a `DataPath` or a `DataReference`. \n",
|
|
"\n",
|
|
"Passing in a datastore into all the `read_*` methods of Data Prep will result in reading everything in the underlying Azure storage service. To read a specific folder or file in the underlying storage, you have to pass in a data reference."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.core import Workspace, Datastore\n",
|
|
"from azureml.data.datapath import DataPath\n",
|
|
"\n",
|
|
"import azureml.dataprep as dprep"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"First, get or create a workspace. Feel free to replace `subscription_id`, `resource_group`, and `workspace_name` with other values."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"subscription_id = '35f16a99-532a-4a47-9e93-00305f6c40f2'\n",
|
|
"resource_group = 'DataStoreTest'\n",
|
|
"workspace_name = 'dataprep-centraleuap'\n",
|
|
"\n",
|
|
"workspace = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"workspace.datastores"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"You can now read a crime data set from the datastore. If you are using your own workspace, the `crime0-10.csv` will not be there by default. You will have to upload the data to the datastore yourself."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"datastore = Datastore(workspace=workspace, name='dataprep_blob')\n",
|
|
"dflow = dprep.read_csv(path=datastore.path('crime0-10.csv'))\n",
|
|
"dflow.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"You can also read from an Azure SQL database. To do that, you will first get an Azure SQL database datastore instance and pass it to Data Prep for reading."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"datastore = Datastore(workspace=workspace, name='test_sql')\n",
|
|
"dflow_sql = dprep.read_sql(data_source=datastore, query='SELECT * FROM team')\n",
|
|
"dflow_sql.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"You can also read from a PostgreSQL database. To do that, you will first get a PostgreSQL database datastore instance and pass it to Data Prep for reading."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"datastore = Datastore(workspace=workspace, name='postgre_test')\n",
|
|
"dflow_sql = dprep.read_postgresql(data_source=datastore, query='SELECT * FROM public.people')\n",
|
|
"dflow_sql.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Write data to Datastore"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"You can also write a dataflow to a datastore. The code below will write the file you read in earlier to the folder in the datastore."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dest_datastore = Datastore(workspace, 'dataprep_blob_key')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow.write_to_csv(directory_path=dest_datastore.path('output/crime0-10')).run_local()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now you can read all the files in the `dataprep_adls` datastore which references an Azure Data Lake store."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"datastore = Datastore(workspace=workspace, name='dataprep_adls')\n",
|
|
"dflow_adls = dprep.read_csv(path=DataPath(datastore, path_on_datastore='/input/crime0-10.csv'))\n",
|
|
"dflow_adls.head(5)"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "sihhu"
|
|
}
|
|
],
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.4"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
} |