diff --git a/work-with-data/dataprep/how-to-guides/datastore.ipynb b/work-with-data/dataprep/how-to-guides/datastore.ipynb
deleted file mode 100644
index 76bc0ed4..00000000
--- a/work-with-data/dataprep/how-to-guides/datastore.ipynb
+++ /dev/null
@@ -1,246 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/datastore.png)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Reading from and Writing to Datastores"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "A datastore is a reference that points to an Azure storage service, such as a blob container. It belongs to a workspace, and a workspace can have many datastores.\n",
-    "\n",
-    "A data path points to a location on the underlying Azure storage service that the datastore references. For example, given a datastore named `blob` that points to an Azure blob container, a data path can point to `/test/data/titanic.csv` in that container."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Read data from Datastore"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Data Prep supports reading data from a `Datastore`, a `DataPath`, or a `DataReference`.\n",
-    "\n",
-    "Passing a datastore into any of the `read_*` methods of Data Prep will read everything in the underlying Azure storage service. To read a specific folder or file in the underlying storage, pass in a data reference instead."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from azureml.core import Workspace, Datastore\n",
-    "from azureml.data.datapath import DataPath\n",
-    "\n",
-    "import azureml.dataprep as dprep"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "First, get or create a workspace. Feel free to replace `subscription_id`, `resource_group`, and `workspace_name` with your own values."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "subscription_id = '35f16a99-532a-4a47-9e93-00305f6c40f2'\n",
-    "resource_group = 'DataStoreTest'\n",
-    "workspace_name = 'dataprep-centraleuap'\n",
-    "\n",
-    "workspace = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "workspace.datastores"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "You can now read a crime data set from the datastore. If you are using your own workspace, `crime0-10.csv` will not be there by default; you will have to upload the data to the datastore yourself."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "datastore = Datastore(workspace=workspace, name='dataprep_blob')\n",
-    "dflow = dprep.read_csv(path=datastore.path('crime0-10.csv'))\n",
-    "dflow.head(5)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "You can also read from an Azure SQL database. To do that, first get an Azure SQL database datastore instance and pass it to Data Prep for reading."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "datastore = Datastore(workspace=workspace, name='test_sql')\n",
-    "dflow_sql = dprep.read_sql(data_source=datastore, query='SELECT * FROM team')\n",
-    "dflow_sql.head(5)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "You can also read from a PostgreSQL database. To do that, first get a PostgreSQL database datastore instance and pass it to Data Prep for reading."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "datastore = Datastore(workspace=workspace, name='postgre_test')\n",
-    "dflow_sql = dprep.read_postgresql(data_source=datastore, query='SELECT * FROM public.people')\n",
-    "dflow_sql.head(5)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Write data to Datastore"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "You can also write a dataflow to a datastore. The code below writes the file you read in earlier to a folder in the datastore."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "dest_datastore = Datastore(workspace, 'dataprep_blob_key')"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "dflow.write_to_csv(directory_path=dest_datastore.path('output/crime0-10')).run_local()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now you can read a file from the `dataprep_adls` datastore, which references an Azure Data Lake Store."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "datastore = Datastore(workspace=workspace, name='dataprep_adls')\n",
-    "dflow_adls = dprep.read_csv(path=DataPath(datastore, path_on_datastore='/input/crime0-10.csv'))\n",
-    "dflow_adls.head(5)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Similarly, you can read from the `adlsgen2` datastore, which references an ADLS Gen2 storage account: either a single file or all files in a directory."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# read a single file from ADLS Gen2\n",
-    "datastore = Datastore(workspace=workspace, name='adlsgen2')\n",
-    "dflow_adlsgen2 = dprep.read_csv(path=DataPath(datastore, path_on_datastore='/testfolder/peopletest.csv'))\n",
-    "dflow_adlsgen2.head(5)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# read all files from an ADLS Gen2 directory\n",
-    "datastore = Datastore(workspace=workspace, name='adlsgen2')\n",
-    "dflow_adlsgen2 = dprep.read_csv(path=DataPath(datastore, path_on_datastore='/testfolder/testdir'))\n",
-    "dflow_adlsgen2.head()"
-   ]
-  }
- ],
- "metadata": {
-  "authors": [
-   {
-    "name": "sihhu"
-   }
-  ],
-  "kernelspec": {
-   "display_name": "Python 3.6",
-   "language": "python",
-   "name": "python36"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.6.4"
-  },
-  "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
\ No newline at end of file
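The deleted notebook's intro describes the datastore/data-path relationship (a datastore references a storage service; a data path points to a location within it) without code at that point. A minimal generic sketch of that idea, using hypothetical names that are not part of the azureml SDK:

```python
# Hypothetical sketch of the datastore / data-path pattern described in the
# notebook's intro. DatastoreRef is an illustrative stand-in, NOT the
# azureml.core.Datastore class.
from dataclasses import dataclass


@dataclass(frozen=True)
class DatastoreRef:
    name: str           # registered name within the workspace, e.g. 'blob'
    container_url: str  # underlying storage service, e.g. a blob container

    def path(self, relative_path: str) -> str:
        # Resolve a data path against the datastore's storage root,
        # normalizing the slashes at the join point.
        return f"{self.container_url.rstrip('/')}/{relative_path.lstrip('/')}"


blob = DatastoreRef(name="blob",
                    container_url="https://account.blob.core.windows.net/container")
print(blob.path("/test/data/titanic.csv"))
# https://account.blob.core.windows.net/container/test/data/titanic.csv
```

This mirrors how the notebook's `datastore.path('crime0-10.csv')` call resolves a relative location against the storage the datastore references.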
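The notebook reads from SQL datastores via `dprep.read_sql(data_source=..., query=...)`, which requires a provisioned Azure database. As a locally runnable analogue of that query-then-inspect pattern, here is a sketch using the standard-library `sqlite3` module as a stand-in (table name and rows are illustrative, not the notebook's data):

```python
# Local stand-in for the notebook's read_sql pattern: issue a query against a
# database and inspect the first rows, here with sqlite3 instead of Azure SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE team (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO team VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

# analogous to dprep.read_sql(...).head(5)
rows = conn.execute("SELECT * FROM team").fetchall()
print(rows)  # [(1, 'Alice'), (2, 'Bob')]
```

The Data Prep version differs in that the connection details live in the registered datastore rather than in the code, so the query string is the only data-source-specific part of the call.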