{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/data-ingestion.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Ingestion\n",
"Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data Prep can load many different types of input data. You can use its auto-reading functionality to detect the type of a file, or specify a file type and its parameters directly."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"[Read Lines](#lines)<br>\n",
"[Read CSV](#csv)<br>\n",
"[Read Compressed CSV](#compressed-csv)<br>\n",
"[Read Excel](#excel)<br>\n",
"[Read Fixed Width Files](#fixed-width)<br>\n",
"[Read Parquet](#parquet)<br>\n",
"[Read Part Files Using Globbing](#globbing)<br>\n",
"[Read JSON](#json)<br>\n",
"[Read SQL](#sql)<br>\n",
"[Read PostgreSQL](#postgresql)<br>\n",
"[Read From Azure Blob](#azure-blob)<br>\n",
"[Read From ADLS](#adls)<br>\n",
"[Read Pandas DataFrame](#pandas-df)<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"lines\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read Lines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the simplest ways to read data using Data Prep is to just read it as text lines."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.read_lines(path='../data/crime.txt')\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With ingestion done, you can go ahead and start prepping the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = dflow.to_pandas_dataframe()\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"csv\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read CSV"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When reading delimited files, the only required parameter is `path`. Other parameters (e.g. separator, encoding, whether to use headers, etc.) are available to modify default behavior.\n",
"In this case, you can read a file by specifying only its location, then retrieve the first 5 rows to evaluate the result."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_duplicate_headers = dprep.read_csv(path='../data/crime_duplicate_headers.csv')\n",
"dflow_duplicate_headers.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the result, you can see that the delimiter and encoding were correctly detected, and column headers were promoted. However, the first data row appears to be a duplicate of the column headers. The `skip_rows` parameter specifies the number of lines to skip from the files being read; you can use it to filter out the duplicate line."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_skip_headers = dprep.read_csv(path='../data/crime_duplicate_headers.csv', skip_rows=1)\n",
"dflow_skip_headers.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now the data set contains the correct headers and the extraneous row has been skipped by `read_csv`. Next, look at the data types of the columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_skip_headers.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unfortunately, all of the columns came back as strings. This is because, by default, Data Prep will not change the type of the data. Since the data source is a text file, all values are kept as strings. In this case, however, numeric columns should be parsed as numbers. To do this, set the `infer_column_types` parameter to `True`, which will trigger type inference to be performed.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_inferred_types = dprep.read_csv(path='../data/crime_duplicate_headers.csv',\n",
" skip_rows=1,\n",
" infer_column_types=True)\n",
"dflow_inferred_types.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now several of the columns were correctly detected as numbers and their `FieldType` is Decimal.\n",
"\n",
"With ingestion done, the data set is ready to start preparing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = dflow_inferred_types.to_pandas_dataframe()\n",
"df"
]
},
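{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reads above rely on the `read_csv` defaults for the separator, encoding, and header handling. The cell below is a minimal sketch of overriding those defaults; the file `../data/crime_semicolon.csv` is hypothetical, and the `separator`, `header`, and `encoding` parameter names are taken from the `read_csv` signature, so adjust them for your own data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: '../data/crime_semicolon.csv' is a hypothetical semicolon-delimited\n",
"# file with no header row, used to illustrate overriding read_csv defaults.\n",
"dflow_custom = dprep.read_csv(path='../data/crime_semicolon.csv',\n",
"                              separator=';',\n",
"                              header=dprep.PromoteHeadersMode.NONE,\n",
"                              encoding=dprep.FileEncoding.UTF8)\n",
"dflow_custom.head(5)"
]
},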
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"compressed-csv\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read Compressed CSV"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data Prep can also read delimited files compressed in an archive. The `archive_options` parameter specifies the type of archive and glob pattern of entries in the archive.\n",
"\n",
"At this moment, only reading from ZIP archives is supported."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.dataprep import ArchiveOptions, ArchiveType\n",
"\n",
"dflow = dprep.read_csv(path='../data/crime.zip',\n",
" archive_options=ArchiveOptions(archive_type=ArchiveType.ZIP, entry_glob='*10-20.csv'))\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"excel\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read Excel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data Prep can also load Excel files using the `read_excel` method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_default_sheet = dprep.read_excel(path='../data/crime.xlsx')\n",
"dflow_default_sheet.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, the first sheet of the Excel document has been loaded. You can also target a specific sheet by passing its name explicitly; the next cell reads the second sheet."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_second_sheet = dprep.read_excel(path='../data/crime.xlsx', sheet_name='Sheet2')\n",
"dflow_second_sheet.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the table in the second sheet had headers as well as three empty rows, so you can modify the arguments accordingly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_skipped_rows = dprep.read_excel(path='../data/crime.xlsx',\n",
" sheet_name='Sheet2',\n",
" use_column_headers=True,\n",
" skip_rows=3)\n",
"dflow_skipped_rows.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = dflow_skipped_rows.to_pandas_dataframe()\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"fixed-width\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read Fixed Width Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For fixed-width files, you can specify a list of offsets. The first column is always assumed to start at offset 0."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_fixed_width = dprep.read_fwf('../data/crime.txt', offsets=[8, 17, 26, 33, 56, 58, 74])\n",
"dflow_fixed_width.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the data, you can see that the first row was used as headers. In this particular case, however, there are no headers in the file, so the first row should be treated as data.\n",
"\n",
"Passing in `PromoteHeadersMode.NONE` to the `header` keyword argument avoids header detection and gets the correct data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_no_headers = dprep.read_fwf('../data/crime.txt',\n",
" offsets=[8, 17, 26, 33, 56, 58, 74],\n",
" header=dprep.PromoteHeadersMode.NONE)\n",
"dflow_no_headers.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = dflow_no_headers.to_pandas_dataframe()\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"parquet\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read Parquet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data Prep has two different methods for reading data stored as Parquet.\n",
"\n",
"Currently, both methods require the `pyarrow` package to be installed in your Python environment. This can be done via `pip install azureml-dataprep[parquet]`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read Parquet File"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reading single `.parquet` files, or a folder full of only Parquet files, use `read_parquet_file`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.read_parquet_file('../data/crime.parquet')\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parquet data is explicitly typed so no type inference is needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read Parquet Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Parquet Dataset is different from a Parquet file in that it can be a folder containing a number of Parquet files within a complex directory structure. It may have a hierarchical structure that partitions the data by the value of a column. These more complex forms of Parquet data are commonly produced by Spark/Hive.\n",
"\n",
"For these more complex data sets, you can use `read_parquet_dataset`, which uses pyarrow to handle complex Parquet layouts. This will also handle single Parquet files, though these are better read using `read_parquet_file`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.read_parquet_dataset('../data/parquet_dataset')\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above data was partitioned by the value of the `Arrest` column. It is a boolean column in the original crime0 data set and hence was partitioned by `Arrest=true` and `Arrest=false`.\n",
"\n",
"The directory structure is printed below for clarity."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"for path, dirs, files in os.walk('../data/parquet_dataset'):\n",
" level = path.replace('../data/parquet_dataset', '').count(os.sep)\n",
" indent = ' ' * (level)\n",
" print(indent + os.path.basename(path) + '/')\n",
" fileindent = ' ' * (level + 1)\n",
" for f in files:\n",
" print(fileindent + f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"globbing\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read Part Files Using Globbing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data Prep supports globbing, which allows you to read partitioned files (or any other type of file) in a folder. Globbing is supported by all of the read transformations that take in file paths, such as `read_csv`, `read_lines`, etc. By specifying `../data/crime_partfiles/part-*` as the path, you read all files that start with `part-` in the `crime_partfiles` folder and return them in a single Dataflow. [`auto_read_file`](./auto-read-file.ipynb) will detect the column types of your part files and parse them automatically."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_partfiles = dprep.auto_read_file(path='../data/crime_partfiles/part-*')\n",
"dflow_partfiles.head(5)"
]
},
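{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since globbing works with any of the file readers, the same pattern can also be passed to `read_csv`. The cell below is a quick sketch that assumes the part files are delimited text with headers; without `infer_column_types=True`, the columns would come back as strings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: read the same part files with read_csv instead of auto_read_file.\n",
"# infer_column_types=True is passed so numeric columns are not left as strings.\n",
"dflow_partfiles_csv = dprep.read_csv(path='../data/crime_partfiles/part-*',\n",
"                                     infer_column_types=True)\n",
"dflow_partfiles_csv.head(5)"
]
},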
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"json\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read JSON"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data Prep can also load JSON files."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_json = dprep.read_json(path='../data/json.json')\n",
"dflow_json.head(15)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you use `read_json`, Data Prep will attempt to extract the data from the file into a table. You can also control the file encoding Data Prep uses, as well as whether nested JSON arrays should be flattened.\n",
"\n",
"Choosing to flatten nested arrays can result in a much larger number of rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_flat_arrays = dprep.read_json(path='../data/json.json', flatten_nested_arrays=True)\n",
"dflow_flat_arrays.head(5)"
]
},
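{
"cell_type": "markdown",
"metadata": {},
"source": [
"The encoding can be controlled in a similar way. The cell below is a minimal sketch that assumes `read_json` accepts an `encoding` parameter taking a `FileEncoding` value, as the other file readers do; if your version of the package differs, drop the parameter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: read the same JSON file with an explicit encoding (assumed parameter name).\n",
"dflow_json_utf8 = dprep.read_json(path='../data/json.json',\n",
"                                  encoding=dprep.FileEncoding.UTF8)\n",
"dflow_json_utf8.head(5)"
]
},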
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"sql\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read SQL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data Prep can also fetch data from SQL servers. Currently, only Microsoft SQL Server is supported.\n",
"\n",
"To read data from a SQL server, first create a data source object that contains the connection information."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"secret = dprep.register_secret(value=\"dpr3pTestU$er\", id=\"dprepTestUser\")\n",
"ds = dprep.MSSQLDataSource(server_name=\"dprep-sql-test.database.windows.net\",\n",
" database_name=\"dprep-sql-test\",\n",
" user_name=\"dprepTestUser\",\n",
" password=secret)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the password parameter of `MSSQLDataSource` accepts a Secret object. You can get a Secret object in two ways:\n",
"1. Register the secret and its value with the execution engine.\n",
"2. Create the secret with just an id (useful if the secret value was already registered in the execution environment); a sketch of this option follows below.\n",
"\n",
"Now that you have created a data source object, you can proceed to read data."
]
},
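{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, though, here is a quick sketch of option 2 above: referencing a secret by id only, assuming its value was already registered in the execution environment. It assumes a `create_secret` helper is available in your version of the package; if not, fall back to `register_secret` as shown earlier. The actual read follows in the next cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of option 2: build a Secret from an id alone. 'create_secret' is assumed\n",
"# to exist in this version of azureml.dataprep; the value must already be\n",
"# registered in the execution environment for the read to succeed.\n",
"existing_secret = dprep.create_secret(id=\"dprepTestUser\")\n",
"ds_by_id = dprep.MSSQLDataSource(server_name=\"dprep-sql-test.database.windows.net\",\n",
"                                 database_name=\"dprep-sql-test\",\n",
"                                 user_name=\"dprepTestUser\",\n",
"                                 password=existing_secret)"
]
},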
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.read_sql(ds, \"SELECT top 100 * FROM [SalesLT].[Product]\")\n",
"dflow.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = dflow.to_pandas_dataframe()\n",
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"postgresql\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read PostgreSQL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data Prep can also fetch data from Azure PostgreSQL servers.\n",
"\n",
"To read data from a PostgreSQL server, first create a data source object that contains the connection information."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"secret = dprep.register_secret(value=\"dpr3pTestU$er\", id=\"dprepPostgresqlUser\")\n",
"ds = dprep.PostgreSQLDataSource(server_name=\"dprep-postgresql-test.postgres.database.azure.com\",\n",
" database_name=\"dprep-postgresql-testdb\",\n",
" user_name=\"dprepPostgresqlReadOnlyUser@dprep-postgresql-test\",\n",
" password=secret)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the password parameter of `PostgreSQLDataSource` accepts a Secret object as well.\n",
"Now that you have created a PostgreSQL data source object, you can proceed to read data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.read_postgresql(ds, \"SELECT * FROM public.people\")\n",
"dflow.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"azure-blob\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read from Azure Blob"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can read files stored in a public Azure Blob container by passing the file URL directly. To read a file from a protected Blob, pass a SAS (Shared Access Signature) URI, containing both the resource URI and the SAS token, as the path; a sketch follows the public example below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/testfiles/read_csv_duplicate_headers.csv', skip_rows=1)\n",
"dflow.head(5)"
]
},
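{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a protected Blob, the path is the full SAS URI, i.e. the resource URI with the SAS token appended. The cell below is a sketch only; the account, container, and `<sas-token>` values are placeholders and must be replaced with a real SAS URI before it can run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: reading from a protected Blob via a SAS URI. Replace the placeholders\n",
"# below with a real resource URI and SAS token.\n",
"sas_uri = 'https://<account>.blob.core.windows.net/<container>/crime.csv?<sas-token>'\n",
"dflow_sas = dprep.read_csv(path=sas_uri)\n",
"dflow_sas.head(5)"
]
},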
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"adls\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read from ADLS"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two ways the Data Prep API can acquire the OAuth token needed to access Azure Data Lake Storage (ADLS):\n",
"1. Retrieve the access token from a recent [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) login session.\n",
"2. Use a ServicePrincipal (SP) and a certificate as a secret."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using Access Token from a recent Azure CLI session"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On your local machine, run the following command:\n",
"```\n",
"az login\n",
"```\n",
"If your user account is a member of more than one Azure tenant, you need to specify the tenant, either in the AAD URL hostname form '<your_domain>.onmicrosoft.com' or as the tenantId GUID. The latter can be retrieved as follows:\n",
"```\n",
"az account show --query tenantId\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"from azureml.dataprep.api.datasources import DataLakeDataSource\n",
"\n",
"dflow = dprep.read_csv(path=DataLakeDataSource(path='adl://dpreptestfiles.azuredatalakestore.net/farmers-markets.csv', tenant='microsoft.onmicrosoft.com'))\n",
"head = dflow.head(5)\n",
"head\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a ServicePrincipal via Azure CLI"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A ServicePrincipal and the corresponding certificate can be created via [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest).\n",
"This particular SP is configured as Reader, with its scope reduced to just the ADLS account 'dpreptestfiles'.\n",
"```\n",
"az account set --subscription \"Data Wrangling development\"\n",
"az ad sp create-for-rbac -n \"SP-ADLS-dpreptestfiles\" --create-cert --role reader --scopes /subscriptions/35f16a99-532a-4a47-9e93-00305f6c40f2/resourceGroups/dpreptestfiles/providers/Microsoft.DataLakeStore/accounts/dpreptestfiles\n",
"```\n",
"This command emits the appId and the path to the certificate file (usually in the home folder). The .crt file contains both the public certificate and the private key in PEM format.\n",
"\n",
"Extract the thumbprint with:\n",
"```\n",
"openssl x509 -in adls-dpreptestfiles.crt -noout -fingerprint\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure ADLS Account for ServicePrincipal"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To configure the ACL for the ADLS filesystem, use the objectId of the user or, here, ServicePrincipal:\n",
"```\n",
"az ad sp show --id \"8dd38f34-1fcb-4ff9-accd-7cd60b757174\" --query objectId\n",
"```\n",
"Configure Read and Execute access for the ADLS file system. Since the underlying HDFS ACL model doesn't support inheritance, folders and files need to be ACL-ed individually.\n",
"```\n",
"az dls fs access set-entry --account dpreptestfiles --acl-spec \"user:e37b9b1f-6a5e-4bee-9def-402b956f4e6f:r-x\" --path /\n",
"az dls fs access set-entry --account dpreptestfiles --acl-spec \"user:e37b9b1f-6a5e-4bee-9def-402b956f4e6f:r--\" --path /farmers-markets.csv\n",
"```\n",
"\n",
"References:\n",
"- [az ad sp](https://docs.microsoft.com/en-us/cli/azure/ad/sp?view=azure-cli-latest)\n",
"- [az dls fs access](https://docs.microsoft.com/en-us/cli/azure/dls/fs/access?view=azure-cli-latest)\n",
"- [ACL model for ADLS](https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/data-lake-store/data-lake-store-access-control.md)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"certThumbprint = 'C2:08:9D:9E:D1:74:FC:EB:E9:7E:63:96:37:1C:13:88:5E:B9:2C:84'\n",
"certificate = ''\n",
"with open('../data/adls-dpreptestfiles.crt', 'rt', encoding='utf-8') as crtFile:\n",
" certificate = crtFile.read()\n",
"\n",
"servicePrincipalAppId = \"8dd38f34-1fcb-4ff9-accd-7cd60b757174\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Acquire an OAuth Access Token"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the adal package (via: `pip install adal`) to create an authentication context on the MSFT tenant and acquire an OAuth access token. Note that for ADLS, the `resource` in the token request must be for 'datalake.azure.net', which is different from most other Azure resources."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import adal\n",
"from azureml.dataprep.api.datasources import DataLakeDataSource\n",
"\n",
"ctx = adal.AuthenticationContext('https://login.microsoftonline.com/microsoft.onmicrosoft.com')\n",
"token = ctx.acquire_token_with_client_certificate('https://datalake.azure.net/', servicePrincipalAppId, certificate, certThumbprint)\n",
"dflow = dprep.read_csv(path = DataLakeDataSource(path='adl://dpreptestfiles.azuredatalakestore.net/crime-spring.csv', accessToken=token['accessToken']))\n",
"dflow.to_pandas_dataframe().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"pandas-df\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read Pandas DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are situations where you may already have some data in the form of a pandas DataFrame.\n",
"The steps taken to get to this DataFrame may be non-trivial or not easy to convert to Data Prep Steps. The `read_pandas_dataframe` reader can take a DataFrame and use it as the data source for a Dataflow.\n",
"\n",
"You can pass in a path to a directory (that doesn't exist yet) for Data Prep to store the contents of the DataFrame; otherwise, a temporary directory will be made in the system's temp folder. The files written to this directory will be named `part-00000` and so on; they are written out in Data Prep's internal row-based file format."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.read_excel(path='../data/crime.xlsx')\n",
"dflow = dflow.drop_columns(columns=['Column1'])\n",
"df = dflow.to_pandas_dataframe()\n",
"df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After loading the data, you can now call `read_pandas_dataframe`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import shutil\n",
"cache_dir = 'dflow_df'\n",
"shutil.rmtree(cache_dir, ignore_errors=True)\n",
"dflow_df = dprep.read_pandas_dataframe(df, cache_dir)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_df.head(5)"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}