{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/writing-data.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Writing Data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is possible to write out the data at any point in a Dataflow. These writes are added as steps to the resulting Dataflow and will be executed every time the Dataflow is executed. Since there are no limitations to how many write steps there are in a pipeline, this makes it easy to write out intermediate results for troubleshooting or to be picked up by other pipelines.\n",
"\n",
"It is important to note that the execution of each write results in a full pull of the data in the Dataflow. For example, a Dataflow with three write steps will read and process every record in the dataset three times."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing to Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data can be written to files in any of our supported locations (Local File System, Azure Blob Storage, and Azure Data Lake Storage). In order to parallelize the write, the data is written to multiple partition files. A sentinel file named SUCCESS is also output once the write has completed. This makes it possible to identify when an intermediate write has completed without having to wait for the whole pipeline to complete.\n",
"\n",
"> When running a Dataflow in Spark, attempting to execute a write to an existing folder will fail. It is important to ensure the folder is empty or use a different target location per execution.\n",
"\n",
"The following file formats are currently supported:\n",
"- Delimited Files (CSV, TSV, etc.)\n",
"- Parquet Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll start by loading data into a Dataflow which will be re-used with different formats."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dprep.auto_read_file('../data/crime.txt')\n",
"dflow = dflow.to_number('Column2')\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Delimited Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we create a dataflow with a write step.\n",
"\n",
"This operation is lazy until we invoke `run_local` (or any operation that forces execution like `to_pandas_dataframe`), only then will we execute the write operation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_write = dflow.write_to_csv(directory_path=dprep.LocalFileOutput('./test_out/'))\n",
"\n",
"dflow_write.run_local()\n",
"\n",
"dflow_written_files = dprep.read_csv('./test_out/part-*')\n",
"dflow_written_files.head(5)"
]
},
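{
"cell_type": "markdown",
"metadata": {},
"source": [
"As described above, the write is split across multiple partition files, and a sentinel file is emitted once the write completes. As a quick sanity check, we can list the contents of the output folder with the standard library. This is just an illustrative sketch; the exact number of partition files and the precise sentinel file name may vary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# List everything the write produced: one or more 'part-*' partition files\n",
"# plus the completion sentinel described above.\n",
"sorted(os.listdir('./test_out/'))"
]
},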
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data we wrote out contains several errors in the numeric columns due to numbers that we were unable to parse. When written out to CSV, these are replaced with the string \"ERROR\" by default. We can parameterize this as part of our write call. In the same vein, it is also possible to set what string to use to represent null values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_write_errors = dflow.write_to_csv(directory_path=dprep.LocalFileOutput('./test_out/'), \n",
" error='BadData',\n",
" na='NA')\n",
"dflow_write_errors.run_local()\n",
"dflow_written = dprep.read_csv('./test_out/part-*')\n",
"dflow_written.head(5)"
]
},
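{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted at the start of this guide, a Dataflow can contain any number of write steps, and each one triggers its own full pull of the data when the Dataflow is executed. The sketch below chains a second `write_to_csv` call onto the Dataflow so the same records are written to two folders; the folder names here are placeholders chosen for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each write_to_csv call appends another write step to the Dataflow.\n",
"# When executed, the data is read and processed once per write step.\n",
"dflow_two_writes = dflow.write_to_csv(directory_path=dprep.LocalFileOutput('./test_out_first/'))\n",
"dflow_two_writes = dflow_two_writes.write_to_csv(directory_path=dprep.LocalFileOutput('./test_out_second/'))\n",
"\n",
"dflow_two_writes.run_local()"
]
},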
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parquet Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar to `write_to_csv`, `write_to_parquet` returns a new Dataflow with a Write Parquet Step which hasn't been executed yet.\n",
"\n",
"Then we run the Dataflow with `run_local`, which executes the write operation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_write_parquet = dflow.write_to_parquet(directory_path=dprep.LocalFileOutput('./test_parquet_out/'),\n",
" error='MiscreantData')\n",
"\n",
"dflow_write_parquet.run_local()\n",
"\n",
"dflow_written_parquet = dprep.read_parquet_file('./test_parquet_out/part-*')\n",
"dflow_written_parquet.head(5)"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}