MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/open-save-dataflows.ipynb
2019-06-26 14:39:09 -04:00


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Opening and Saving Dataflows\n",
"Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you have built a Dataflow, you can save it to a `.dprep` file. Saving persists everything in the Dataflow, including the steps you've added, the examples and programs from by-example steps, and any computed aggregations.\n",
"\n",
"You can also open `.dprep` files to access any Dataflows you have previously persisted."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Open\n",
"\n",
"Use the static `Dataflow.open()` method to load an existing `.dprep` file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"dflow_path = os.path.join(os.getcwd(), '..', 'data', 'crime.dprep')\n",
"print(dflow_path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.dataprep import Dataflow"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = Dataflow.open(dflow_path)\n",
"head = dflow.head(5)\n",
"head"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Edit\n",
"\n",
"After a Dataflow is loaded, it can be edited further as needed. In this example, a filter is added that keeps only the rows whose `Description` column is not equal to `SIMPLE`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.dataprep import col"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.filter(col('Description') != 'SIMPLE')\n",
"head = dflow.head(5)\n",
"head"
]
},
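{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the filter above, the cell below inspects the preview rows. This is a minimal sketch, not part of the original walkthrough: it assumes `head()` returns a pandas DataFrame and that `dflow` is the filtered Dataflow from the previous cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Assumption: head() returns a pandas DataFrame, so Series comparisons apply.\n",
"preview = dflow.head(5)\n",
"# True if none of the previewed rows still has Description == 'SIMPLE'\n",
"(preview['Description'] != 'SIMPLE').all()"
]
},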
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save\n",
"\n",
"Use the `save()` method of the Dataflow class to write out the `.dprep` file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tempfile\n",
"import uuid\n",
"# Build a unique path in the system temp directory using public APIs only\n",
"temp_dir = tempfile.gettempdir()\n",
"temp_file_name = uuid.uuid4().hex\n",
"temp_dflow_path = os.path.join(temp_dir, temp_file_name + '.dprep')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow.save(temp_dflow_path)"
]
},
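{
"cell_type": "markdown",
"metadata": {},
"source": [
"The save step can also be written with a temporary directory that cleans itself up. This is a sketch under assumptions: it reuses `dflow` and `Dataflow` from the cells above, calls only `save()`, `Dataflow.open()`, and `head()` as already shown in this notebook, and the file name `edited.dprep` is a hypothetical placeholder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import tempfile\n",
"\n",
"# TemporaryDirectory removes the directory and its contents on exit\n",
"with tempfile.TemporaryDirectory() as scratch_dir:\n",
"    # 'edited.dprep' is a hypothetical file name chosen for illustration\n",
"    scratch_path = os.path.join(scratch_dir, 'edited.dprep')\n",
"    dflow.save(scratch_path)\n",
"    # Reopen and preview while the file still exists\n",
"    reopened_preview = Dataflow.open(scratch_path).head(5)\n",
"reopened_preview"
]
},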
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Round-trip\n",
"\n",
"Load the edited Dataflow back from the saved file and use it, in this case to produce a pandas DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_to_open = Dataflow.open(temp_dflow_path)\n",
"df = dflow_to_open.to_pandas_dataframe()\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Clean up the temporary .dprep file\n",
"if os.path.isfile(temp_dflow_path):\n",
"    os.remove(temp_dflow_path)"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
}
},
"nbformat": 4,
"nbformat_minor": 2
}