mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-19 17:17:04 -05:00
172 lines
4.3 KiB
Plaintext
172 lines
4.3 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Opening and Saving Dataflows\n",
|
|
"Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
|
|
"Licensed under the MIT License."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Once you have built a Dataflow, you can save it to a `.dprep` file. This persists all of the information in your Dataflow including steps you've added, examples and programs from by-example steps, computed aggregations, etc.\n",
|
|
"\n",
|
|
"You can also open `.dprep` files to access any Dataflows you have previously persisted."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Open\n",
|
|
"\n",
|
|
"Use the `open()` method of the Dataflow class to load existing `.dprep` files."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import os\n",
|
|
"dflow_path = os.path.join(os.getcwd(), '..', 'data', 'crime.dprep')\n",
|
|
"print(dflow_path)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.dataprep import Dataflow"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow = Dataflow.open(dflow_path)\n",
|
|
"head = dflow.head(5)\n",
|
|
"head"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Edit\n",
|
|
"\n",
|
|
"After a Dataflow is loaded, it can be further edited as needed. In this example, a filter is added."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from azureml.dataprep import col"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow = dflow.filter(col('Description') != 'SIMPLE')\n",
|
|
"head = dflow.head(5)\n",
|
|
"head"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Save\n",
|
|
"\n",
|
|
"Use the `save()` method of the Dataflow class to write out the `.dprep` file."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import tempfile\n",
|
|
"temp_dir = tempfile._get_default_tempdir()\n",
|
|
"temp_file_name = next(tempfile._get_candidate_names())\n",
|
|
"temp_dflow_path = os.path.join(temp_dir, temp_file_name + '.dprep')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow.save(temp_dflow_path)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Round-trip\n",
|
|
"\n",
|
|
"This illustrates the ability to load the edited Dataflow back in and use it, in this case to get a pandas DataFrame."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_to_open = Dataflow.open(temp_dflow_path)\n",
|
|
"df = dflow_to_open.to_pandas_dataframe()\n",
|
|
"df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"if os.path.isfile(temp_dflow_path):\n",
|
|
" os.remove(temp_dflow_path)"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "sihhu"
|
|
}
|
|
],
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
} |