mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-21 10:05:09 -05:00
194 lines
5.5 KiB
Plaintext
194 lines
5.5 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Cache\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"A Dataflow can be cached as a file on your disk during a local run by calling `dflow_cached = dflow.cache(directory_path)`. Doing this will run all the steps in the Dataflow, `dflow`, and save the cached data to the specified `directory_path`. The returned Dataflow, `dflow_cached`, has a Caching Step added at the end. Any subsequent runs on on the Dataflow `dflow_cached` will reuse the cached data, and the steps before the Caching Step will not be run again.\n",
|
|
"\n",
|
|
"Caching avoids running transforms multiple times, which can make local runs more efficient. Here are common places to use Caching:\n",
|
|
"- after reading data from remote\n",
|
|
"- after expensive transforms, such as Sort\n",
|
|
"- after transforms that change the shape of data, such as Sampling, Filter and Summarize\n",
|
|
"\n",
|
|
"Caching Step will be ignored during scale-out run invoked by `to_spark_dataframe()`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We will start by reading in a dataset and applying some transforms to the Dataflow."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep\n",
|
|
"dflow = dprep.read_csv(path='../data/crime-spring.csv')\n",
|
|
"dflow = dflow.take_sample(probability=0.2, seed=7)\n",
|
|
"dflow = dflow.sort_asc(columns='Primary Type')\n",
|
|
"dflow = dflow.keep_columns(['ID', 'Case Number', 'Date', 'Primary Type'])\n",
|
|
"dflow.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Next, we will choose a directory to store the cached data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import os\n",
|
|
"from pathlib import Path\n",
|
|
"cache_dir = str(Path(os.getcwd(), 'dataflow-cache'))\n",
|
|
"cache_dir"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We will now call `dflow.cache(directory_path)` to cache the Dataflow to your directory."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_cached = dflow.cache(directory_path=cache_dir)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now we will check steps in the `dflow_cached` to see that all of the previous steps were cached."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"[s.step_type for s in dflow_cached._get_steps()]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We also check the data stored in the cache directory."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"os.listdir(cache_dir)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Running against `dflow_cached` will reuse the cached data and skip running all of the previous steps again."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_cached.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Adding additional steps to `dflow_cached` will also reuse the cache data and skip running the steps prior to the Cache Step."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_cached_take = dflow_cached.take(10)\n",
|
|
"dflow_cached_skip = dflow_cached.skip(10).take(10)\n",
|
|
"\n",
|
|
"df_cached_take = dflow_cached_take.to_pandas_dataframe()\n",
|
|
"df_cached_skip = dflow_cached_skip.to_pandas_dataframe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# shutil.rmtree will then clean up the cached data \n",
|
|
"import shutil\n",
|
|
"shutil.rmtree(path=cache_dir)"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "sihhu"
|
|
}
|
|
],
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.4"
|
|
},
|
|
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
} |