Files
MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/replace-datasource-replace-reference.ipynb

130 lines
4.8 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/replace-datasource-replace-reference.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Replace DataSource Reference\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common practice when performing DataPrep is to build up a script or set of cleaning operations on a smaller example file locally. This is quicker and easier than dealing with large amounts of data initially.\n",
"\n",
"After building a Dataflow that performs the desired steps, it's time to run it against the larger dataset, which may be stored in the cloud, or even locally just in a different file. This is where we can use `Dataflow.replace_datasource` to get a Dataflow identical to the one built on the small data, but referencing the newly specified DataSource."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep\n",
"\n",
"dflow = dprep.read_csv('../data/crime-spring.csv')\n",
"df = dflow.to_pandas_dataframe()\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we have the first 10 rows of a dataset called 'Crime'. The original dataset is over 100MB (admittedly not that large of a dataset but this is just an example).\n",
"\n",
"We'll perform a few cleaning operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_dropped = dflow.drop_columns(['Location', 'Updated On', 'X Coordinate', 'Y Coordinate', 'Description'])\n",
"sctb = dflow_dropped.builders.set_column_types()\n",
"sctb.learn(inference_arguments=dprep.InferenceArguments(day_first=False))\n",
"dflow_typed = sctb.to_dataflow()\n",
"dflow_typed.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a Dataflow with all our desired steps, we're ready to run against the 'full' dataset stored in Azure Blob.\n",
"All we need to do is pass the BlobDataSource into `replace_datasource` and we'll get back an identical Dataflow with the new DataSource substituted in."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_replaced = dflow_typed.replace_datasource(dprep.BlobDataSource('https://dpreptestfiles.blob.core.windows.net/testfiles/crime0.csv'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"'replaced_dflow' will now pull data from the 168MB (729734 rows) version of Crime0.csv stored in Azure Blob!\n",
"\n",
"NOTE: Dataflows can also be created by referencing a different Dataflow. Instead of using `replace_datasource`, there is a corresponding `replace_reference` method.\n",
"\n",
"We should be careful now since pulling all that data down and putting it in a pandas dataframe isn't an ideal way to inspect the result of our Dataflow. So instead, to see that our steps are being applied to all the new data, we can add a `take_sample` step, which will select records at random (based on a given probability) to be returned.\n",
"\n",
"The probability below takes the ~730000 rows down to a more inspectable ~73, though the number will vary each time `to_pandas_dataframe()` is run, since they are being randomly selected based on the probability."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_random_sample= dflow_replaced.take_sample(probability=0.0001)\n",
"sample = dflow_random_sample.to_pandas_dataframe()\n",
"sample"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}