mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-19 17:17:04 -05:00
218 lines
7.2 KiB
Plaintext
218 lines
7.2 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Sampling and Subsetting\n",
|
|
"Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
|
|
"Licensed under the MIT License."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Once a Dataflow has been created, it is possible to act on only a subset of the records contained in it. This can help when working with very large datasets or when only a portion of the records is truly relevant."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Head\n",
|
|
"\n",
|
|
"The `head` method will take the number of records specified, run them through the transformations in the Dataflow, and then return the result as a Pandas dataframe."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep\n",
|
|
"\n",
|
|
"dflow = dprep.read_csv('../data/crime_duplicate_headers.csv')\n",
|
|
"dflow.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Take\n",
|
|
"\n",
|
|
"The `take` method adds a step to the Dataflow that will keep the number of records specified (counting from the beginning) and drop the rest. Unlike `head`, which does not modify the Dataflow, all operations applied on a Dataflow on which `take` has been applied will affect only the records kept."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_top_five = dflow.take(5)\n",
|
|
"dflow_top_five.to_pandas_dataframe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Skip\n",
|
|
"\n",
|
|
"It is also possible to skip a certain number of records in a Dataflow, such that transformations are only applied after a specific point. Depending on the underlying data source, a Dataflow with a `skip` step might still have to scan through the data in order to skip past the records."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_skip_top_one = dflow_top_five.skip(1)\n",
|
|
"dflow_skip_top_one.to_pandas_dataframe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Take Sample\n",
|
|
"\n",
|
|
"In addition to taking records from the top of the dataset, it's also possible to take a random sample of the dataset. This is done through the `take_sample(probability, seed=None)` method. This method will scan through all of the records available in the Dataflow and include them based on the probability specified. The `seed` parameter is optional. If a seed is not provided, a stable one is generated, ensuring that the results for a specific Dataflow remain consistent. Different calls to `take_sample` will receive different seeds."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_sampled = dflow.take_sample(0.1)\n",
|
|
"dflow_sampled.to_pandas_dataframe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"`skip`, `take`, and `take_sample` can all be combined. With this, we can achieve behaviors like getting a random 10% sample fo the middle N records of a dataset."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"seed = 1\n",
|
|
"dflow_nested_sample = dflow.skip(1).take(5).take_sample(0.5, seed)\n",
|
|
"dflow_nested_sample.to_pandas_dataframe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Take Stratified Sample\n",
|
|
"Besides sampling all by a probability, we also have stratified sampling, provided the strata and strata weights, the probability to sample each stratum with.\n",
|
|
"This is done through the `take_stratified_sample(columns, fractions, seed=None)` method.\n",
|
|
"For all records, we will group each record by the columns specified to stratify, and based on the stratum x weight information in `fractions`, include said record.\n",
|
|
"\n",
|
|
"Seed behavior is same as in `take_sample`.\n",
|
|
"\n",
|
|
"If a stratum is not specified or the record cannot be grouped by said stratum, we default the weight to sample by to 0 (it will not be included).\n",
|
|
"\n",
|
|
"The order of `fractions` must match the order of `columns`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"fractions = {}\n",
|
|
"fractions[('ASSAULT',)] = 0.5\n",
|
|
"fractions[('BATTERY',)] = 0.2\n",
|
|
"fractions[('ARSON',)] = 0.5\n",
|
|
"fractions[('THEFT',)] = 1.0\n",
|
|
"\n",
|
|
"columns = ['Primary Type']\n",
|
|
"\n",
|
|
"single_strata_sample = dflow.take_stratified_sample(columns=columns, fractions = fractions, seed = 42)\n",
|
|
"single_strata_sample.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Stratified sampling on multiple columns is also supported."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"fractions = {}\n",
|
|
"fractions[('ASSAULT', '560')] = 0.5\n",
|
|
"fractions[('BATTERY', '460')] = 0.2\n",
|
|
"fractions[('ARSON', '1020')] = 0.5\n",
|
|
"fractions[('THEFT', '820')] = 1.0\n",
|
|
"\n",
|
|
"columns = ['Primary Type', 'IUCR']\n",
|
|
"\n",
|
|
"multi_strata_sample = dflow.take_stratified_sample(columns=columns, fractions = fractions, seed = 42)\n",
|
|
"multi_strata_sample.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Caching\n",
|
|
"It is usually a good idea to cache the sampled Dataflow for later uses.\n",
|
|
"\n",
|
|
"See [here](cache.ipynb) for more details about caching."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "sihhu"
|
|
}
|
|
],
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.4"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
} |