From 7e492cbeb6057bf4a4d5b1e728fa7b66cc79c2f7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Shan=C3=A9=20Winner?= <43390034+swinner95@users.noreply.github.com> Date: Sun, 28 Jul 2019 00:26:41 -0700 Subject: [PATCH] Delete subsetting-sampling.ipynb --- .../how-to-guides/subsetting-sampling.ipynb | 218 ------------------ 1 file changed, 218 deletions(-) delete mode 100644 how-to-use-azureml/work-with-data/dataprep/how-to-guides/subsetting-sampling.ipynb diff --git a/how-to-use-azureml/work-with-data/dataprep/how-to-guides/subsetting-sampling.ipynb b/how-to-use-azureml/work-with-data/dataprep/how-to-guides/subsetting-sampling.ipynb deleted file mode 100644 index 833ab7e3..00000000 --- a/how-to-use-azureml/work-with-data/dataprep/how-to-guides/subsetting-sampling.ipynb +++ /dev/null @@ -1,218 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/subsetting-sampling.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Sampling and Subsetting\n", - "Copyright (c) Microsoft Corporation. All rights reserved.
\n", - "Licensed under the MIT License." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Once a Dataflow has been created, it is possible to act on only a subset of the records contained in it. This can help when working with very large datasets or when only a portion of the records is truly relevant." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Head\n", - "\n", - "The `head` method will take the number of records specified, run them through the transformations in the Dataflow, and then return the result as a Pandas dataframe." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep\n", - "\n", - "dflow = dprep.read_csv('../data/crime_duplicate_headers.csv')\n", - "dflow.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Take\n", - "\n", - "The `take` method adds a step to the Dataflow that will keep the number of records specified (counting from the beginning) and drop the rest. Unlike `head`, which does not modify the Dataflow, all operations applied on a Dataflow on which `take` has been applied will affect only the records kept." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_top_five = dflow.take(5)\n", - "dflow_top_five.to_pandas_dataframe()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Skip\n", - "\n", - "It is also possible to skip a certain number of records in a Dataflow, such that transformations are only applied after a specific point. Depending on the underlying data source, a Dataflow with a `skip` step might still have to scan through the data in order to skip past the records." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_skip_top_one = dflow_top_five.skip(1)\n", - "dflow_skip_top_one.to_pandas_dataframe()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Take Sample\n", - "\n", - "In addition to taking records from the top of the dataset, it's also possible to take a random sample of the dataset. This is done through the `take_sample(probability, seed=None)` method. This method will scan through all of the records available in the Dataflow and include them based on the probability specified. The `seed` parameter is optional. If a seed is not provided, a stable one is generated, ensuring that the results for a specific Dataflow remain consistent. Different calls to `take_sample` will receive different seeds." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_sampled = dflow.take_sample(0.1)\n", - "dflow_sampled.to_pandas_dataframe()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`skip`, `take`, and `take_sample` can all be combined. With this, we can achieve behaviors such as taking a random 10% sample of the middle N records of a dataset." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "seed = 1\n", - "dflow_nested_sample = dflow.skip(1).take(5).take_sample(0.5, seed)\n", - "dflow_nested_sample.to_pandas_dataframe()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Take Stratified Sample\n", - "In addition to sampling the whole dataset with a single probability, stratified sampling is also supported: you provide the strata and the strata weights, the probability with which to sample each stratum.\n", - "This is done through the `take_stratified_sample(columns, fractions, seed=None)` method.\n", - "Each record is grouped into a stratum by the columns specified, and is then included based on that stratum's weight in `fractions`.\n", - "\n", - "Seed behavior is the same as in `take_sample`.\n", - "\n", - "If a record's stratum is not specified in `fractions`, or the record cannot be grouped into a stratum, its weight defaults to 0 and the record is not included.\n", - "\n", - "The order of `fractions` must match the order of `columns`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "fractions = {}\n", - "fractions[('ASSAULT',)] = 0.5\n", - "fractions[('BATTERY',)] = 0.2\n", - "fractions[('ARSON',)] = 0.5\n", - "fractions[('THEFT',)] = 1.0\n", - "\n", - "columns = ['Primary Type']\n", - "\n", - "single_strata_sample = dflow.take_stratified_sample(columns=columns, fractions=fractions, seed=42)\n", - "single_strata_sample.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Stratified sampling on multiple columns is also supported." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "fractions = {}\n", - "fractions[('ASSAULT', '560')] = 0.5\n", - "fractions[('BATTERY', '460')] = 0.2\n", - "fractions[('ARSON', '1020')] = 0.5\n", - "fractions[('THEFT', '820')] = 1.0\n", - "\n", - "columns = ['Primary Type', 'IUCR']\n", - "\n", - "multi_strata_sample = dflow.take_stratified_sample(columns=columns, fractions=fractions, seed=42)\n", - "multi_strata_sample.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Caching\n", - "It is usually a good idea to cache the sampled Dataflow for later use.\n", - "\n", - "See [here](cache.ipynb) for more details about caching." - ] - } - ], - "metadata": { - "authors": [ - { - "name": "sihhu" - } - ], - "kernelspec": { - "display_name": "Python 3.6", - "language": "python", - "name": "python36" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} \ No newline at end of file
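For readers who want the stratified-sampling behavior described in the deleted notebook without `azureml.dataprep`, here is a minimal sketch of the same semantics in pandas/NumPy. It is an illustration, not the library's implementation: `stratified_sample` is a hypothetical helper, and the data stands in for the crime dataset used in the notebook. Each record's stratum (its tuple of values in the key columns) looks up a probability in `fractions`, with unlisted strata defaulting to 0, as the notebook describes.

```python
import numpy as np
import pandas as pd

def stratified_sample(df: pd.DataFrame, columns, fractions, seed=None):
    """Keep each row with the probability assigned to its stratum.

    A stratum is the tuple of the row's values in `columns`; strata absent
    from `fractions` default to probability 0, mirroring the described
    behavior of take_stratified_sample. Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    # Map each row to its stratum tuple, then to that stratum's probability.
    strata = df[columns].apply(tuple, axis=1)
    probs = strata.map(lambda s: fractions.get(s, 0.0)).to_numpy()
    # Include a row when an independent uniform draw falls below its probability.
    keep = rng.random(len(df)) < probs
    return df[keep]

# Illustrative data standing in for the notebook's crime dataset.
df = pd.DataFrame({
    "Primary Type": ["ASSAULT", "BATTERY", "ARSON", "THEFT", "THEFT"],
})
fractions = {("ASSAULT",): 0.5, ("BATTERY",): 0.2, ("THEFT",): 1.0}
sample = stratified_sample(df, ["Primary Type"], fractions, seed=42)
# Strata with weight 1.0 are always kept; strata missing from `fractions`
# (ARSON here) are always dropped.
```

Multi-column stratification follows the same pattern: pass `["Primary Type", "IUCR"]` and key `fractions` by two-element tuples, just as the notebook's final example does.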