diff --git a/how-to-use-azureml/work-with-data/dataprep/how-to-guides/summarize.ipynb b/how-to-use-azureml/work-with-data/dataprep/how-to-guides/summarize.ipynb deleted file mode 100644 index 3aff2b43..00000000 --- a/how-to-use-azureml/work-with-data/dataprep/how-to-guides/summarize.ipynb +++ /dev/null @@ -1,591 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/summarize.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Summarize\n", - "Copyright (c) Microsoft Corporation. All rights reserved.
\n", - "Licensed under the MIT License.
\n", - "\n", - "Azure ML Data Prep can help summarize your data by providing you a synopsis based on aggregates over specific columns.\n", - "\n", - "## Table of Contents\n", - "[Overview](#overview)
\n", - "[Summmary Functions](#summary)
\n", - "* [SummaryFunction.MIN](#min)
\n", - "* [SummaryFunction.MAX](#max)
\n", - "* [SummaryFunction.MEAN](#mean)
\n", - "* [SummaryFunction.MEDIAN](#median)
\n", - "* [SummaryFunction.VAR](#var)
\n", - "* [SummaryFunction.SD](#sd)
\n", - "* [SummaryFunction.COUNT](#count)
\n", - "* [SummaryFunction.SUM](#sum)
\n", - "* [SummaryFunction.SKEWNESS](#skewness)
\n", - "* [SummaryFunction.KURTOSIS](#kurtosis)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview\n", - "Before we drill down into each aggregate function, let us observe `summarize` end to end.\n", - "\n", - "We will start by reading some data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep\n", - "dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next we count (`SummaryFunction.COUNT`) the number of rows with column ID with non-null values grouped by Primary Type." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_summarize = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='ID',\n", - " summary_column_name='Primary Type ID Counts', \n", - " summary_function=dprep.SummaryFunction.COUNT)],\n", - " group_by_columns=['Primary Type'])\n", - "dflow_summarize.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If we choose to not group by anything, we will instead get a single record over the entire dataset. Here we will get the number of rows that have the column ID with non-null values." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_summarize_nogroup = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='ID',\n", - " summary_column_name='ID Count', \n", - " summary_function=dprep.SummaryFunction.COUNT)])\n", - "dflow_summarize_nogroup.head(1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Conversely, we can group by multiple columns." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_summarize_2group = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='ID',\n", - " summary_column_name='Primary Type & Location Description ID Counts', \n", - " summary_function=dprep.SummaryFunction.COUNT)],\n", - " group_by_columns=['Primary Type', 'Location Description'])\n", - "dflow_summarize_2group.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In a similar vein, we can compute multiple aggregates in a single summary. Each aggregate function is independent and it is possible to aggregate the same column multiple times." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_summarize_multi_agg = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='ID',\n", - " summary_column_name='Primary Type ID Counts', \n", - " summary_function=dprep.SummaryFunction.COUNT),\n", - " dprep.SummaryColumnsValue(\n", - " column_id='ID',\n", - " summary_column_name='Primary Type Min ID', \n", - " summary_function=dprep.SummaryFunction.MIN),\n", - " dprep.SummaryColumnsValue(\n", - " column_id='Date',\n", - " summary_column_name='Primary Type Max Date', \n", - " summary_function=dprep.SummaryFunction.MAX)],\n", - " group_by_columns=['Primary Type'])\n", - "dflow_summarize_multi_agg.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If we wanted this summary data back into our original data set, we can make use of `join_back` and optionally `join_back_columns_prefix` for easy naming distinctions. Summary columns will be added to the end. `group_by_columns` is not necessary for using `join_back`, however the behavior will be more like an append instead of a join." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow_summarize_join = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='ID',\n", - " summary_column_name='Primary Type ID Counts', \n", - " summary_function=dprep.SummaryFunction.COUNT)],\n", - " group_by_columns=['Primary Type'],\n", - " join_back=True,\n", - " join_back_columns_prefix='New_')\n", - "dflow_summarize_join.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Summary Functions\n", - "Here we will go over all the possible aggregates in Data Prep.\n", - "The most up to date set of functions can be found by enumerating the `SummaryFunction` enum." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep\n", - "[x.name for x in dprep.SummaryFunction]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### SummaryFunction.MIN\n", - "Data Prep can aggregate and find the minimum value of a column." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep\n", - "dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow_min = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='Date',\n", - " summary_column_name='Primary Type Min Date', \n", - " summary_function=dprep.SummaryFunction.MIN)],\n", - " group_by_columns=['Primary Type'])\n", - "dflow_min.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### SummaryFunction.MAX\n", - "Data Prep can find the maximum value of a column." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep\n", - "dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow_min = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='Date',\n", - " summary_column_name='Primary Type Max Date', \n", - " summary_function=dprep.SummaryFunction.MAX)],\n", - " group_by_columns=['Primary Type'])\n", - "dflow_min.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### SummaryFunction.MEAN\n", - "Data Prep can find the statistical mean of a column." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep\n", - "dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow_min = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='Latitude',\n", - " summary_column_name='Primary Type Latitude Mean', \n", - " summary_function=dprep.SummaryFunction.MEAN)],\n", - " group_by_columns=['Primary Type'])\n", - "dflow_min.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### SummaryFunction.MEDIAN\n", - "Data Prep can find the median value of a column." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep\n", - "dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow_min = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='Latitude',\n", - " summary_column_name='Primary Type Latitude Median', \n", - " summary_function=dprep.SummaryFunction.MEDIAN)],\n", - " group_by_columns=['Primary Type'])\n", - "dflow_min.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### SummaryFunction.VAR\n", - "Data Prep can find the statistical variance of a column. We will need more than one data point to calculate this, otherwise we will be unable to give results." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep\n", - "dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow_min = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='Latitude',\n", - " summary_column_name='Primary Type Latitude Variance', \n", - " summary_function=dprep.SummaryFunction.VAR)],\n", - " group_by_columns=['Primary Type'])\n", - "dflow_min.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that despite there being two cases of BATTERY, one of them is missing geographical location, thus only CRIMINAL DAMAGE can yield variance information. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### SummaryFunction.SD\n", - "Data Prep can find the standard deviation of a column. We will need more than one data point to calculate this, otherwise we will be unable to give results." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep\n", - "dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow_min = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='Latitude',\n", - " summary_column_name='Primary Type Latitude Standard Deviation', \n", - " summary_function=dprep.SummaryFunction.SD)],\n", - " group_by_columns=['Primary Type'])\n", - "dflow_min.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Similar to when we calculate variance, despite there being two cases of BATTERY, one of them is missing geographical location, thus only CRIMINAL DAMAGE can yield variance information. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### SummaryFunction.COUNT\n", - "Data Prep can count the number of rows that have a column with non-null values." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep\n", - "dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow_min = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='Latitude',\n", - " summary_column_name='Primary Type Latitude Count', \n", - " summary_function=dprep.SummaryFunction.COUNT)],\n", - " group_by_columns=['Primary Type'])\n", - "dflow_min.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that despite there being two cases of BATTERY, one of them is missing geographical location, thus when we group by Primary Type, we only get a count of one for Latitude." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### SummaryFunction.SUM\n", - "Data Prep can aggregate and sum the values of a column. Our dataset does not have many numerical facts, but here we sum IDs grouped by Primary Type." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep\n", - "dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow_min = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='ID',\n", - " summary_column_name='Primary Type ID Sum', \n", - " summary_function=dprep.SummaryFunction.SUM)],\n", - " group_by_columns=['Primary Type'])\n", - "dflow_min.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### SummaryFunction.SKEWNESS\n", - "Data Prep can calculate the skewness of data in a column. We will need more than one data point to calculate this, otherwise we will be unable to give results." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep\n", - "dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow_min = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='Latitude',\n", - " summary_column_name='Primary Type Latitude Skewness', \n", - " summary_function=dprep.SummaryFunction.SKEWNESS)],\n", - " group_by_columns=['Primary Type'])\n", - "dflow_min.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### SummaryFunction.KURTOSIS\n", - "Data Prep can calculate the kurtosis of data in a column. We will need more than one data point to calculate this, otherwise we will be unable to give results." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep\n", - "dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n", - "dflow_min = dflow.summarize(\n", - " summary_columns=[\n", - " dprep.SummaryColumnsValue(\n", - " column_id='Latitude',\n", - " summary_column_name='Primary Type Latitude Kurtosis', \n", - " summary_function=dprep.SummaryFunction.KURTOSIS)],\n", - " group_by_columns=['Primary Type'])\n", - "dflow_min.head(10)" - ] - } - ], - "metadata": { - "authors": [ - { - "name": "sihhu" - } - ], - "kernelspec": { - "display_name": "Python 3.6", - "language": "python", - "name": "python36" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} \ No newline at end of file