mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-21 01:55:07 -05:00
590 lines
19 KiB
Plaintext
590 lines
19 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Summarize\n",
|
|
"\n",
|
|
"Azure ML Data Prep can help summarize your data by providing you a synopsis based on aggregates over specific columns.\n",
|
|
"\n",
|
|
"## Table of Contents\n",
|
|
"[Overview](#overview)<br>\n",
|
|
"[Summmary Functions](#summary)<br>\n",
|
|
"* [SummaryFunction.MIN](#min)<br>\n",
|
|
"* [SummaryFunction.MAX](#max)<br>\n",
|
|
"* [SummaryFunction.MEAN](#mean)<br>\n",
|
|
"* [SummaryFunction.MEDIAN](#median)<br>\n",
|
|
"* [SummaryFunction.VAR](#var)<br>\n",
|
|
"* [SummaryFunction.SD](#sd)<br>\n",
|
|
"* [SummaryFunction.COUNT](#count)<br>\n",
|
|
"* [SummaryFunction.SUM](#sum)<br>\n",
|
|
"* [SummaryFunction.SKEWNESS](#skewness)<br>\n",
|
|
"* [SummaryFunction.KURTOSIS](#kurtosis)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"overview\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Overview\n",
|
|
"Before we drill down into each aggregate function, let us observe `summarize` end to end.\n",
|
|
"\n",
|
|
"We will start by reading some data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep\n",
|
|
"dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Next we count (`SummaryFunction.COUNT`) the number of rows with column ID with non-null values grouped by Primary Type."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_summarize = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='ID',\n",
|
|
" summary_column_name='Primary Type ID Counts', \n",
|
|
" summary_function=dprep.SummaryFunction.COUNT)],\n",
|
|
" group_by_columns=['Primary Type'])\n",
|
|
"dflow_summarize.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"If we choose to not group by anything, we will instead get a single record over the entire dataset. Here we will get the number of rows that have the column ID with non-null values."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_summarize_nogroup = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='ID',\n",
|
|
" summary_column_name='ID Count', \n",
|
|
" summary_function=dprep.SummaryFunction.COUNT)])\n",
|
|
"dflow_summarize_nogroup.head(1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Conversely, we can group by multiple columns."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_summarize_2group = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='ID',\n",
|
|
" summary_column_name='Primary Type & Location Description ID Counts', \n",
|
|
" summary_function=dprep.SummaryFunction.COUNT)],\n",
|
|
" group_by_columns=['Primary Type', 'Location Description'])\n",
|
|
"dflow_summarize_2group.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In a similar vein, we can compute multiple aggregates in a single summary. Each aggregate function is independent and it is possible to aggregate the same column multiple times."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_summarize_multi_agg = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='ID',\n",
|
|
" summary_column_name='Primary Type ID Counts', \n",
|
|
" summary_function=dprep.SummaryFunction.COUNT),\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='ID',\n",
|
|
" summary_column_name='Primary Type Min ID', \n",
|
|
" summary_function=dprep.SummaryFunction.MIN),\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='Date',\n",
|
|
" summary_column_name='Primary Type Max Date', \n",
|
|
" summary_function=dprep.SummaryFunction.MAX)],\n",
|
|
" group_by_columns=['Primary Type'])\n",
|
|
"dflow_summarize_multi_agg.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"If we wanted this summary data back into our original data set, we can make use of `join_back` and optionally `join_back_columns_prefix` for easy naming distinctions. Summary columns will be added to the end. `group_by_columns` is not necessary for using `join_back`, however the behavior will be more like an append instead of a join."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dflow_summarize_join = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='ID',\n",
|
|
" summary_column_name='Primary Type ID Counts', \n",
|
|
" summary_function=dprep.SummaryFunction.COUNT)],\n",
|
|
" group_by_columns=['Primary Type'],\n",
|
|
" join_back=True,\n",
|
|
" join_back_columns_prefix='New_')\n",
|
|
"dflow_summarize_join.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"summary\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Summary Functions\n",
|
|
"Here we will go over all the possible aggregates in Data Prep.\n",
|
|
"The most up to date set of functions can be found by enumerating the `SummaryFunction` enum."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep\n",
|
|
"[x.name for x in dprep.SummaryFunction]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"min\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### SummaryFunction.MIN\n",
|
|
"Data Prep can aggregate and find the minimum value of a column."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep\n",
|
|
"dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow_min = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='Date',\n",
|
|
" summary_column_name='Primary Type Min Date', \n",
|
|
" summary_function=dprep.SummaryFunction.MIN)],\n",
|
|
" group_by_columns=['Primary Type'])\n",
|
|
"dflow_min.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"max\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### SummaryFunction.MAX\n",
|
|
"Data Prep can find the maximum value of a column."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep\n",
|
|
"dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow_min = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='Date',\n",
|
|
" summary_column_name='Primary Type Max Date', \n",
|
|
" summary_function=dprep.SummaryFunction.MAX)],\n",
|
|
" group_by_columns=['Primary Type'])\n",
|
|
"dflow_min.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"mean\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### SummaryFunction.MEAN\n",
|
|
"Data Prep can find the statistical mean of a column."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep\n",
|
|
"dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow_min = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='Latitude',\n",
|
|
" summary_column_name='Primary Type Latitude Mean', \n",
|
|
" summary_function=dprep.SummaryFunction.MEAN)],\n",
|
|
" group_by_columns=['Primary Type'])\n",
|
|
"dflow_min.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"median\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### SummaryFunction.MEDIAN\n",
|
|
"Data Prep can find the median value of a column."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep\n",
|
|
"dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow_min = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='Latitude',\n",
|
|
" summary_column_name='Primary Type Latitude Median', \n",
|
|
" summary_function=dprep.SummaryFunction.MEDIAN)],\n",
|
|
" group_by_columns=['Primary Type'])\n",
|
|
"dflow_min.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"var\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### SummaryFunction.VAR\n",
|
|
"Data Prep can find the statistical variance of a column. We will need more than one data point to calculate this, otherwise we will be unable to give results."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep\n",
|
|
"dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow_min = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='Latitude',\n",
|
|
" summary_column_name='Primary Type Latitude Variance', \n",
|
|
" summary_function=dprep.SummaryFunction.VAR)],\n",
|
|
" group_by_columns=['Primary Type'])\n",
|
|
"dflow_min.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Note that despite there being two cases of BATTERY, one of them is missing geographical location, thus only CRIMINAL DAMAGE can yield variance information. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"sd\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### SummaryFunction.SD\n",
|
|
"Data Prep can find the standard deviation of a column. We will need more than one data point to calculate this, otherwise we will be unable to give results."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep\n",
|
|
"dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow_min = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='Latitude',\n",
|
|
" summary_column_name='Primary Type Latitude Standard Deviation', \n",
|
|
" summary_function=dprep.SummaryFunction.SD)],\n",
|
|
" group_by_columns=['Primary Type'])\n",
|
|
"dflow_min.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Similar to when we calculate variance, despite there being two cases of BATTERY, one of them is missing geographical location, thus only CRIMINAL DAMAGE can yield variance information. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"count\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### SummaryFunction.COUNT\n",
|
|
"Data Prep can count the number of rows that have a column with non-null values."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep\n",
|
|
"dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow_min = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='Latitude',\n",
|
|
" summary_column_name='Primary Type Latitude Count', \n",
|
|
" summary_function=dprep.SummaryFunction.COUNT)],\n",
|
|
" group_by_columns=['Primary Type'])\n",
|
|
"dflow_min.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Note that despite there being two cases of BATTERY, one of them is missing geographical location, thus when we group by Primary Type, we only get a count of one for Latitude."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"sum\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### SummaryFunction.SUM\n",
|
|
"Data Prep can aggregate and sum the values of a column. Our dataset does not have many numerical facts, but here we sum IDs grouped by Primary Type."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep\n",
|
|
"dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow_min = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='ID',\n",
|
|
" summary_column_name='Primary Type ID Sum', \n",
|
|
" summary_function=dprep.SummaryFunction.SUM)],\n",
|
|
" group_by_columns=['Primary Type'])\n",
|
|
"dflow_min.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"skewness\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### SummaryFunction.SKEWNESS\n",
|
|
"Data Prep can calculate the skewness of data in a column. We will need more than one data point to calculate this, otherwise we will be unable to give results."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep\n",
|
|
"dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow_min = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='Latitude',\n",
|
|
" summary_column_name='Primary Type Latitude Skewness', \n",
|
|
" summary_function=dprep.SummaryFunction.SKEWNESS)],\n",
|
|
" group_by_columns=['Primary Type'])\n",
|
|
"dflow_min.head(10)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<a id=\"kurtosis\"></a>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### SummaryFunction.KURTOSIS\n",
|
|
"Data Prep can calculate the kurtosis of data in a column. We will need more than one data point to calculate this, otherwise we will be unable to give results."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import azureml.dataprep as dprep\n",
|
|
"dflow = dprep.auto_read_file(path='../data/crime-dirty.csv')\n",
|
|
"dflow_min = dflow.summarize(\n",
|
|
" summary_columns=[\n",
|
|
" dprep.SummaryColumnsValue(\n",
|
|
" column_id='Latitude',\n",
|
|
" summary_column_name='Primary Type Latitude Kurtosis', \n",
|
|
" summary_function=dprep.SummaryFunction.KURTOSIS)],\n",
|
|
" group_by_columns=['Primary Type'])\n",
|
|
"dflow_min.head(10)"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"authors": [
|
|
{
|
|
"name": "sihhu"
|
|
}
|
|
],
|
|
"kernelspec": {
|
|
"display_name": "Python 3.6",
|
|
"language": "python",
|
|
"name": "python36"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.4"
|
|
},
|
|
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
} |