mirror of
https://github.com/Azure/MachineLearningNotebooks.git
synced 2025-12-19 17:17:04 -05:00
200 lines
5.7 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Profile\n",
    "Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A DataProfile collects summary statistics for each column of the data produced by a Dataflow. You can use a profile to:\n",
    "- Understand the input data.\n",
    "- Determine which columns might need further preparation.\n",
    "- Verify that data preparation operations produced the desired result.\n",
    "\n",
    "`Dataflow.get_profile()` executes the Dataflow, calculates profile information, and returns a newly constructed DataProfile."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import azureml.dataprep as dprep\n",
    "\n",
    "dflow = dprep.auto_read_file('../data/crime-spring.csv')\n",
    "\n",
    "profile = dflow.get_profile()\n",
    "profile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A DataProfile contains a collection of ColumnProfiles, indexed by column name. Each ColumnProfile has attributes for the calculated column statistics. For non-numeric columns, profiles include only basic statistics such as min, max, and error count. For numeric columns, profiles also include statistical moments and estimated quantiles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "profile.columns['Beat']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also extract and filter data from profiles by using list and dict comprehensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "variances = [c.variance for c in profile.columns.values() if c.variance]\n",
    "variances"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "column_types = {c.name: c.type for c in profile.columns.values()}\n",
    "column_types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If a column has fewer than a thousand unique values, its ColumnProfile contains a summary of values with their respective counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "profile.columns['Primary Type'].value_counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Numeric ColumnProfiles include an estimated histogram of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "profile.columns['District'].histogram"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To configure the number of bins in the histogram, pass an integer as the `number_of_histogram_bins` parameter of `Dataflow.get_profile()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "profile_five_bins = dflow.get_profile(number_of_histogram_bins=5)\n",
    "profile_five_bins.columns['District'].histogram"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For columns containing data of mixed types, the ColumnProfile also provides counts of each type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "profile.columns['X Coordinate'].type_counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#TEST CELL: Profile.Compare\n",
    "import azureml.dataprep as dprep\n",
    "import math\n",
    "\n",
    "lhs_dflow = dprep.auto_read_file('../data/crime-spring.csv')\n",
    "lhs_profile = lhs_dflow.get_profile(number_of_histogram_bins=100)\n",
    "rhs_dflow = dprep.auto_read_file('../data/crime-winter.csv')\n",
    "rhs_profile = rhs_dflow.get_profile(number_of_histogram_bins=100)\n",
    "\n",
    "diff = lhs_profile.compare(rhs_profile)\n",
    "\n",
    "expected_col1 = dprep.ColumnProfileDifference()\n",
    "expected_col1.difference_in_count_in_percent = 0\n",
    "expected_col1.difference_in_histograms = 135349.66146244822\n",
    "\n",
    "for actual, expected in zip(diff.column_profile_difference, [expected_col1]):\n",
    "    assert math.isclose(actual.difference_in_count_in_percent, expected.difference_in_count_in_percent)\n",
    "    assert math.isclose(actual.difference_in_histograms, expected.difference_in_histograms)\n",
    "    break\n"
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "sihhu"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "python36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}