MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/data-profile.ipynb
Roope Astala 2d41c00488 version 1.0.39
2019-05-14 16:01:14 -04:00


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Profile\n",
"Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A DataProfile collects summary statistics on each column of the data produced by a Dataflow. This can be used to:\n",
"- Understand the input data.\n",
"- Determine which columns might need further preparation.\n",
"- Verify that data preparation operations produced the desired result.\n",
"\n",
"`Dataflow.get_profile()` executes the Dataflow, calculates profile information, and returns a newly constructed DataProfile."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep\n",
"\n",
"dflow = dprep.auto_read_file('../data/crime-spring.csv')\n",
"\n",
"profile = dflow.get_profile()\n",
"profile"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A DataProfile contains a collection of ColumnProfiles, indexed by column name. Each ColumnProfile has attributes for the calculated column statistics. For non-numeric columns, profiles include only basic statistics like min, max, and error count. For numeric columns, profiles also include statistical moments and estimated quantiles."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"profile.columns['Beat']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also extract and filter data from profiles by using list and dict comprehensions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"variances = [c.variance for c in profile.columns.values() if c.variance]\n",
"variances"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"column_types = {c.name: c.type for c in profile.columns.values()}\n",
"column_types"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If a column has fewer than a thousand unique values, its ColumnProfile contains a summary of values with their respective counts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"profile.columns['Primary Type'].value_counts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numeric ColumnProfiles include an estimated histogram of the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"profile.columns['District'].histogram"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To configure the number of bins in the histogram, you can pass an integer as the `number_of_histogram_bins` parameter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"profile_more_bins = dflow.get_profile(number_of_histogram_bins=5)\n",
"profile_more_bins.columns['District'].histogram"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For columns containing data of mixed types, the ColumnProfile also provides counts of each type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"profile.columns['X Coordinate'].type_counts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#TEST CELL: Profile.Compare\n",
"import azureml.dataprep as dprep\n",
"import math\n",
"\n",
"lhs_dflow = dprep.auto_read_file('../data/crime-spring.csv')\n",
"lhs_profile = lhs_dflow.get_profile(number_of_histogram_bins=100)\n",
"rhs_dflow = dprep.auto_read_file('../data/crime-winter.csv')\n",
"rhs_profile = rhs_dflow.get_profile(number_of_histogram_bins=100)\n",
"\n",
"diff = lhs_profile.compare(rhs_profile)\n",
"\n",
"expected_col1 = dprep.ColumnProfileDifference()\n",
"expected_col1.difference_in_count_in_percent = 0\n",
"expected_col1.difference_in_histograms = 135349.66146244822\n",
"\n",
"for actual, expected in zip(diff.column_profile_difference, [expected_col1]) :\n",
" assert math.isclose(actual.difference_in_count_in_percent, expected.difference_in_count_in_percent)\n",
" assert math.isclose(actual.difference_in_histograms, expected.difference_in_histograms)\n",
" break\n"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}