{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Profile\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A DataProfile collects summary statistics on each column of the data produced by a Dataflow. This can be used to:\n",
"- Understand the input data.\n",
"- Determine which columns might need further preparation.\n",
"- Verify that data preparation operations produced the desired result.\n",
"\n",
"`Dataflow.get_profile()` executes the Dataflow, calculates profile information, and returns a newly constructed DataProfile."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep\n",
"\n",
"dflow = dprep.auto_read_file('../data/crime-spring.csv')\n",
"\n",
"profile = dflow.get_profile()\n",
"profile"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A DataProfile contains a collection of ColumnProfiles, indexed by column name. Each ColumnProfile exposes the calculated column statistics as attributes. For non-numeric columns, profiles include only basic statistics such as min, max, and error count. For numeric columns, profiles also include statistical moments and estimated quantiles."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"profile.columns['Beat']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also extract and filter data from profiles by using list and dict comprehensions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compare against None so that a legitimate variance of 0 is kept.\n",
"variances = [c.variance for c in profile.columns.values() if c.variance is not None]\n",
"variances"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"column_types = {c.name: c.type for c in profile.columns.values()}\n",
"column_types"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If a column has fewer than a thousand unique values, its ColumnProfile contains a summary of values with their respective counts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"profile.columns['Primary Type'].value_counts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numeric ColumnProfiles include an estimated histogram of the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"profile.columns['District'].histogram"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To configure the number of bins in the histogram, pass an integer as the `number_of_histogram_bins` parameter of `Dataflow.get_profile()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"profile_five_bins = dflow.get_profile(number_of_histogram_bins=5)\n",
"profile_five_bins.columns['District'].histogram"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For columns containing data of mixed types, the ColumnProfile also provides counts of each type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"profile.columns['X Coordinate'].type_counts"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
},
"nbformat": 4,
"nbformat_minor": 2
}