{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Profile\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A DataProfile collects summary statistics on each column of the data produced by a Dataflow. This can be used to:\n", "- Understand the input data.\n", "- Determine which columns might need further preparation.\n", "- Verify that data preparation operations produced the desired result.\n", "\n", "`Dataflow.get_profile()` executes the Dataflow, calculates profile information, and returns a newly constructed DataProfile." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import azureml.dataprep as dprep\n", "\n", "dflow = dprep.auto_read_file('../data/crime-spring.csv')\n", "\n", "profile = dflow.get_profile()\n", "profile" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A DataProfile contains a collection of ColumnProfiles, indexed by column name. Each ColumnProfile has attributes for the calculated column statistics. For non-numeric columns, profiles include only basic statistics like min, max, and error count. For numeric columns, profiles also include statistical moments and estimated quantiles." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "profile.columns['Beat']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also extract and filter data from profiles by using list and dict comprehensions." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Skip columns with no variance value (e.g. non-numeric columns), but keep a variance of 0\n", "variances = [c.variance for c in profile.columns.values() if c.variance is not None]\n", "variances" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Map each column name to its detected type\n", "column_types = {c.name: c.type for c in profile.columns.values()}\n", "column_types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a column has fewer than a thousand unique values, its ColumnProfile contains a summary of values with their respective counts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "profile.columns['Primary Type'].value_counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Numeric ColumnProfiles include an estimated histogram of the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "profile.columns['District'].histogram" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To configure the number of bins in the histogram, pass an integer as the `number_of_histogram_bins` parameter of `Dataflow.get_profile()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "profile_custom_bins = dflow.get_profile(number_of_histogram_bins=5)\n", "profile_custom_bins.columns['District'].histogram" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For columns containing data of mixed types, the ColumnProfile also provides counts of each type." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "profile.columns['X Coordinate'].type_counts" ] } ], "metadata": { "authors": [ { "name": "sihhu" } ], "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License." }, "nbformat": 4, "nbformat_minor": 2 }