MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/data-profile.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Data Profile\n",
        "Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A DataProfile collects summary statistics on each column of the data produced by a Dataflow. This can be used to:\n",
        "- Understand the input data.\n",
        "- Determine which columns might need further preparation.\n",
        "- Verify that data preparation operations produced the desired result.\n",
        "\n",
        "`Dataflow.get_profile()` executes the Dataflow, calculates profile information, and returns a newly constructed DataProfile."
      ]
    },
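    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import azureml.dataprep as dprep\n",
        "\n",
        "# Load the sample CSV into a Dataflow.\n",
        "dflow = dprep.auto_read_file('../data/crime-spring.csv')\n",
        "\n",
        "# Execute the Dataflow and compute summary statistics for each column.\n",
        "profile = dflow.get_profile()\n",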
        "profile"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A DataProfile contains a collection of ColumnProfiles, indexed by column name. Each ColumnProfile has attributes for the calculated column statistics. For non-numeric columns, profiles include only basic statistics like min, max, and error count. For numeric columns, profiles also include statistical moments and estimated quantiles."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "profile.columns['Beat']"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You can also extract and filter data from profiles by using list and dict comprehensions."
      ]
    },
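    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Keep every column that reports a variance; `is not None` retains a variance of exactly 0.\n",
        "variances = [c.variance for c in profile.columns.values() if c.variance is not None]\n",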
        "variances"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "column_types = {c.name: c.type for c in profile.columns.values()}\n",
        "column_types"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If a column has fewer than a thousand unique values, its ColumnProfile contains a summary of values with their respective counts."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "profile.columns['Primary Type'].value_counts"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Numeric ColumnProfiles include an estimated histogram of the data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "profile.columns['District'].histogram"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To configure the number of bins in the histogram, pass an integer as the `number_of_histogram_bins` parameter of `Dataflow.get_profile()`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "profile_custom_bins = dflow.get_profile(number_of_histogram_bins=5)\n",
        "profile_custom_bins.columns['District'].histogram"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For columns containing data of mixed types, the ColumnProfile also provides counts of each type."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "profile.columns['X Coordinate'].type_counts"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#TEST CELL: Profile.Compare\n",
        "import azureml.dataprep as dprep\n",
        "import math\n",
        "\n",
        "# Profile the spring and winter datasets with the same histogram resolution.\n",
        "lhs_dflow = dprep.auto_read_file('../data/crime-spring.csv')\n",
        "lhs_profile = lhs_dflow.get_profile(number_of_histogram_bins=100)\n",
        "rhs_dflow = dprep.auto_read_file('../data/crime-winter.csv')\n",
        "rhs_profile = rhs_dflow.get_profile(number_of_histogram_bins=100)\n",
        "\n",
        "diff = lhs_profile.compare(rhs_profile)\n",
        "\n",
        "# Expected difference for the first column.\n",
        "expected_col1 = dprep.ColumnProfileDifference()\n",
        "expected_col1.difference_in_count_in_percent = 0\n",
        "expected_col1.difference_in_histograms = 135349.66146244822\n",
        "\n",
        "# zip stops after the single expected entry, so only the first column difference is checked.\n",
        "for actual, expected in zip(diff.column_profile_difference, [expected_col1]):\n",
        "    assert math.isclose(actual.difference_in_count_in_percent, expected.difference_in_count_in_percent)\n",
        "    assert math.isclose(actual.difference_in_histograms, expected.difference_in_histograms)\n"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "sihhu"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}