diff --git a/work-with-data/dataprep/how-to-guides/data-profile.ipynb b/work-with-data/dataprep/how-to-guides/data-profile.ipynb
deleted file mode 100644
index 97b42ee1..00000000
--- a/work-with-data/dataprep/how-to-guides/data-profile.ipynb
+++ /dev/null
@@ -1,179 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/data-profile.png)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Data Profile\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "A DataProfile collects summary statistics on each column of the data produced by a Dataflow. This can be used to:\n",
-    "- Understand the input data.\n",
-    "- Determine which columns might need further preparation.\n",
-    "- Verify that data preparation operations produced the desired result.\n",
-    "\n",
-    "`Dataflow.get_profile()` executes the Dataflow, calculates profile information, and returns a newly constructed DataProfile."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import azureml.dataprep as dprep\n",
-    "\n",
-    "dflow = dprep.auto_read_file('../data/crime-spring.csv')\n",
-    "\n",
-    "profile = dflow.get_profile()\n",
-    "profile"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "A DataProfile contains a collection of ColumnProfiles, indexed by column name. Each ColumnProfile has attributes for the calculated column statistics. For non-numeric columns, profiles include only basic statistics like min, max, and error count. For numeric columns, profiles also include statistical moments and estimated quantiles."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "profile.columns['Beat']"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "You can also extract and filter data from profiles by using list and dict comprehensions."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "variances = [c.variance for c in profile.columns.values() if c.variance]\n",
-    "variances"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "column_types = {c.name: c.type for c in profile.columns.values()}\n",
-    "column_types"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "If a column has fewer than a thousand unique values, its ColumnProfile contains a summary of values with their respective counts."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "profile.columns['Primary Type'].value_counts"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Numeric ColumnProfiles include an estimated histogram of the data."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "profile.columns['District'].histogram"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "To configure the number of bins in the histogram, you can pass an integer as the `number_of_histogram_bins` parameter."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "profile_more_bins = dflow.get_profile(number_of_histogram_bins=5)\n",
-    "profile_more_bins.columns['District'].histogram"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "For columns containing data of mixed types, the ColumnProfile also provides counts of each type."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "profile.columns['X Coordinate'].type_counts"
-   ]
-  }
- ],
- "metadata": {
-  "authors": [
-   {
-    "name": "sihhu"
-   }
-  ],
-  "kernelspec": {
-   "display_name": "Python 3.6",
-   "language": "python",
-   "name": "python36"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.6.4"
-  },
-  "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
\ No newline at end of file