MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/quantile-transformation.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/quantile-transformation.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Quantile Transformation\n",
        "Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "DataPrep has the ability to perform quantile transformation to a numeric column. This transformation can transform the data into a normal or uniform distribution. Values bigger than the learnt boundaries will simply be clipped to the learnt boundaries when applying quantile transformation.\n",
        "\n",
        "Let's load a sample of the median income of california households in different suburbs from the 1990 census data. From the data profile, we can see that the minimum value and maximum value is 0.9946 and 15 respectively."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import azureml.dataprep as dprep\n",
        "\n",
        "dflow = dprep.read_csv(path='../data/median_income.csv').set_column_types(type_conversions={\n",
        "    'median_income': dprep.TypeConverter(dprep.FieldType.DECIMAL)\n",
        "})\n",
        "dflow.get_profile()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's now apply quantile transformation to `median_income` and see how that affects the data. We will apply quantile transformation twice, one that maps the data to a Uniform(0, 1) distribution, one that maps it to a Normal(0, 1) distribution.\n",
        "\n",
        "From the data profile, we can see that the min and max of the uniform median income is strictly between 0 and 1 and the mean and standard deviation of the normal median income is close to 0 and 1 respectively.\n",
        "\n",
        "*Note: for normal distribution, we will clip the values at the ends as the 0th percentile and the 100th percentile are -Inf and Inf respectively.*"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow = dflow.quantile_transform(source_column='median_income', new_column='median_income_uniform', quantiles_count=5)\n",
        "dflow = dflow.quantile_transform(source_column='median_income', new_column='median_income_normal', \n",
        "                           quantiles_count=5, output_distribution=\"Normal\")\n",
        "dflow.get_profile()"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "sihhu"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}