MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/fuzzy-group.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Fuzzy Grouping\n",
        "\n",
        "Unprepared data often represents the same entity with multiple values; examples include different spellings, varying capitalization, and abbreviations. This is common when working with data gathered from multiple sources or through human input. One way to canonicalize and reconcile these variants is to use Data Prep's `fuzzy_group_column` (also known as \"text clustering\") functionality.\n",
        "\n",
        "Data Prep inspects a column to determine clusters of similar values. A new column is added in which each clustered value is replaced with the canonical value of its cluster, which can significantly reduce the number of distinct values. You can control the degree of similarity required for values to be clustered together, override the canonical form, and define clusters manually if automatic clustering does not give the desired results.\n",
        "\n",
        "Let's explore the capabilities of `fuzzy_group_column` by first reading in a dataset and inspecting it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import azureml.dataprep as dprep"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow = dprep.read_json(path='../data/json.json')\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see above, the column `inspections.business.city` contains several forms of the city name \"San Francisco\".\n",
        "Let's add a column in which these values are replaced by the automatically detected canonical form. To do so, call `fuzzy_group_column()` on an existing Dataflow:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow_clean = dflow.fuzzy_group_column(source_column='inspections.business.city',\n",
        "                                       new_column_name='city_grouped',\n",
        "                                       similarity_threshold=0.8,\n",
        "                                       similarity_score_column_name='similarity_score')\n",
        "dflow_clean.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The arguments `source_column` and `new_column_name` are required, whereas the others are optional.\n",
        "If `similarity_threshold` is provided, it will be used to control the required similarity level for the values to be grouped together.\n",
        "If `similarity_score_column_name` is provided, a second new column is added that shows the similarity score between each original value and its canonical value.\n",
        "\n",
        "In the resulting data set, you can see that all the different variations of representing \"San Francisco\" in the data were normalized to the same string, \"San Francisco\".\n",
        "\n",
        "But what if you want more control over what gets grouped, what doesn't, and what the canonical value should be?\n",
        "\n",
        "To get that control, use the `FuzzyGroupBuilder` class.\n",
        "Let's see what it has to offer:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "builder = dflow.builders.fuzzy_group_column(source_column='inspections.business.city',\n",
        "                                            new_column_name='city_grouped',\n",
        "                                            similarity_threshold=0.8,\n",
        "                                            similarity_score_column_name='similarity_score')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Call learn() to compute the fuzzy groups, then inspect them\n",
        "builder.learn()\n",
        "builder.groups"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Here you can see that `fuzzy_group_column` detected one group with four values that all map to \"San Francisco\" as the canonical value.\n",
        "You can see the effects of changing the similarity threshold next:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "builder.similarity_threshold = 0.9\n",
        "builder.learn()\n",
        "builder.groups"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now that you are using a similarity threshold of `0.9`, two distinct groups of values are generated.\n",
        "\n",
        "Let's tweak some of the detected groups before completing the builder and getting back the Dataflow with the resulting fuzzy grouped column."
      ]
    },
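    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To build intuition for why a stricter threshold can split one cluster into several, the next cell is a minimal, self-contained sketch of threshold-based grouping. It is an illustration only: it uses Python's standard `difflib` ratio and a greedy first-match strategy, which are assumptions made for this example and not Data Prep's actual similarity metric or clustering algorithm. The helper names `toy_similarity` and `toy_groups` and the `sample` values are hypothetical."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import difflib\n",
        "\n",
        "\n",
        "def toy_similarity(a, b):\n",
        "    # Case-insensitive similarity in [0, 1]; an illustrative stand-in,\n",
        "    # not the metric Data Prep uses.\n",
        "    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()\n",
        "\n",
        "\n",
        "def toy_groups(values, threshold):\n",
        "    # Greedy grouping: attach each value to the first group whose\n",
        "    # representative is similar enough, otherwise start a new group.\n",
        "    groups = {}  # representative value -> list of member values\n",
        "    for value in values:\n",
        "        for representative, members in groups.items():\n",
        "            if toy_similarity(value, representative) >= threshold:\n",
        "                members.append(value)\n",
        "                break\n",
        "        else:\n",
        "            groups[value] = [value]\n",
        "    return groups\n",
        "\n",
        "\n",
        "# Hypothetical sample values, not read from the notebook's dataset\n",
        "sample = ['San Francisco', 'SAN FRANCISCO', 'San Franciscco', 'San Fran', 'Oakland']\n",
        "for threshold in (0.7, 0.95):\n",
        "    print(threshold, toy_groups(sample, threshold))"
      ]
    },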
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "builder.similarity_threshold = 0.8\n",
        "builder.learn()\n",
        "groups = builder.groups\n",
        "groups"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# change the canonical value for the first group\n",
        "groups[0]['canonicalValue'] = 'SANFRAN'\n",
        "duplicates = groups[0]['duplicates']\n",
        "# remove the last duplicate value from the cluster\n",
        "duplicates = duplicates[:-1]\n",
        "# assign modified duplicate array back\n",
        "groups[0]['duplicates'] = duplicates\n",
        "# assign modified groups back to builder\n",
        "builder.groups = groups\n",
        "builder.groups"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Here, the canonical value of the single fuzzy group is changed to 'SANFRAN', and 'S.F.' is removed from that group's duplicates list.\n",
        "\n",
        "`builder.groups` gives you a copy of the groups list, which you can modify freely as long as you keep the structure of the objects inside it. Once the list contains the groups you want, assign it back to the builder.\n",
        "\n",
        "Now you can get a Dataflow with the FuzzyGroup step in it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow_clean = builder.to_dataflow()\n",
        "\n",
        "df = dflow_clean.to_pandas_dataframe()\n",
        "df"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "sihhu"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.4"
    },
    "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
  },
  "nbformat": 4,
  "nbformat_minor": 2
}