MachineLearningNotebooks/how-to-use-azureml/work-with-data/dataprep/how-to-guides/assertions.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/assertions.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Assertions\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Frequently, the data we work with while cleaning and preparing data is just a subset of the total data we will need to work with in production. It is also common to be working on a snapshot of a live dataset that is continuously updated and augmented.\n",
        "\n",
        "In these cases, some of the assumptions we make as part of our cleaning might turn out to be false. Columns that originally only contained numbers within a certain range might actually contain a wider range of values in later executions. These errors often result in either broken pipelines or bad data.\n",
        "\n",
        "Azure ML Data Prep supports creating assertions on data, which are evaluated as the pipeline is executed. These assertions enable us to verify that our assumptions on the data continue to be accurate and, when not, to handle failures in a clean way."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To demonstrate, we will load a dataset and then add some assertions based on what we can see in the first few rows."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import auto_read_file\n",
        "\n",
        "dflow = auto_read_file('../data/crime-dirty.csv')\n",
        "dflow.get_profile()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can see there are latitude and longitude columns present in this dataset. By definition, these are constrained to specific ranges of values. We can assert that this is indeed the case so that if any records come through with invalid values, we detect them."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import value\n",
        "\n",
        "dflow = dflow.assert_value('Latitude', (value <= 90) & (value >= -90), error_code='InvalidLatitude')\n",
        "dflow = dflow.assert_value('Longitude', (value <= 180) & (value >= -180), error_code='InvalidLongitude')\n",
        "dflow.keep_columns(['Latitude', 'Longitude']).get_profile()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Any assertion failures are represented as Errors in the resulting dataset. From the profile above, you can see that the Error Count for both of these columns is 1. We can use a filter to retrieve the error and see what value caused the assertion to fail."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import col\n",
        "\n",
        "dflow_error = dflow.filter(col('Latitude').is_error())\n",
        "error = dflow_error.head(10)['Latitude'][0]\n",
        "print(error.originalValue)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Our assertion failed because we were not removing missing values from our data. At this point, we have two options: we can go back and edit our code to avoid this error in the first place or we can resolve it now. In this case, we will just filter these out."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.dataprep import LocalFileOutput\n",
        "dflow_clean = dflow.filter(~dflow['Latitude'].is_error())\n",
        "dflow_clean.get_profile()"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "sihhu"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.4"
    },
    "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
  },
  "nbformat": 4,
  "nbformat_minor": 2
}