MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/custom-python-transforms.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/custom-python-transforms.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Custom Python Transforms\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "There will be scenarios when the easiest thing for you to do is just to write some Python code. This SDK provides three extension points that you can use.\n",
        "\n",
        "1. New Script Column\n",
        "2. New Script Filter\n",
        "3. Transform Partition\n",
        "\n",
        "Each of these are supported in both the scale-up and the scale-out runtime. A key advantage of using these extension points is that you don't need to pull all of the data in order to create a dataframe. Your custom python code will be run just like other transforms, at scale, by partition, and typically in parallel."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Initial data prep"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We start by loading crime data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import azureml.dataprep as dprep\n",
        "col = dprep.col\n",
        "\n",
        "dflow = dprep.read_csv(path='../data/crime-spring.csv')\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We trim the dataset down and keep only the columns we are interested in. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow = dflow.keep_columns(['Case Number','Primary Type', 'Description', 'Latitude', 'Longitude'])\n",
        "dflow = dflow.replace_na(columns=['Latitude', 'Longitude'], custom_na_list='')\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We look for null values using a filter. We found some, so now we'll look at a way to fill these missing values."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow.filter(col('Latitude').is_null()).head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Transform Partition"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We want to replace all null values with a 0, so we decide to use a handy pandas function. This code will be run by partition, not on all of the dataset at a time. This means that on a large dataset, this code may run in parallel as the runtime processes the data partition by partition."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "pt_dflow = dflow\n",
        "dflow = pt_dflow.transform_partition(\"\"\"\n",
        "def transform(df, index):\n",
        "    df['Latitude'].fillna('0',inplace=True)\n",
        "    df['Longitude'].fillna('0',inplace=True)\n",
        "    return df\n",
        "\"\"\")\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Transform Partition With File"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Being able to use any python code to manipulate your data as a pandas DataFrame is extremely useful for complex and specific data operations that DataPrep doesn't handle natively. Though the code isn't very testable unfortunately, it's just sitting inside a string.\n",
        "So to improve code testability and ease of script writing there is another transform_partiton interface that takes the path to a python script which must contain a function matching the 'transform' signature defined above.\n",
        "\n",
        "The `script_path` argument should be a relative path to ensure Dataflow portability. Here `map_func.py` contains the same code as in the previous example."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow = pt_dflow.transform_partition_with_file('../data/map_func.py')\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## New Script Column"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We want to create a new column that has both the latitude and longitude. We can achieve it easily using [Data Prep expression](./add-column-using-expression.ipynb), which is faster in execution. Alternatively, We can do this using Python code by using the `new_script_column()` method on the dataflow. Note that we use custom Python code here for demo purpose only. In practise, you should always use Data Prep native functions as a preferred method, and use custom Python code when the functionality is not available in Data Prep. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow = dflow.new_script_column(new_column_name='coordinates', insert_after='Longitude', script=\"\"\"\n",
        "def newvalue(row):\n",
        "    return '(' + row['Latitude'] + ', ' + row['Longitude'] + ')'\n",
        "\"\"\")\n",
        "dflow.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## New Script Filter"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we want to filter the dataset down to only the crimes that incurred over $300 in loss. We can build a Python expression that returns True if we want to keep the row, and False to drop the row."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dflow = dflow.new_script_filter(\"\"\"\n",
        "def includerow(row):\n",
        "    val = row['Description']\n",
        "    return 'OVER $ 300' in val\n",
        "\"\"\")\n",
        "dflow.head(5)"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "sihhu"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.4"
    },
    "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License."
  },
  "nbformat": 4,
  "nbformat_minor": 2
}