From 7047f7629940584ab8647290739163cd30da35ae Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Shan=C3=A9=20Winner?= <43390034+swinner95@users.noreply.github.com> Date: Wed, 21 Aug 2019 10:10:56 -0700 Subject: [PATCH] Delete custom-python-transforms.ipynb --- .../custom-python-transforms.ipynb | 231 ------------------ 1 file changed, 231 deletions(-) delete mode 100644 work-with-data/dataprep/how-to-guides/custom-python-transforms.ipynb diff --git a/work-with-data/dataprep/how-to-guides/custom-python-transforms.ipynb b/work-with-data/dataprep/how-to-guides/custom-python-transforms.ipynb deleted file mode 100644 index 43a3ca62..00000000 --- a/work-with-data/dataprep/how-to-guides/custom-python-transforms.ipynb +++ /dev/null @@ -1,231 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/dataprep/how-to-guides/custom-python-transforms.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Custom Python Transforms\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "There will be scenarios when the easiest thing for you to do is just to write some Python code. This SDK provides three extension points that you can use.\n", - "\n", - "1. New Script Column\n", - "2. New Script Filter\n", - "3. Transform Partition\n", - "\n", - "Each of these are supported in both the scale-up and the scale-out runtime. A key advantage of using these extension points is that you don't need to pull all of the data in order to create a dataframe. Your custom python code will be run just like other transforms, at scale, by partition, and typically in parallel." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Initial data prep" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We start by loading crime data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import azureml.dataprep as dprep\n", - "col = dprep.col\n", - "\n", - "dflow = dprep.read_csv(path='../data/crime-spring.csv')\n", - "dflow.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We trim the dataset down and keep only the columns we are interested in. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow = dflow.keep_columns(['Case Number','Primary Type', 'Description', 'Latitude', 'Longitude'])\n", - "dflow = dflow.replace_na(columns=['Latitude', 'Longitude'], custom_na_list='')\n", - "dflow.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We look for null values using a filter. We found some, so now we'll look at a way to fill these missing values." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow.filter(col('Latitude').is_null()).head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Transform Partition" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We want to replace all null values with a 0, so we decide to use a handy pandas function. This code will be run by partition, not on all of the dataset at a time. This means that on a large dataset, this code may run in parallel as the runtime processes the data partition by partition." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pt_dflow = dflow\n", - "dflow = pt_dflow.transform_partition(\"\"\"\n", - "def transform(df, index):\n", - " df['Latitude'].fillna('0',inplace=True)\n", - " df['Longitude'].fillna('0',inplace=True)\n", - " return df\n", - "\"\"\")\n", - "dflow.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Transform Partition With File" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Being able to use any python code to manipulate your data as a pandas DataFrame is extremely useful for complex and specific data operations that DataPrep doesn't handle natively. Though the code isn't very testable unfortunately, it's just sitting inside a string.\n", - "So to improve code testability and ease of script writing there is another transform_partiton interface that takes the path to a python script which must contain a function matching the 'transform' signature defined above.\n", - "\n", - "The `script_path` argument should be a relative path to ensure Dataflow portability. Here `map_func.py` contains the same code as in the previous example." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow = pt_dflow.transform_partition_with_file('../data/map_func.py')\n", - "dflow.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## New Script Column" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We want to create a new column that has both the latitude and longitude. We can achieve it easily using [Data Prep expression](./add-column-using-expression.ipynb), which is faster in execution. Alternatively, We can do this using Python code by using the `new_script_column()` method on the dataflow. Note that we use custom Python code here for demo purpose only. In practise, you should always use Data Prep native functions as a preferred method, and use custom Python code when the functionality is not available in Data Prep. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow = dflow.new_script_column(new_column_name='coordinates', insert_after='Longitude', script=\"\"\"\n", - "def newvalue(row):\n", - " return '(' + row['Latitude'] + ', ' + row['Longitude'] + ')'\n", - "\"\"\")\n", - "dflow.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## New Script Filter" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we want to filter the dataset down to only the crimes that incurred over $300 in loss. We can build a Python expression that returns True if we want to keep the row, and False to drop the row." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dflow = dflow.new_script_filter(\"\"\"\n", - "def includerow(row):\n", - " val = row['Description']\n", - " return 'OVER $ 300' in val\n", - "\"\"\")\n", - "dflow.head(5)" - ] - } - ], - "metadata": { - "authors": [ - { - "name": "sihhu" - } - ], - "kernelspec": { - "display_name": "Python 3.6", - "language": "python", - "name": "python36" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.4" - }, - "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License." - }, - "nbformat": 4, - "nbformat_minor": 2 -} \ No newline at end of file