Delete filtering.ipynb

Shané Winner
2019-07-28 00:23:28 -07:00
committed by GitHub
parent 2e245c1691
commit 1f4e4cdda2

@@ -1,222 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Filtering\n",
"Copyright (c) Microsoft Corporation. All rights reserved.<br>\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Azure ML Data Prep can filter out columns with `Dataflow.drop_columns` and rows with `Dataflow.filter`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# initial set up\n",
"import azureml.dataprep as dprep\n",
"from datetime import datetime\n",
"dflow = dprep.read_csv(path='../data/crime-spring.csv')\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To filter columns, use `Dataflow.drop_columns`. This method takes either a list of column names to drop or a `ColumnSelector` for pattern-based selection."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering columns with list of strings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, `drop_columns` takes a list of strings. Each string must exactly match the name of a column to drop."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.drop_columns(['ID', 'Location Description', 'Ward', 'Community Area', 'FBI Code'])\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering columns with regex"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, a `ColumnSelector` can be used to drop columns whose names match a regular expression. In this example, we drop all the columns that match the expression `Column*|.*longitud|.*latitude`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ColumnSelector arguments here: the pattern, use_regex=True, ignore_case=True\n",
"dflow = dflow.drop_columns(dprep.ColumnSelector('Column*|.*longitud|.*latitude', True, True))\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To filter rows, use `Dataflow.filter`. This method takes an `Expression` as an argument and returns a new dataflow containing only the rows for which the expression evaluates to `True`. Expressions are built by indexing the `Dataflow` with a column name (`dataflow['myColumn']`) and applying the standard comparison operators (`>`, `<`, `>=`, `<=`, `==`, `!=`)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering rows with simple expressions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build an expression, index into the Dataflow with a column name as a string (`dataflow['column_name']`) and combine the result with one of the standard comparison operators (`>`, `<`, `>=`, `<=`, `==`, `!=`), for example `dataflow['District'] > 9`. Then pass the expression to `Dataflow.filter`.\n",
"\n",
"In this example, `dataflow.filter(dataflow['District'] > 9)` returns a new dataflow with the rows in which the value of \"District\" is greater than 9.\n",
"\n",
"*Note that \"District\" is first converted to numeric, which allows us to build an expression comparing it against other numeric values.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.to_number(['District'])\n",
"dflow = dflow.filter(dflow['District'] > 9)\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering rows with complex expressions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To filter using complex expressions, combine one or more simple expressions with the operators `&` (and), `|` (or), and `~` (not). In Python, these operators have higher precedence than the comparison operators, so each comparison must be wrapped in parentheses to group the clauses correctly.\n",
"\n",
"In this example, `Dataflow.filter` returns a new dataflow with the rows in which \"Primary Type\" equals 'DECEPTIVE PRACTICE' and \"District\" is greater than or equal to 10."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.to_number(['District'])\n",
"dflow = dflow.filter((dflow['Primary Type'] == 'DECEPTIVE PRACTICE') & (dflow['District'] >= 10))\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also possible to filter rows with a nested expression, built by combining several compound expressions.\n",
"\n",
"*Note that `'Date'` and `'Updated On'` are first converted to datetime, which allows us to build expressions comparing them against other datetime values.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow = dflow.to_datetime(['Date', 'Updated On'], ['%Y-%m-%d %H:%M:%S'])\n",
"dflow = dflow.to_number(['District', 'Y Coordinate'])\n",
"comparison_date = datetime(2016,4,13)\n",
"dflow = dflow.filter(\n",
" ((dflow['Date'] > comparison_date) | (dflow['Updated On'] > comparison_date))\n",
" | ((dflow['Y Coordinate'] > 1900000) & (dflow['District'] > 10.0)))\n",
"dflow.head(5)"
]
}
],
"metadata": {
"authors": [
{
"name": "sihhu"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}