{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Add Column using Expression\n",
"Copyright (c) Microsoft Corporation. All rights reserved.\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With Azure ML Data Prep you can add a new column to data with `Dataflow.add_column` by using a Data Prep expression to calculate the value from existing columns. This is similar to using Python to create a [new script column](./custom-python-transforms.ipynb#New-Script-Column) except the Data Prep expressions are more limited and will execute faster. The expressions used are the same as for [filtering rows](./filtering.ipynb#Filtering-rows) and hence have the same functions and operators available.\n",
"\n",
"Here we add additional columns. First we get input data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.dataprep as dprep"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# loading data\n",
"dflow = dprep.auto_read_file('../data/crime-spring.csv')\n",
"dflow.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### `substring(start, length)`\n",
"Add a new column \"Case Category\" using the `substring(start, length)` expression to extract the prefix from the \"Case Number\" column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"case_category = dflow.add_column(new_column_name='Case Category',\n",
"                                 prior_column='Case Number',\n",
"                                 expression=dflow['Case Number'].substring(0, 2))\n",
"case_category.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### `substring(start)`\n",
"Add a new column \"Case Id\" using the `substring(start)` expression to extract just the number from the \"Case Number\" column, and then convert it to a numeric type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"case_id = dflow.add_column(new_column_name='Case Id',\n",
"                           prior_column='Case Number',\n",
"                           expression=dflow['Case Number'].substring(2))\n",
"case_id = case_id.to_number('Case Id')\n",
"case_id.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### `length()`\n",
"Using the `length()` expression, add a new numeric column \"Length\", which contains the length of the string in \"Primary Type\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_length = dflow.add_column(new_column_name='Length',\n",
"                                prior_column='Primary Type',\n",
"                                expression=dflow['Primary Type'].length())\n",
"dflow_length.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### `to_upper()`\n",
"Using the `to_upper()` expression, add a new string column \"Upper Case\", which contains the string in \"Primary Type\" converted to upper case."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_to_upper = dflow.add_column(new_column_name='Upper Case',\n",
"                                  prior_column='Primary Type',\n",
"                                  expression=dflow['Primary Type'].to_upper())\n",
"dflow_to_upper.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### `to_lower()`\n",
"Using the `to_lower()` expression, add a new string column \"Lower Case\", which contains the string in \"Primary Type\" converted to lower case."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_to_lower = dflow.add_column(new_column_name='Lower Case',\n",
"                                  prior_column='Primary Type',\n",
"                                  expression=dflow['Primary Type'].to_lower())\n",
"dflow_to_lower.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### `RegEx.extract_record()`\n",
"Using the `RegEx.extract_record()` expression, add a new record column \"Stream Date Record\", which contains the named capturing groups in the regex with their matched values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dflow_regex_extract_record = dprep.auto_read_file('../data/stream-path.csv')\n",
"regex = dprep.RegEx('\\/(?