update samples from Release-38 as a part of SDK release

This commit is contained in:
vizhur
2020-02-11 20:00:16 +00:00
parent a09a1a16a7
commit ed91e39d7e
80 changed files with 4469 additions and 102 deletions


@@ -0,0 +1,594 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tabular Time Series Related API Demo with NOAA Weather Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved. <br>\n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, you will learn how to use the Tabular Time Series related API to filter the data by time windows for sample data uploaded to Azure blob storage. \n",
"\n",
"The detailed APIs to be demoed in this script are:\n",
"- Create Tabular Dataset instance\n",
"- Assign fine timestamp column and coarse timestamp column for Tabular Dataset to activate Time Series related APIs\n",
"- Clear fine timestamp column and coarse timestamp column\n",
"- Filter in data before a specific time\n",
"- Filter in data after a specific time\n",
"- Filter in data in a specific time range\n",
"- Filter in data for recent time range\n",
"\n",
"Besides above APIs, you'll also see:\n",
"- Create and load a Workspace\n",
"- Load weather data into Azure blob storage\n",
"- Create and register weather data as a Tabular dataset\n",
"- Re-load Tabular Dataset from your Workspace"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import Dependencies\n",
"\n",
"If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, run the cells below to install the Azure Machine Learning Python SDK and create an Azure ML Workspace that's required for this demo."
]
},
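{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is a minimal install sketch, assuming a clean Python environment; the packages mirror the dependencies in this sample's conda environment file. Uncomment the line if the SDK is not yet installed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# install the SDK packages used by this notebook (uncomment if needed)\n",
"# !pip install --upgrade azureml-sdk azureml-dataprep pandas"
]
},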
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare Environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print out your version of the Azure ML Python SDK. Version 1.0.60 or above is required for TabularDataset with timeseries attribute. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml.core\n",
"azureml.data.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import packages\n",
"import os\n",
"\n",
"import pandas as pd\n",
"\n",
"from calendar import monthrange\n",
"from datetime import datetime, timedelta\n",
"\n",
"from azureml.core import Dataset, Datastore, Workspace, Run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up Configuraton and Create Azure ML Workspace\n",
"\n",
"If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) first if you haven't already to establish your connection to the Azure ML Workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"dstore = ws.get_default_datastore()\n",
"\n",
"dset_name = 'weather-data-florida'\n",
"\n",
"print(ws.name, ws.resource_group, ws.location, ws.subscription_id, dstore.name, sep = '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data to Blob Storage\n",
"\n",
"This demo uses 2019 weather data under within weather-data folder. You can replace this data with your own."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Upload data to blob storage so it can be used as a Dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dstore.upload('weather-data', dset_name, overwrite=True, show_progress=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create & Register Tabular Dataset with time-series trait from Blob\n",
"\n",
"The API on Tabular datasets with time-series trait is specially designed to handle Tabular time-series data and time related operations more efficiently. By registering your time-series dataset, you are publishing your dataset to your workspace so that it is accessible to anyone with the same subscription id. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create Tabular Dataset instance from blob storage datapath.\n",
"\n",
"**TIP:** you can set virtual columns in the partition_format. I.e. if you partition the weather data by state and city, the path can be '/{STATE}/{CITY}/{coarse_time:yyy/MM}/data.parquet'. STATE and CITY would then appear as virtual columns in the dataset, allowing for efficient filtering by these grains. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datastore_path = [(dstore, dset_name + '/*/*/data.parquet')]\n",
"dataset = Dataset.Tabular.from_parquet_files(path=datastore_path, partition_format = dset_name + '/{coarse_time:yyyy/MM}/data.parquet')"
]
},
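{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged sketch of the virtual-column tip above (the state/city layout is hypothetical and not part of this demo's data):\n",
"\n",
"```python\n",
"# STATE and CITY are parsed from the folder structure as virtual columns\n",
"partitioned_path = [(dstore, 'partitioned-weather/*/*/*/*/data.parquet')]\n",
"partitioned_ds = Dataset.Tabular.from_parquet_files(\n",
"    path=partitioned_path,\n",
"    partition_format='partitioned-weather/{STATE}/{CITY}/{coarse_time:yyyy/MM}/data.parquet')\n",
"\n",
"# the virtual columns can then be filtered like any other column, e.g. in pandas\n",
"df = partitioned_ds.to_pandas_dataframe()\n",
"df[(df['STATE'] == 'FL') & (df['CITY'] == 'MIAMI')].head()\n",
"```"
]
},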
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assign fine timestamp column for Tabular Dataset to activate Time Series related APIs. The column to be assigned should be a Date type, otherwise the assigning will fail."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# for this demo, leave out coarse_time so fine_grain_timestamp is used\n",
"tsd = dataset.with_timestamp_columns(fine_grain_timestamp='datetime') # coarse_grain_timestamp='coarse_time')"
]
},
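{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sketch of the failure case, assuming 'version' is a non-datetime column in the weather data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# assigning a non-datetime column as the fine timestamp is expected to raise\n",
"try:\n",
"    dataset.with_timestamp_columns(fine_grain_timestamp='version')\n",
"except Exception as e:\n",
"    print('Expected exception : {}'.format(str(e)))"
]
},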
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Register the dataset for easy access from anywhere in Azure ML and to keep track of versions, lineage. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# register dataset to Workspace\n",
"registered_ds = tsd.register(ws, dset_name, create_new_version=True, description='Data for Tabular Dataset with time-series trait demo.', tags={ 'type': 'TabularDataset' })"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reload the Dataset from Workspace"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get dataset by dataset name\n",
"tsd = Dataset.get_by_name(ws, name=dset_name)\n",
"tsd.to_pandas_dataframe().head(5)"
]
},
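{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the dataset was registered with create_new_version=True, you can also pin a specific registered version instead of taking the latest (a sketch, assuming version 1 exists):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get a specific version of the dataset by name\n",
"tsd_v1 = Dataset.get_by_name(ws, name=dset_name, version=1)\n",
"print(tsd_v1.version)"
]
},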
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filter Data by Time Windows\n",
"\n",
"Once your data has been loaded into the notebook, you can query by time using the time_before(), time_after(), time_between(), and time_recent() functions. You can also choose to drop or keep certain columns. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Before Time Input"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# select data that occurs before a specified date\n",
"tsd2 = tsd.time_before(datetime(2019, 6, 12))\n",
"tsd2.to_pandas_dataframe().tail(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## After Time Input"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# select data that occurs after a specified date\n",
"tsd2 = tsd.time_after(datetime(2019, 5, 30))\n",
"tsd2.to_pandas_dataframe().head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Before & After Time Inputs\n",
"\n",
"You can chain time functions together."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** You must set the coarse_grain_timestamp to None to filter on the fine_grain_timestamp. The below cell will fail unless the second line is uncommented "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# select data that occurs within a given time range\n",
"#tsd = tsd.with_timestamp_columns(fine_grain_timestamp='datetime', coarse_grain_timestamp=None)\n",
"tsd2 = tsd.time_after(datetime(2019, 1, 2)).time_before(datetime(2019, 1, 10))\n",
"tsd2.to_pandas_dataframe().head(5)"
]
},
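{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check which timestamp columns are currently assigned before filtering, you can inspect the timestamp_columns property (also used in the resetting section below):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# shows the (fine_grain_timestamp, coarse_grain_timestamp) assignment\n",
"print(tsd.timestamp_columns)"
]
},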
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Time Range Input"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# another way to select data that occurs within a given time range\n",
"tsd2 = tsd.time_between(start_time=datetime(2019, 1, 31, 23, 59, 59), end_time=datetime(2019, 2, 7))\n",
"tsd2.to_pandas_dataframe().head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Time Recent Input"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This function takes in a datetime.timedelta and returns a dataset containing the data from datetime.now()-timedelta() to datetime.now()."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tsd2 = tsd.time_recent(timedelta(weeks=5, days=0))\n",
"tsd2.to_pandas_dataframe().head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE:** This will return an empty dataframe there is no data within the last 2 days."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tsd2 = tsd.time_recent(timedelta(days=2))\n",
"tsd2.to_pandas_dataframe().tail(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Drop Columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=red>If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset.</font><br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"tsd2 = tsd.drop_columns(columns=['snowDepth', 'version', 'datetime'])\n",
"tsd2.take(5).to_pandas_dataframe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The exception is expected because dataset loses timeseries capabilities to do time travel."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" tsd2.time_before(datetime(2019, 6, 12)).to_pandas_dataframe().tail(5)\n",
"except Exception as e:\n",
" print('Expected exception : {}'.format(str(e)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Drop will return dataset with timeseries capabilities if modify column list to exclude timestamp columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tsd2 = tsd.drop_columns(columns=['snowDepth', 'version', 'upload_date'])\n",
"tsd2.take(5).to_pandas_dataframe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tsd2.time_before(datetime(2019, 6, 12)).to_pandas_dataframe().tail(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Keep Columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font color=red>If a timeseries column is not included, the timeseries capabilities will be dropped for the returned dataset.</font><br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tsd2 = tsd.keep_columns(columns=['snowDepth'], validate=False)\n",
"tsd2.to_pandas_dataframe().tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The exception is expected because dataset loses timeseries capabilities to do time travel."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" tsd2.time_before(datetime(2019, 6, 12)).to_pandas_dataframe().tail(5)\n",
"except Exception as e:\n",
" print('Expected exception : {}'.format(str(e)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Keep will return dataset with timeseries capabilities if modify column list to include timestamp columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tsd2 = tsd.keep_columns(columns=['snowDepth', 'datetime', 'coarse_time'], validate=False)\n",
"tsd2.to_pandas_dataframe().tail()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tsd2.time_before(datetime(2019, 6, 12)).to_pandas_dataframe().tail(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Resetting Timestamp Columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rules for reseting are:\n",
"- You cannot assign 'None' to fine_grain_timestamp while assign a valid column name to coarse_grain_timestamp because coarse_grain_timestamp is optional while fine_grain_timestamp is mandatory for Tabular time series data.\n",
"- If you assign 'None' to fine_grain_timestamp, then both fine_grain_timestamp and coarse_grain_timestamp will all be cleared.\n",
"- If you assign only 'None' to coarse_grain_timestamp, then only coarse_grain_timestamp will be cleared."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illegal clearing, exception is expected.\n",
"try:\n",
" tsd2 = tsd.with_timestamp_columns(fine_grain_timestamp=None, coarse_grain_timestamp='coarse_time')\n",
"except Exception as e:\n",
" print('Cleaning not allowed because {}'.format(str(e)))\n",
"\n",
"# clear both\n",
"tsd2 = tsd.with_timestamp_columns(fine_grain_timestamp=None, coarse_grain_timestamp=None)\n",
"print('after clean both with None/None, timestamp columns are: {}'.format(tsd2.timestamp_columns))\n",
"\n",
"# clear coarse_grain_timestamp only and assign 'datetime' as fine timestamp column\n",
"tsd2 = tsd2.with_timestamp_columns(fine_grain_timestamp='datetime', coarse_grain_timestamp=None)\n",
"print('after clean coarse timestamp column, timestamp columns are: {}'.format(tsd2.timestamp_columns))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/datasets-tutorial/datasets-tutorial.png)"
]
}
],
"metadata": {
"authors": [
{
"name": "ylxiong"
}
],
"category": "tutorial",
"compute": [
"Local"
],
"datasets": [
"NOAA"
],
"deployment": [
"None"
],
"exclude_from_index": false,
"framework": [
"Azure ML"
],
"friendly_name": "Filtering data using Tabular Timeseiries Dataset related API",
"index_order": 1,
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
},
"notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
"star_tag": [
"featured"
],
"tags": [
"Dataset",
"Tabular Timeseries"
],
"task": "Filtering"
},
"nbformat": 4,
"nbformat_minor": 2
}


@@ -0,0 +1,6 @@
name: tabular-timeseries-dataset-filtering
dependencies:
- pip:
- azureml-sdk
- azureml-dataprep
- pandas